FreeUKGen / MyopicVicar

MyopicVicar (short-sighted clergyman!) is an open-source genealogy record database and search engine. It powers the FreeREG database of parish registers, the FreeCEN database of census records, the next version of FreeBMD database of Civil Registration indexes and other Genealogical applications.

21 Document rules for a ucf match #1135

Closed Captainkirkdawson closed 4 years ago

Captainkirkdawson commented 7 years ago

Priority 21 (1 4 10 6) What are the rules for a ucf match? I would have expected the following to be suggested as a match

<SearchName _id: 58c47930231040110c23cbbf, first_name: "james", lastname: "she_{2,3}er", origin: "transcript", role: "g", gender: "m", type: "p">

"name has wildcard" "SHEER" /she.{2,3}er/ "did not add"

Captainkirkdawson commented 7 years ago

Similarly, would a search for GRANT or GRAENT not be expected to generate a ucf match for a record with GRA_NT? It does not.

Sherlock21 commented 7 years ago

Have I missed a line in this thread? How is the search process meant to know that the ? on the end of a name in the dataset should trigger an "all possible combinations of anything like the specified search name" result?

All the ? on a name in the dataset means to me is that the transcriber does not have 100% confidence in their transcription but that GRANT is their best stab, indicated as GRANT?. I would expect the search to just ignore the ? character and rather than only returning GRANT?, to return GRANT (if that was searched for) in the first section of results, and any GRANT? names in the 'Possibles' section.
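The expectation described here (ignore a trailing ? when comparing, list plain names first and queried names under 'Possibles') can be sketched in a few lines of Python. This is only an illustration of the expectation; the function name `split_results` and the list-of-strings input are assumptions, not the MyopicVicar code.

```python
def split_results(record_names, search_name):
    """Split matching names into (exact, possibles).

    Hypothetical helper: a trailing "?" is ignored for comparison, but a
    record that carries one is reported under 'possibles' rather than 'exact'.
    """
    wanted = search_name.casefold()
    exact, possibles = [], []
    for name in record_names:
        # Compare without the transcriber's "?" confidence marker.
        if name.rstrip("?").casefold() != wanted:
            continue
        if name.endswith("?"):
            possibles.append(name)
        else:
            exact.append(name)
    return exact, possibles
```

With the records GRANT, GRANT? and GRAHAM, a search for Grant puts GRANT in the first list and GRANT? in the second.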

Captainkirkdawson commented 7 years ago

Eric, this has nothing to do with the use of the ? that appeared as a grammatical construct in my original post, which I have now removed for clarity! The thread is about what we expect to happen for transcriptions that include ucf. After the latest set of mods by Ben we do as you suggest: ignore the ? in the search but include it in the search result, i.e. a record of GRANT? is retrieved by a search for Grant.

My question was to do with expectation for k*k or Gra_nt or she_{2,3}er.

For k*k I would have expected a search for kik or Kirk or Kluck to return a possible ucf match of the k*k transcription. The latest code generates the ucf for kk and kik but not for kirk or kluck. Clearly the latter 2 should have been shown, as * means 1 or more. But we clearly also generate it for 0. Is that a valid expectation? As far as the spec is concerned it is not, but should we change the spec?

For Gra_nt I would expect Grannt or Graent to show the record and it does, as _ is one character. But unlike * it does not on the 0, i.e. Grant does not suggest that ucf. It is not required to according to spec.

she_{2,3}er does exactly as we would expect from the spec, i.e. she--er or she---er generate a possible match for that record; sheer, she-er and she----er do not. That is the spec, but I am asking about expectation rather than spec.

In summary, _ and _{2,3} work according to spec, but the question is whether that is what people's expectations would be. The * as a ucf is not handled correctly and will be raised as a separate story.
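The rules as stated in this comment can be written down as a small translation from a UCF name to a regular expression, in the same spirit as the /she.{2,3}er/ shown in the debug output above. This is a minimal Python sketch under the assumption that _ is one character, _{m,n} is m to n characters and * is one or more; `ucf_to_regex` and `possible_match` are hypothetical names, not functions from the MyopicVicar source.

```python
import re

def ucf_to_regex(ucf):
    """Translate a UCF name into a case-insensitive regex (assumed rules only)."""
    parts = []
    i = 0
    while i < len(ucf):
        ch = ucf[i]
        if ch == "_":
            parts.append(".")  # one unknown character
            # Pass a following {m,n} or {m} repeat count straight through.
            repeat = re.match(r"\{\d+(,\d+)?\}", ucf[i + 1:])
            if repeat:
                parts.append(repeat.group(0))
                i += len(repeat.group(0))
        elif ch == "*":
            parts.append(".+")  # one or more unknown characters (per the spec)
        else:
            parts.append(re.escape(ch))  # ordinary letter, matched literally
        i += 1
    return re.compile("".join(parts), re.IGNORECASE)

def possible_match(ucf, search_name):
    """True when the whole search name fits the UCF pattern."""
    return ucf_to_regex(ucf).fullmatch(search_name) is not None
```

Under these assumptions `possible_match("she_{2,3}er", "she--er")` is true and `possible_match("she_{2,3}er", "sheer")` is false, which is the spec behaviour described above; whether that matches people's expectations is the open question.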

richpomfret commented 7 years ago

To ask testing group for this?

benwbrum commented 7 years ago

I think this story needs to be a documentation story explaining how the new UCF search works.

Captainkirkdawson commented 7 years ago

Agree on that but there was also the question on expectation. Should * be 0-n (currently 0-1) and should _ be 0-1? I believe we agreed that * should be 0-n.

I also feel that _ should be 0-1

benwbrum commented 7 years ago

I feel that * should be 1-n, though can be persuaded that it should be 0-n. _ should be exactly 1, in my opinion.

We'd want to review the UCF instructions to resolve this.

Sherlock21 commented 7 years ago

In my view as a Transcriber, _ is defined as meaning positively one character only.

And * means anything (number or letter) of any quantity of characters but not none. So how that matches your syntax of 0-n or 0-1 I leave you to translate.
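For anyone translating between the two notations, the positions taken in this thread map onto regex quantifiers roughly as follows. This is a hedged Python sketch: the dictionary keys and the `matches` helper are illustrative names only, and the "lenient" row is the 0-1 / 0-n suggestion raised earlier, not agreed behaviour.

```python
import re

# How each position in the thread would read as a regex fragment.
PROPOSALS = {
    # "_" is exactly one character, "*" is one or more (1-n).
    "strict": {"_": ".", "*": ".+"},
    # "_" is 0-1 characters, "*" is 0-n characters.
    "lenient": {"_": ".?", "*": ".*"},
}

def matches(ucf, name, proposal):
    """Check a search name against a UCF name under one of the proposals."""
    rules = PROPOSALS[proposal]
    pattern = "".join(rules.get(ch, re.escape(ch)) for ch in ucf)
    return re.fullmatch(pattern, name, re.IGNORECASE) is not None
```

Under "strict", Grant does not match GRA_NT and kk does not match k*k; under "lenient" both do. Kirk matches k*k either way.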

benwbrum commented 7 years ago

I have written up a document based on the latest documentation at https://docs.google.com/a/freeukgenealogy.org.uk/document/d/14iIoZtEjfN_CgDUwc6X9qjwUQIxsnHK568QseblTNRc/edit?usp=sharing

And would appreciate a review and discussion of whether or not to post it for transcribers. In addition, I've added wording about UCF to the about this search screen when it's in use.

edickens commented 7 years ago

Ben, I like your approach and the way it could work. My suggestions are to do with the way that we tell people. Instead of saying "Myopicvicar" (a term which I am not keen on as it implies a short sighted or short term planning member of the clergy) say "Server" or "Website". I think that "Simple UCF" (those entries with both types of brackets) should be processed as you suggest, but all the "Wildcard" versions just do not get processed. This makes a clear distinction. The "Wildcard UCF" will only be found when specifying County and Place. Or you could even have an "Advanced Search" option where just the "Wildcard UCF" are displayed when a Wildcard such as "T*" is used in the search parameters. Do you think you should think up a different word for "Wildcard UCF"? I understand the word "Wildcard" to be the Asterisk or Question Mark that you enter into your Search parameters. Agreed that in UCF they have the same function, but they should not be confused. ED

Captainkirkdawson commented 7 years ago

Ben

I agree with EricD that the use of wildcard in 2 very different contexts is extremely confusing.

I also agree with him that we should not use MyopicVicar; no one knows what that is or means.

We should restrict the use of wildcard to its meaning in the search, not within the UCF, where * and _ are UCF special characters.

I think the document would be better going from the simple to the more complex. As it is, it tends to jump around.

K

AlOneill commented 7 years ago

@benwbrum @edickens @SteveBiggs This is what I propose to add to the Transcriber Help (the Researcher Help will take some more thinking!).

Please do not use square brackets, [ ], in any of the Forename or Surname fields, unless you need to use the brackets as part of our UCF.

For example, you may be tempted to enter something like "Willam [sic]" or even "[Willam]", just as you would in a transcription made for your personal use. For the FreeREG database to be easily searchable by a researcher, you need to put "Willam" in the Forename field and then something like "Forename: Willam [sic]" in the Notes field. (Ideally, the comment would go in a Transcriber Notes field, but we do not have this field yet.)

For details of how square brackets and our UCF affect search results, see ... (link to info in Researcher Help — once written!).

edickens commented 7 years ago

Good idea. In fact nothing else should be added to names, for example "Snr." otherwise the search thinks it is a second name. All goes in the notes.

SteveBiggs commented 7 years ago

I was going to say the same thing as Eric - the name fields must only contain the proper name(s) with no title, rank, subscript, etc - these must all go in the Notes.

AlOneill commented 7 years ago

Thanks. The Help already covers the general idea of 'name only', but I will review wording/placement of instructions when I make the updates.

richpomfret commented 6 years ago

@AlOneill Are we happy with this? If so, @benwbrum and @Captainkirkdawson to review possible performance issues prior to implementation.

Captainkirkdawson commented 4 years ago

The rules of the UCF and how we react were first documented by Ben and have been updated in the following document: https://docs.google.com/document/d/14iIoZtEjfN_CgDUwc6X9qjwUQIxsnHK568QseblTNRc/edit#

Captainkirkdawson commented 4 years ago

A test file https://test3.freereg.org.uk/freereg1_csv_files/5e45afbbe9379074c4382daf?locale=en contains a number of different UCF and could be useful in guiding any testing.

PatReynolds commented 4 years ago

@Captainkirkdawson please tell me which parish this file relates to so that I can conduct tests (using the Unique Name feature to identify what I should and shouldn't find, and data on types of records and dates).

Captainkirkdawson commented 4 years ago

@PatReynolds if you follow that link you will see that it takes you to SOMRUNBA (Captainkirk) in Parish Register of St Peter in Runnington of Somerset

Captainkirkdawson commented 4 years ago

Updated document to address comments by @PatReynolds. Also added a section on how dates containing UCF will be treated in searches containing a date range: https://docs.google.com/document/d/14iIoZtEjfN_CgDUwc6X9qjwUQIxsnHK568QseblTNRc/edit#

PatReynolds commented 4 years ago

Thanks, Kirk that is excellent! I got a bit lost in dates, you can tell, but otherwise great. I've suggested changing the language from talking about 'the researcher' to talking to 'you'. And a suggestion on 'nearby places' (if nearby places doesn't work as I think it does, we need to say that 'nearby places' cannot be selected).

AlOneill commented 4 years ago

@Captainkirkdawson @PatReynolds I've made a start. It would help me if the unresolved comments — mainly about Dates — in Kirk's document could be dealt with. Thanks!

Captainkirkdawson commented 4 years ago

@AlOneill Have updated the text in my document. Records with dates that contain UCF characters will not be included in the results if a date range is applied to a search. Records that contain UCF characters will be retrieved in a search without a date range.

SteveBiggs commented 4 years ago

I've made a few suggested changes for clarification and have a question about dates:

Why is UCF not used in a date range search? For example, '162[38]' must be between 1620 and 1630, so why can't such a date range search return it?
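To make the question concrete, a year written with square-bracket alternatives can be expanded into every year it could stand for and then tested against the range. The Python sketch below is an illustration of the idea being asked about, not of what the search currently does (dates containing UCF are excluded when a date range is applied); the function names are hypothetical and only plain digits and [..] alternatives are handled.

```python
import itertools
import re

def candidate_years(ucf_year):
    """Expand e.g. "162[38]" into every year it could represent."""
    # Only plain digits and bracketed digit alternatives are supported here.
    if not re.fullmatch(r"(\d|\[\d+\])+", ucf_year):
        raise ValueError(f"unsupported UCF year: {ucf_year!r}")
    # Each token is either the digits inside [..] or a single literal digit.
    tokens = re.findall(r"\[(\d+)\]|(\d)", ucf_year)
    choices = [alternatives or digit for alternatives, digit in tokens]
    return [int("".join(combo)) for combo in itertools.product(*choices)]

def certainly_in_range(ucf_year, start, end):
    """True when every possible reading falls inside the range."""
    return all(start <= year <= end for year in candidate_years(ucf_year))

def possibly_in_range(ucf_year, start, end):
    """True when at least one possible reading falls inside the range."""
    return any(start <= year <= end for year in candidate_years(ucf_year))
```

Here '162[38]' expands to 1623 and 1628, so it is certainly inside 1620 to 1630 but only possibly inside 1625 to 1630. Which of the two tests a date range search should use would be a design decision.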

AlOneill commented 4 years ago

Thanks @Captainkirkdawson

@all As part of describing the possible misuse of square brackets, I intend to ask researchers to report such problems — this may result in a deluge of error reports, but I don't think we can dodge the issue!

AlOneill commented 4 years ago

As I work on the text it occurs to me that some of the subtleties of UCF may be lost for anyone who relies on a screen-reader. (Punctuation is not voiced, typically.) Will have to test.

AlOneill commented 4 years ago

@Captainkirkdawson In the section on the misuse of square brackets, I am a little puzzled by this example as I thought wildcard searches applied only to surnames —

Is the solution to make it an example about the surname, "*JOHN*" ?

Captainkirkdawson commented 4 years ago

Happy for you to make that change

AlOneill commented 4 years ago

@Captainkirkdawson Ah, just checked that nothing has changed (which it hasn't on t3): there must be 2 letters before a *, so will drop that (surname) example.
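The rule mentioned here, that a wildcard search needs two letters before a *, could be checked along these lines. This is a minimal Python sketch under that single stated assumption; `valid_wildcard_query` is a hypothetical name and the real validation on t3 may apply further conditions.

```python
def valid_wildcard_query(query):
    """True when the query has no "*" or two letters precede the first one."""
    if "*" not in query:
        return True
    # Everything before the first "*" must start with at least two letters.
    prefix = query.split("*", 1)[0]
    return len(prefix) >= 2 and prefix[:2].isalpha()
```

Under this check "JO*" is accepted, while "*JOHN*" and "J*" are rejected, which is why the surname example was dropped.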

AlOneill commented 4 years ago

Draft Help page ready for review.

There is probably room for improvement, but I reckon the essential information is there.

Will create a new issue to check that info and results are accessible for screen-reader users.

Captainkirkdawson commented 4 years ago

The section, entitled "Interpreting symbols in names and dates", sits 3/4 of the way down the Help and is referenced from the sidebar as "Symbols in Your Results", so it will never be read. The point is made that if you search a specific place within a county, then we are able to show you any results that could match what you are looking for: we search initially for exact matches and then for any records containing UCF characters that could also match the search name. That is, a specific place search will now have an extra section with those extra results. You will NOT get them in a county-wide search. This needs to be identified in the paragraphs on Name Variations, with links to this new section (at least in my opinion). As written, those sections simply say use wildcard or soundex. The content itself is fine.
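The two-stage behaviour described in this comment (exact matches first, then an extra section of UCF records that could match, only for a specific place) might be sketched like this in Python. Everything here is an assumption made for illustration: records are (place, surname) tuples, only _ and * are translated, and `search` is not the MyopicVicar search method.

```python
import re

UCF_CHARS = set("_*")  # simplified: only these two UCF characters are handled

def ucf_pattern(name):
    """Assumed rules: "_" is one character, "*" is one or more."""
    return "".join({"_": ".", "*": ".+"}.get(ch, re.escape(ch)) for ch in name)

def search(records, surname, place=None):
    """Return (exact, possible) lists of (place, surname) records."""
    wanted = surname.casefold()
    exact = [r for r in records
             if r[1].casefold() == wanted and place in (None, r[0])]
    possible = []
    if place is not None:  # the extra section is only built for a specific place
        for r in records:
            if r[0] != place or r[1].casefold() == wanted:
                continue
            if UCF_CHARS & set(r[1]) and re.fullmatch(
                    ucf_pattern(r[1]), surname, re.IGNORECASE):
                possible.append(r)
    return exact, possible
```

With a record ("Runnington", "GRA_NT"), a search for Graent in Runnington returns it in the second list, while the same search with no place returns nothing in either list.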

AlOneill commented 4 years ago

@Captainkirkdawson Fair point — I think I understand what you mean! Will review wrt your comments.

AlOneill commented 4 years ago

On reflection, cross-references are also needed for dates. And likely needed for Unique Names listing — I seem to remember that UCF is shown in these lists — but will check.

Moving back to In progress.

AlOneill commented 4 years ago

Cross-referencing added for Names, Dates and Unique Names.

Captainkirkdawson commented 4 years ago

@AlOneill I am happy for this to be finalized and made ready for deployment.

AlOneill commented 4 years ago

Help page now ready for deployment.

Captainkirkdawson commented 4 years ago

Deployed to production on 20 June 2020