Princeton-CDH / pemm-scripts

scripts & tools for the Princeton Ethiopian Miracles of Mary project
Apache License 2.0

As a researcher I want to search on high-confidence incipits so that I can check results that will be used in the sheets incipit lookup tool. #21

Closed rlskoeser closed 4 years ago

rlskoeser commented 4 years ago

dev notes

flask testing docs

rlskoeser commented 4 years ago

@WendyLBelcher when you get a chance, please provide a list of any words you'd like treated as synonyms when you search on the incipits (you mentioned in our last meeting that numbers are listed both as numerals and as words).

WendyLBelcher commented 4 years ago

On it.

thatbudakguy commented 4 years ago

changes to make here as of 3/5 meeting:

WendyLBelcher commented 4 years ago

To explain the above check box: I am testing the incipit tool with Vita's incipits. Here are some things I found. First, the character exchange is working. So, ሐ is, in effect, the same as ኃ, and it found it: አሐው፡ found አኃው. However, it did not turn up any words with the original character, which might mean it doesn't exist, but more likely means that the code is saying something like "if ሐ, find ኃ" rather than what it should say: "if ሐ, find ሐ or ኃ". Possible?

WendyLBelcher commented 4 years ago

Second, the bolding is working now. We get it on more than one word and different words.

WendyLBelcher commented 4 years ago

- [ ] Treat "{...}" as no space at all? (Or, already is?)

To explain, it's amazing how it ignores all irrelevancies. So, dumping the whole incipit in, as is, doesn't harm it (including Latin letters and my accidentally including that first word, which is not part of the string). አሜን፨ ተብህለ፡ ሶበ፡ ፈቀዱ፡ አሐው፡ ከመ፡ ይትጋብኡ፡ ኀበ፡ መ(f. 46ra){lac.} {lac. [4/5 lines]}መ፡ አፍ{n.l. }ቶሙ፨ ወእም{n.l. }ራነ፡ ዘመድ፡ ይ{n.l. f. }፨ ፈራኂተ፡ እግ{n.l. }ዚአብሔር፡ ወመፍ{n.l. }ተ፡ ሰብእ፡ ኄራን፡ However, maybe we can treat as "no space" anything between {}? This is not at all urgent, so feel free to ignore. But, if yes, say, treating እግ{n.l. }ዚአብሔር as እግዚአብሔር

WendyLBelcher commented 4 years ago

Fourth, I'm studying this result. Not that anything is wrong, just trying to understand why certain sentences come up higher than others. I would have thought that the IDs with the first word first would have come up higher in the results. But, does 535 come up highest because ፈቀዱ፡ is a more unique term than ተብህለ፡? If so, that's cool. (Screenshot: testing incipit 3 6)

WendyLBelcher commented 4 years ago

Fifth, re synonyms, maybe it does help to have them. Maybe, in particular, "1 and one" or "፩ and አሐዱ፡", which will be the most common substitution. Or, did you say you can't do it with single characters?

WendyLBelcher commented 4 years ago

Sixth, it looks like the phrase boosting is really working, which is awesome. You can see up to four words in a row clustered. (Screenshot: testing incipit 3 6 b)

WendyLBelcher commented 4 years ago

Seventh, I'm analyzing the boosting of results with incipits that have a search word/phrase twice. To test, I did a search for the two words that probably appear together most in all of the incipits: ለእግዝእትነ፡ ማርያም (OurLady Mary). It yields lots of results, 260 incipits. In reality, one would never search only on these two words; they are too common. So, maybe it is irrelevant to analyse this search, but I noticed the following. The fact that a phrase (or word) appears twice in a Macomber incipit really shouldn't boost the score. If the phrase appears twice in the search bar, yes; but if only once, no. To give an example, if the search is for "OurLady Mary saves the sinner," it should not rate higher the result that has "OurLady saves the nun who called on OurLady Mary." Again, I'm not sure it matters, but just to note this. (Screenshot: testing incipit 3 6 c) And, when I redo the search with one other word, it did put first the "right" one. (Screenshot: testing incipit 3 6 d)

WendyLBelcher commented 4 years ago

Eighth, I need to test with Vita to properly understand what will work, so this will be my last observation. Jeremy truncated some of the incipits, leaving out "Mary", so if we return that word, it will help our searches. But, in this search, why is the score for Macomber 204 so much lower than the three Macomber 19s? 204 actually has more words. (Screenshot: testing incipit 3 6 e)

WendyLBelcher commented 4 years ago

Nine, if it helps, I will explain one of the three incipit searches that didn't quite work. The incipit below should yield a result of Macomber ID 138:

ወኮነ፡ ፩ዲያቆን፡ ውስተ፡ ሀገረ፡ ደሴት፡ ወኮነ፡ ዘማዌ፡ ወብዙኅ፡ ኀጢአቱ፡ ወ(f. 9rb)ባሕቲቱ፡ ኮነ፡ ያፈቅራ፡ ለማርያም፡ እመ፡ ብርሃን፡ ወኮነ፡ ወትረ፡ ይጼሊ፡ እንዘ፡ ይብል፡ በትሑት፡ ልብ፡ በከመ፡ ብእሴ፡ ኃጥእ፡ ሰላም፡

Using too many words or too few words posed an equal problem. In this case, using the first five words yielded the best result. When I dump in:

- the first two words, 138 is the eighth result.
- the first three words, 138 is the fifth result.
- the first four words, 138 is not in the top ten results.
- the first five words, 138 is the first result.
- the first six, seven, eight, or nine words, 138 is the second result.
- the first eleven words, 138 is the third result.
- the first twelve, thirteen, or fourteen words, 138 is the seventh result.
- the first fifteen words, 138 is the eighth result.
- the whole thing, 138 is not in the top ten results.

The same type of problems happened with: ወሀለወት፡ አሐቲ፡ ብእሲት፡ ትነብር፡ ውስተ፡ ቤተ፡ ክርስቲያና፡ ለማርያም፡ ወታፈቅራ፡ ለእግዝእትነ፡ ማርያም፡ በኵሉ፡ ልባ፡ The right ID of 155 was "only" the second result.

And with: ብህለ፡ ሶበ፡ ፈቀዱ፡ አሐው፡ ከመ፡ ይትጋብኡ፡ ኀበ፡ መ(f. 46ra){lac.} {lac. [4/5 lines]}መ፡ አፍ{n.l. }ቶሙ፨ ወእም{n.l. }ራነ፡ ዘመድ፡ ይ{n.l. f. }፨ ፈራኂተ፡ እግ{n.l. }ዚአብሔር፡ ወመፍ{n.l.* }ተ፡ ሰብእ፡ ኄራን፡ The right ID of 185 was the seventh result. Removing bracketed material didn't help.
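The prefix-length experiment described above (searching on the first two words, then three, and so on) could be automated along these lines. This is only a sketch: `search` is a hypothetical callable that takes a query string and returns Macomber IDs ordered by relevance; nothing here reflects the project's actual code.

```python
def rank_of(expected_id, result_ids):
    """Return the 1-based position of expected_id, or None if absent."""
    try:
        return result_ids.index(expected_id) + 1
    except ValueError:
        return None


def ranks_by_prefix(incipit, expected_id, search, top=10):
    """Search on the first n words for every n and record the rank.

    `search` is an assumed stand-in for the real incipit search.
    """
    words = incipit.split()
    ranks = {}
    for n in range(1, len(words) + 1):
        results = search(" ".join(words[:n]))[:top]
        ranks[n] = rank_of(expected_id, results)
    return ranks
```

A `None` in the returned mapping means the expected ID was not in the top `top` results for that prefix length.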

WendyLBelcher commented 4 years ago

Ten, wow! Out of 17 incipits we tested, 11 worked perfectly. That is, dumping in the entire incipit resulted in the first result being right!

Just in case it is useful, here they are: This incipit should yield a result of Macomber ID 139. ወሀሎ፡ ፩፡ መስተፅዕነ፡ ፈረስ፡ ዘስሙ፡ ኒቆዲሞስ፤ ኃጥእ፡ በኵሉ፤ ፍናዊሁ፡ ዓለማዊት፨ ወባሕቱ፤ ጸጋ፤ እግዚኣብሔር፤ ወስእለተ፡ እግዝእትነ፡ ማርያም፡ መርሐቶ፡ ኃበ፡ መድኃኒተ፡ ነፍሱ፡ ወነስሐ፡ በእንተ፡ ኃጢአቱ። When I dump the whole thing in, the first result is 139!

This incipit should yield a result of Macomber ID 145 ወሀሎ፤ ፩ቀሲስ፤ ውስተ፡ ፩፡ መካን፤ ኀበ፡ ሀለዉ፤ ብዙኀ፤ ሕዝብ፨ ወአል(f. 18ra)ቦ፤ ካልአ፡ ዘያአምር፤ ዘአንበለ፡ ቅዳሴሃ፤ ለማርያም፡ ወሠናይ፤ ኂሩቱ፡ ፈድፋደ፤ ለውእቱ፡ ቀሲስ፨ ወባሕቱ፤ ኮነ፤ ብእሲ፡ የዋሕ፨ ወኢየአምር፤ መጻሕፍተ፤ … When I dump the whole thing in, the first result is 145!

This incipit should yield a result of Macomber ID 170. ወሀሎ፡ ፩፡ መነኮስ፡ ውስተ፡ ደብር፡ ዘይትለአክ፡ ለእግዝእትነ፡ ማርያም፡ ተንባሊት፡ ወአሐተ፡ ዕለተ፡ እመዋ(f. 21rb)ዕል፤ ሰትየ፡ ወይነ፤ ወሠክረ፤ ወስእነ፡ ጸሎት፤ ምስለ፡ አሐው፨ ወሶበ፤ ነቅሐ፤ እምንዋሙ፤ ወተንሥኣ፤ ከመ፡ ይሖር፤ ኃበ፡ ቤተ፡ ክርስቲያን፨ When I dump the whole thing in, the first result is 170!

This incipit should yield a result of Macomber ID 425. ወሀለወት፡ አሐቲ፡ ቤተ፡ ክርስቲያን፡ በሀገረ፡ ሳም፡ ዘተሐንፀት፡ በስመ፡ እግዝእትነ፤ ቅድስት፡ ድንግል፤ በክልኡ፡ ማርያም፡ ወባቲ፡ ንዋየ፡ ብዙኃ፨ ወበአ(f. 23rb)ሐቲ፤ እመዋዕል፡ ተማከሩ፡ ፈያት፤ ከመ፡ ይሥርቁ፡ ንዋየ፡ ቤተ፡ ክርስቲያን፨ When I dump the whole thing in, the first result is 425!

This incipit should yield a result of Macomber ID 292. ስሙዑኬ፡ አሕዝበ፡ ክርስቲያን፤ ናይድዕክሙ፤ ዘንተ፡ ተኣምረ፡ ዐቢየ፡ ወመድምመ፨ ዘገብረት፡ በቤተ፡ ክርስቲያና፡ እግዝእትነ፡ ቅድስት፤ ድንግል፡ በክልኤ፡ ማርያም፡ በድብረ፡ ምጥማቅ፨ When I dump the whole thing in, the first result is 292!

This incipit should yield a result of Macomber ID 14. ስምዑ፡ አበውየ፡ ወአኃውየ፡ ከመ፡ ንንግርክሙ፡ ዘንተ፡ ተአምረ፡ ዓቢየ፨ ዘኮነ፡ ለእግዝእትነ፡ ቅድስት፡ ድንግል፡ በክልኤ፡ ማርያም፡ ወላዲተ፡ አምላክ፨ ዘነገሩነ፡ አበው፡ ቅዱሳን፡ ሰማዕትየ፨ እግዚአብሔር፡ ከመ፡ ኢይዌስክ፡ ወኢያነትግ፨ ወይቤሉ፡ ነበረ፡ ሦርያ፡ ፩ብእሲ፡ ለብሐዊ፡ ፈራሄ፡ እግዚአብሔር When I dump the whole thing in, the first result is 14!

This incipit should yield a result of Macomber ID 70 ወሀሎ፡ ፩ሲቀ፡ ጳጳሳት፡ በሀገረ፡ ሮም፨ ዘስሙ፡ ደናስዮስ። ወሶበ፡ ኃጥእዎ፡ ሕዝብ፡ ለብፁዕ፡ ማርቆስ፡ ሖሩ፡ ሀባ፡ ሊቀ፡ ጳጳሳት፡ ወይቤልዎ፡ ኃሠሥናሁ፡ When I dump the whole thing in, the first result is 70!

This incipit should yield a result of Macomber ID 85 ወሀሎ፡ ፩፡ብእሲ፡ ዘስሙ፡ ቢፋሞን፡ በሀገረ፡ አውሴም። ወቢፋሞንሰ፡ ሐይወ፡ እምንእሱ፨ በድንግልና፡ ወንጽሕ፨ ወፈጸመ፡ ገድሎ፡ When I dump the whole thing in, the first result is 85!

This incipit should yield a result of Macomber ID 153 አሜን፨ ወሀሎ፡ ፩ብእሲ፡ ዘስሙ፡ ዘካርያስ፨ ዘሠናይ፡ ላህዩ፡ በውስተ፡ ሀገረ፡ ሮሜ፨ ወቦአ፡ በአሐቲ፡ እመዋዕል፡ ውስተ፡ ቤተ፡ ክርስቲያን፨ ወነጸረ፡ ኀበ፡ ሥዕላ፡ ለማርያም፡ ወአፍቀራ። When I dump the whole thing in, the first result is 153!

This incipit should yield a result of Macomber ID 155 ወኮና፡ ክልኤ፡ አንስት፡ እንዘ፡ የሐውራ፡ ውስተ፡ ቤተ፡ ክርስቲያና፡ ለማርያም፡ ወእንዘ፡ የሐውራ፡ ውስተ፡ ፍኖት፡ ተንሥኡ፡ ላዕሌሆን፡ ፈያት፨ ወነሥኡ፡ ስንቆን፨ When I dump the whole thing in, the first result is 154!

This incipit should yield a result of Macomber ID 311 ወሀሎ፡ ፩፡ ብእሲ፡ እንዘ፡ የሐፅብ፡ አልባሲሁ፨ ወመጽአ፡ ካልእ፡ ብእሲ፡ ወሜጠ፡ ውእተ፡ ማየ፡ ሀበ፡ ካልእ፡ ፍኖት፡ ወይቤሎ፡ When I dump the whole thing in, the first result is 311!

WendyLBelcher commented 4 years ago

In the three remaining cases, the incipit search should result in a Macomber ID with a hyphen, 141-A, 35-A1, 35-A2. However, no matter what, those IDs never showed up in the results at all ever. That is, an ID with a hyphen never came up as any result. Which makes me think that the hyphen is throwing things off?

rlskoeser commented 4 years ago

@WendyLBelcher thank you for all your careful testing and detailed notes. I've gathered my comments and will respond to all your notes and questions here.

... check not "if ሐ, find ኃ" but rather "if ሐ, find ሐ or ኃ" So, ሐ is, in effect, the same as ኃ and it found it: አሐው፡ found አኃው However, it did not turn up any words with the original character, which might mean it doesn't exist...

I checked, and አሐው doesn't occur anywhere in the incipits we have. I get the same exact results when I enter either of those words, so I think it is working as expected. Feel free to test again, especially if you can find an example where we have two different variants. (Or you can manufacture your own test examples when you test #30 !)
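To illustrate why either spelling returns the same results, here is a minimal sketch of character folding applied to both the query and the indexed text. The mapping table and the toy index are assumptions for illustration; the real search does its matching inside Solr, and its equivalence list may be different.

```python
# Hypothetical equivalence table: fold the variant to one canonical form.
HOMOPHONES = str.maketrans({"ኃ": "ሐ"})


def fold(text):
    """Map each variant character to its canonical form."""
    return text.translate(HOMOPHONES)


# Toy data standing in for indexed incipit words.
INDEX = ["አኃው", "ፈቀዱ"]


def search(term):
    """Fold both sides, so ሐ in the query matches ሐ or ኃ in the index."""
    return [word for word in INDEX if fold(word) == fold(term)]
```

Because both sides are folded, the behaviour is "if ሐ, find ሐ or ኃ" rather than a one-way substitution.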

Treat "{...}" as no space at all? (Or, already is?)

The existing search logic may ignore some of this content already; I don't think we should add code logic to strip it out, it seems better to me to have the person using the tool do that if necessary.
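For anyone who does want to clean a transcription before pasting it in, a small helper like the following would do it. It is a sketch only: the two patterns (`{...}` editorial notes and `(f. ...)` folio markers) are inferred from the examples in this thread, not from a project specification.

```python
import re

# Assumed patterns: {...} editorial notes and (f. ...) folio markers.
EDITORIAL = re.compile(r"\{[^}]*\}|\(f\.[^)]*\)")


def strip_editorial(incipit):
    """Remove bracketed notes without leaving a gap, then tidy spaces."""
    cleaned = EDITORIAL.sub("", incipit)
    return re.sub(r"\s+", " ", cleaned).strip()
```

With this, a word split by a note, such as `እግ{n.l. }ዚአብሔር`, is rejoined into a single word before searching.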

Does 535 come up highest because ፈቀዱ፡ is a more unique term than ተብህለ፡? If so, that's cool

It does! Solr relevance score uses TF-IDF (term frequency, inverse document frequency), so matches on rarer words count more. I tested your search terms individually, and on its own ፈቀዱ has just one match while ተብህለ has 54 matches.
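As a rough illustration of the "inverse document frequency" part, here is the textbook formula in Python. This is not Solr's exact scoring, and the collection size is a made-up number; only the match counts (1 and 54) come from the test above.

```python
import math


def idf(total_docs, docs_with_term):
    """Textbook inverse document frequency: rarer terms score higher."""
    return math.log(total_docs / docs_with_term)


TOTAL = 600  # hypothetical collection size, not the real count

rare = idf(TOTAL, 1)     # a term with one match, like ፈቀዱ
common = idf(TOTAL, 54)  # a term with 54 matches, like ተብህለ
```

Whatever the real collection size, `rare` comes out larger than `common`, which is why a match on ፈቀዱ outweighs a match on ተብህለ.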

Don't boost score if word appears twice.

I don't think we can do anything about this. This is the "term frequency" part of TF-IDF: terms that occur more frequently increase the score.
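To make the "term frequency" effect concrete, here is a deliberately simplified sketch that just counts occurrences, using the English example from the testing notes. Real scoring functions typically dampen repeated occurrences rather than counting them linearly, so treat this as an illustration of the direction of the effect, not of Solr's arithmetic.

```python
from collections import Counter


def tf_score(query_terms, doc_terms):
    """Sum the raw frequencies of the query terms in one document."""
    counts = Counter(doc_terms)
    return sum(counts[term] for term in set(query_terms))


query = ["OurLady", "Mary"]
once = "OurLady Mary saves the sinner".split()
twice = "OurLady saves the nun who called on OurLady Mary".split()
```

Here `tf_score(query, twice)` is higher than `tf_score(query, once)` simply because "OurLady" occurs twice in the second text.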

... why is the score for Macomber 204 so much lower than the three Macomber 19s? 204 actually has more words.

Looking at your screenshot, I think the difference in relevance comes from document length: when the matched search terms are otherwise the same, shorter incipits score higher and are listed first.
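A simplified sketch of length normalization shows the idea. The `1 / sqrt(length)` factor is a common textbook choice and is used here as an assumption; the similarity actually configured in Solr may normalize differently.

```python
import math


def length_norm(num_terms):
    """Shorter documents get a larger multiplier (1 / sqrt(length))."""
    return 1.0 / math.sqrt(num_terms)


def score(matches, num_terms):
    """Scale a match count by the document's length normalization."""
    return matches * length_norm(num_terms)
```

With the same number of matching terms, a shorter incipit gets the higher score, for example `score(2, 8)` versus `score(2, 20)`.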

... In the three remaining cases, the incipit search should result in a Macomber ID with a hyphen, 141-A, 35-A1, 35-A2. However, no matter what, those IDs never showed up in the results at all ever.

I checked the test spreadsheet we're working with, and there are no incipits for any of those ids. It looks like it might be a problem with the incipit file you gave me - here's the version I'm using for the conversion/import. If you want to correct them before we run the conversion for real, please correct them on GitHub or let me know if I need to pull changes in from somewhere. Otherwise, you can add them into the spreadsheet later. (Feel free to use these missing incipits to test #14 and #30 !)


Based on your testing notes, I'm not seeing any problems with the search that we need to correct. (Please let me know if you disagree!) It seems like it's actually working quite well for a lot of your test cases! I don't think we can necessarily guarantee that the correct match will be the first result, no matter how much we refine the search logic; our goal should be that a researcher can find the correct match without too much trouble — ideally it would show up in the first 10 results. It sounds like you may have to edit your search terms in some cases, but that doesn't seem unreasonable to me.

WendyLBelcher commented 4 years ago

- [ ] Make sure all incipits are picked up from file of canonical incipits and appear in spreadsheet

This is the only incipit problem I still see. Once solved, we can close this issue.

For some reason, the incipits in Jeremy's file are not all coming through to the spreadsheet. Many of the ones with hyphens have no incipits in the spreadsheet, even though they do appear in Jeremy's file:

141 ወሀሎ፡ አሐዱ፡ ኤጲስ፡ ቆጶስ፡ በሀገረ፡ ስዒድ፡ ዘላዕላይ፡ ግብጽ፡ ኅሩይ፡ በኵሉ፡ ግዕዙ፡ ወሠናይ፡ ምግባሩ፡ ወርቱዕ፡ ፍናዊሁ፡ በሃይማኖተ፡ ክርስቶስ፡ ወኢያደሉ፡ ለገጸ፡ ሰብእ፡ በውስተ፡ ፍትሕ፡ በመዋዕለ፡ ሢመቱ EMML 3872
141 ወሀሎ፡ አሐዱ፡ ኤጲስ፡ ቆጶስ፡ ኄር፡ በኵሉ፡ ግብሩ፡ ወኢያደሉ፡ በመዋዕለ፡ ሢመቱ EMML 1573
141B ወሀሎ፡ አሐዱ፡ ብእሲ፡ እምሰብአ፡ ፌጻ፡ ዘስሙ፡ ጳሪቆስ፡ ወአሐተ፡ ዕለተ፡ ሖረ፡ ኀበ፡ አባ፡ ፊላታዎስ፡ ኤጲስ፡ ቆጶስ፡ ዘሰፈየት፡ ሎቱ፡ ሠቀ፡ እግዝእትነ፡…. ወተአምነ፡ ኀቤሁ፡ ኀጣውኢሁ EMML 2059
141B ተአምሪሃ፡ ለእግዝእትነ፡…. ጥእምተ፡ ስም፡ ንግሥተ፡ አርያም፡ መድኃኒተ፡ ኵሉ፡ ዓለም። ወእንዘ፡ ይነብር፡ ኤጲስ፡ ቆጶስ፡ አሐተ፡ ዕለተ፡ እምመዋዕል፡ መጽአ፡ ኀቤሁ፡ ብእሲ፡ ወተአምነ፡ ኀጢአቶ፡ ወኢተወክፎ፡ ውእቱ፡ ኤጲስ፡ ቆጶስ EMML 683
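A check like the one described here could be scripted once both sources are loaded. The sketch below assumes each source has already been read into a dictionary mapping a Macomber ID to its list of incipits; those structures and the function name are hypothetical, not the project's actual import code.

```python
def missing_incipits(canonical, spreadsheet):
    """List IDs that have incipits in the canonical file but none in the sheet.

    Both arguments are assumed to map a Macomber ID to a list of incipits.
    """
    return sorted(
        macomber_id
        for macomber_id, incipits in canonical.items()
        if incipits and not spreadsheet.get(macomber_id)
    )
```

Any ID it returns is a candidate for an import problem, such as an ID that was written differently in the two sources.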
rlskoeser commented 4 years ago

@WendyLBelcher I corrected the incipit file you gave me based on your findings before, but I haven't done a fresh conversion/import to confirm that it worked. Do you want to test this again?

WendyLBelcher commented 4 years ago

that would be great.

rlskoeser commented 4 years ago

Closing based on past testing and conversation at project team meeting. (Reviewed the revised import with missing incipits and corrected one more stray id.)