Princeton-CDH / pemm-scripts

scripts & tools for the Princeton Ethiopian Miracles of Mary project
Apache License 2.0

As a researcher, I want my incipit searches to take common synonyms into account so that my search results are more relevant. #27

Closed: thatbudakguy closed this issue 4 years ago

thatbudakguy commented 4 years ago

notes

initial request from @WendyLBelcher posted in #25:

  • The synonym file should be incorporated into the Incipit Tool; that is, it lists additional characters/words that the Incipit Tool should treat as interchangeable. We imported a lot of things from the Hamburg GitHub, but this is an additional list of things like "one" as interchangeable with "1". https://docs.google.com/spreadsheets/d/1GSMPHV6npdXvlfe_kTfUKBoFZ64DgnH6hoFYqkn0xRE/edit#gid=0

  • Just a note about the issue, not a task: one problem that arose with the synonym file is that "the synonyms only work with full words", so I stripped the synonym file of entries that weren't full words, specifically: በ- ዘ- ዝ- ዛ-. To explain further, በ- ዘ- ዝ- ዛ- would always appear at the beginning of words, that is, with a space in front of them, and the hyphen would not be typed. This is one issue with Ge'ez: as an Afro-Asiatic language, everything is formed as part of one word. So, "he gave it to her" is one word.
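To illustrate why prefix entries do not fit a whole-word synonym list, here is a minimal Python sketch of token-level lookup. It is not the project's Solr configuration. The names `SYNONYM_GROUPS`, `tokenize`, and `expand` are hypothetical, the 1/፩/አሐዱ group is assumed from the examples in this thread, and the ዘ/ዝ pairing is invented purely for illustration. Splitting on whitespace and the Ethiopic wordspace is also an assumption; the real analyzer may tokenize differently.

```python
import re

# Assumed groups. The 1/፩/አሐዱ group follows the examples in this thread;
# the ዘ/ዝ pairing is purely illustrative and is NOT from the spreadsheet.
SYNONYM_GROUPS = [
    {"1", "፩", "አሐዱ"},
    {"ዘ", "ዝ"},
]

# Map every term to the full group it belongs to.
LOOKUP = {term: group for group in SYNONYM_GROUPS for term in group}


def tokenize(text):
    """Split on whitespace and the Ethiopic wordspace (U+1361)."""
    return [t for t in re.split(r"[\s\u1361]+", text) if t]


def expand(text):
    """Return, for each whole token, the set of terms to search for."""
    return [LOOKUP.get(token, {token}) for token in tokenize(text)]


# A whole-word entry is expanded to its group.
print(expand("አሐዱ፡ ብእሲት"))

# A prefix entry never fires: the token is "ዘቦአት", not "ዘ".
print(expand("ዘቦአት"))
```

Because the lookup key must equal the entire token, an entry such as ዘ only matches when ዘ stands alone, which is why prefix entries have no effect in a list like this.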

See also: Solr documentation on synonyms https://lucene.apache.org/solr/guide/6_6/managed-resources.html#ManagedResources-Synonyms

rlskoeser commented 4 years ago

I forgot to respond to the comment about synonyms on #21 — moving here since this issue is specific to the synonyms.

Fifth, re synonyms, maybe it does help to have them. Maybe, in particular, "1 and one" or "፩ and አሐዱ፡", which will be the most common substitution. Or did you say you can't do it with single characters?

Here's the synonym configuration file I created based on the spreadsheet you provided. (I added the corresponding Arabic numerals.) I think the entries are working as synonyms but the matched synonym word is not being highlighted, though it's a little hard to tell. This might be another one you could test more explicitly by manufacturing some examples while you're testing #30.

WendyLBelcher commented 4 years ago

My only question is about the Arabic numerals. I'm not sure if we should do that? For instance, does that mean that the 19 in this incipit would be converted into Ge'ez: ሀለወት፡ አሐቲ፡ ብእሲት፡ ዘቦአት፡ ው(f. 19vb)ስተ፡ ቤተ፡ ሞቅሕ፤ ዘውእቱ፤ ጸማዕት፡ ይእቲ፤ ብእሲት ፨ ወሖራ፤ {er. } ኃቤሃ፡ ክልኤ፤ አሐተ፤ ውርዝዋት፤ በስኖን፤ ወልሂቃት፤ በምግባሮን፤ ከመ፤ የሐውጻሃ፤ ወሶበ፤ ርእየቶን፤ ተአምኃቶን፨…

rlskoeser commented 4 years ago

@WendyLBelcher I think it only applies to individual tokens, which right now means only 1-5. It doesn't convert the text; it just searches on all variants of that word.

I don't think there's likely to be any harm in including it, but I'd be glad to remove it because I'm not sure how helpful it is either.
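As a rough illustration of "searches on all variants" without converting the text, here is a small Python sketch of query-time matching. It is not the Solr implementation. `NUMERAL_GROUPS`, `tokenize`, and `matches` are hypothetical names, the groups pair only the Arabic digits 1-5 with the Ethiopic numerals (the real file may also include number words), and the tokenizer is an assumption that may differ from the actual analyzer.

```python
import re

# Assumed groups: Arabic digits 1-5 paired with the Ethiopic numerals.
NUMERAL_GROUPS = [
    {"1", "፩"}, {"2", "፪"}, {"3", "፫"}, {"4", "፬"}, {"5", "፭"},
]
VARIANTS = {term: group for group in NUMERAL_GROUPS for term in group}


def tokenize(text):
    """Split on whitespace and the Ethiopic wordspace (U+1361)."""
    return [t for t in re.split(r"[\s\u1361]+", text) if t]


def matches(query, document):
    """True if each query token, or one of its variants, is a document token."""
    doc_tokens = set(tokenize(document))
    return all(VARIANTS.get(q, {q}) & doc_tokens for q in tokenize(query))


# Searching for "1" finds a document that contains "፩".
print(matches("1", "፩ ብእሲት"))

# The folio reference is tokenized as "19vb)ስተ", which is not the token "1",
# so it is neither matched nor rewritten.
print(matches("፩", "ው(f. 19vb)ስተ፡ ቤተ፡ ሞቅሕ"))
```

In this sketch the stored incipit is never modified; only the set of terms compared against it grows, which is consistent with the "19" in the folio reference being left alone.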

WendyLBelcher commented 4 years ago

Okay, awesome. Let's leave as is, and I'm closing this issue.