Closed by gissoo 2 years ago
Increasing estimate from 3 to 5 to account for language ISO code work
@mrustow @richmanrachel I've been working on transcription search & keywords in context and have some questions about language (particularly language metadata & transcription language).
Some context: ideally, we need to add a language code attribute in the html when we display the transcription text to differentiate from whatever the default language is for the rest of the page content (e.g., English). This is particularly important for search engines and screen readers.
I've revised our Language+Script model to add a field for ISO codes, and I've written a migration that will populate the codes for languages used on documents that currently have transcriptions (and it will be viewable & editable in admin).
I was thinking that I should be able to use the primary language field for the code of the transcription — but in a lot of cases, that isn't set for documents with a transcription, and in other cases there are multiple primary languages.
My questions:
FWIW: probably only the general approach and multiple primary languages question are urgent; the rest are not blockers — if the language isn't specified, I'm currently setting the language attribute to an empty string to indicate it isn't the same language as the rest of the page. However, if there are multiple primary languages it's possible I'm setting the wrong value. (Which, maybe is fine to live with for now, as long as we have a plan to address.)
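A minimal sketch of the language attribute logic described above, for discussion only. The function names and the `transcription` CSS class are hypothetical, and returning an empty string when there are multiple primary languages is an assumption made here, not settled behavior. An empty `lang` attribute tells browsers and screen readers that the language is unknown rather than inherited from the page.

```python
from html import escape


def transcription_lang(primary_codes):
    """Pick a value for the lang attribute on transcription text.

    Assumption: use the ISO code only when exactly one primary
    language has a code; otherwise return "" (language unknown).
    """
    codes = [code for code in primary_codes if code]
    return codes[0] if len(codes) == 1 else ""


def render_transcription(text, primary_codes):
    """Wrap transcription text in a div with a lang attribute (hypothetical markup)."""
    lang = transcription_lang(primary_codes)
    return '<div class="transcription" lang="{}">{}</div>'.format(
        escape(lang, quote=True), escape(text)
    )
```

For example, `render_transcription("אמת", ["he"])` yields a div tagged `lang="he"`, while a document with no primary language, or with several, gets `lang=""`.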
@rlskoeser - first for the testing rounds. 1) The transcriptions show up and they look good! (I think the font is still wrong, but the spacing and layout look good to me.) 2) Searching for the transcription text isn't perfect. While my second search worked, the first one returned a record count (which seemed like a plausible number) but none of the documents showed up:
Yeah, this was the weird behavior I was seeing too. I was hoping it was something transitory! Can you tell if it's happening on anything besides transcription searches?
We haven't applied the fonts yet since they are still being finalized.
I'll keep testing some more things... but here I noticed that while Hebrew words were not getting highlighted in my last search, the Arabic word for God was successful:
Oh wait - @rlskoeser - it might only be highlighting from the description, not transcription....
Here Allah shows up in the transcription but is not highlighted...
Hmm, interesting. The transcription indexing right now is pretty simple and not language-specific (which is what we agreed on for the MVP), so it could be something related to that. We could look at the indexing analysis together at some point if that would be useful.
@rlskoeser - realizing I can still probably answer some of your questions now!
Am I thinking about primary language differently than you are? Does it make sense to use this as the language of the transcription?
- Yes, we are thinking about primary language differently, because for us the difference between Judaeo-Arabic and Hebrew really matters (so that the right researchers can look at the document) but for you they're probably the same (because they both use Hebrew characters), correct?
Why do some documents have so many primary languages? Any thoughts on how to label the language of transcription for these?
- Legal documents in particular use many languages because they are referring to legal precedents and arguments that took place over centuries from the Torah and Mishnah (Hebrew), Talmud (Aramaic), and local legal systems (Judaeo-Arabic, Greek, etc). I think script will probably be more helpful for you than language, as that will be more consistent.
It seems like a lot of documents with existing transcriptions do not have a primary language set. Can we remedy this? Is there any logic that would let us do a bulk update?
- Unfortunately there is no logic for a bulk update in regards to language. If we do need to just give you a bulk update for script, I think that assuming the writing is in Hebrew script will work for most documents that don't have "Arabic" in the description?
Do you know if there will be any major problems created by treating all Hebrew-script languages as "Hebrew" for the sake of the ISO? (Idk how they work in terms of trying to potentially correct spelling or anything).
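The script fallback suggested above (assume Hebrew script unless the description mentions "Arabic") could be sketched as follows. This is only an illustration of the heuristic, not something the thread adopted. The function name is hypothetical, and two details are assumptions added here: the match is case-insensitive, and mentions of "Judaeo-Arabic" are ignored, since that language is written in Hebrew characters.

```python
import re

# Assumption: "Judaeo-Arabic" / "Judeo-Arabic" should not count as
# a mention of Arabic script, since it uses Hebrew characters.
JUDAEO_ARABIC = re.compile(r"judae?o-arabic", re.IGNORECASE)


def guess_script(description):
    """Guess the script of a transcription from the document description."""
    remaining = JUDAEO_ARABIC.sub("", description or "")
    if "arabic" in remaining.lower():
        return "Arabic"
    # Default suggested in the thread: most other documents are in Hebrew script.
    return "Hebrew"
```

As noted in the thread, this would only be right for most documents, so any value set this way would still need to be editable.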
The script matters for font and formatting, but the language does matter also. I want to be sure to tag Judaeo-Arabic differently from Hebrew when we know that's what it is. IDK if there are any screen readers that handle Judaeo-Arabic (kind of hard to imagine?), but telling them that it's Hebrew seems like a bad idea!
When we get to the point of customizing the search indexing to be language-specific, it will matter there too. e.g., for Judaeo-Arabic I'm hoping we'll be able to adapt the NLP work to convert to Arabic so it can be indexed and stemmed as Arabic, which should make the search more powerful.
This is a good reminder that it will be important for our permanent transcription solution to handle the language tagging within the text, since they can be so mixed!
Good to know bulk update doesn't make sense – I think that's ok, since we can at least mark it as different from the main text language. I should revisit how I'm handling texts with multiple languages, and I'll look into including script information so we can take advantage of that for formatting and display (which, as you point out, will be useful).
@richmanrachel could you test this again? I was trying to duplicate the weird behavior we saw before and can't; if you're able to, please document the search terms that cause problems.
I'm wondering if maybe there was a lag with synchronizing the solr configset change (now that we're using solr replication issue with solr cloud) when we were first testing it.
@rlskoeser - It's working better but still not fully. For example, the highlighting only seems to work on the first 10 results.
It's unclear whether the last bit is working: Mirador is pulling the wrong transcription text, so I can't check whether the sample text is from the beginning.
ooh, good catch on the highlighting + pagination, it's entirely possible we're not doing something correct for non-first pages of results
were you able to duplicate any of the weird behavior we saw before?
I'm trying to repeat some of my earlier searches, and I see there's some larger break in logic after the 10th entry. Here, I typed in the Arabic word for God, and as you can see in the screenshot, up through #10 the results are correct and as expected. But after 10, it switches to Hebrew text with unclear logic:
I'm not having the same Solr issues as before with documents not showing up, thankfully.
And not all of the results after 10 are wrong, but they definitely don't have the highlighter feature...
@richmanrachel great, thank you — this is helpful.
@rlskoeser - You're welcome! Sorry I just get to point out the problems and don't know how to fix them, haha
@richmanrachel your insight about the highlighting stopping after the first ten and not working on subsequent pages helped me identify and fix the problem! Please confirm.
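The thread does not say what the fix was. One plausible cause, sketched here purely as an assumption, is that highlighting was requested with Solr's default of 10 rows while the result list used a different page size. The sketch builds request parameters so that highlighting covers exactly the rows on the current page. `start`, `rows`, `hl`, and `hl.fl` are standard Solr parameters. The function name, the `transcription` field name, and the page size are hypothetical.

```python
def solr_search_params(query, page=1, per_page=50, highlight_field="transcription"):
    """Build Solr query parameters for one page of highlighted results.

    Using the same start/rows for results and highlighting means every
    document on the page can receive keywords in context.
    """
    if page < 1:
        raise ValueError("page numbers start at 1")
    return {
        "q": query,
        "start": (page - 1) * per_page,  # offset of the first result on this page
        "rows": per_page,  # Solr defaults to 10 if this is omitted
        "hl": "true",
        "hl.fl": highlight_field,
    }
```

Checking `solr_search_params("אמת", page=2)["start"]` against the page size is a quick way to confirm that later pages are requesting their own highlights.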
@rlskoeser - Hooray! I think it's working properly now (certainly the issue with 10 is resolved).
I'm just confused by your final test query. Could you please clarify?
if a keyword search term does not match a document with transcription text, you should see the excerpted text from the beginning of the transcription, as on the search without a keyword term
@richmanrachel if your search term matches a document somewhere but not in the text of the transcription, then instead of keywords in context you should see the beginning of the transcription; this is what we show on search results that don't have keywords in context for transcription. Maybe you could test this by searching for a Hebrew term that occurs in a description rather than a transcription? Or search on a tag that will bring back documents with transcriptions?
Basically, you're checking that in adding the keywords in context highlighting for transcriptions, we haven't broken the previous functionality.
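The fallback being tested here can be sketched as a small helper. It assumes highlights arrive in the shape Solr returns them (document id, then field name, then a list of snippets). The function name, the `transcription` field name, and the excerpt length are hypothetical.

```python
def transcription_display(doc_id, highlights, transcription, excerpt_chars=150):
    """Return the text snippets to show for a document in search results.

    Prefer keywords in context when the search term matched the
    transcription; otherwise fall back to the start of the transcription.
    """
    snippets = highlights.get(doc_id, {}).get("transcription", [])
    if snippets:
        return snippets
    if transcription:
        # No match in the transcription: show an excerpt from the beginning.
        return [transcription[:excerpt_chars]]
    # Document has no transcription at all.
    return []
```

A document matched only through its description or a tag would take the second branch, which is the pre-existing behavior the test query is meant to confirm.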
@rlskoeser - there's rarely Hebrew text in the description that's not in the document, but I do think it's working? These are the last two entries from a search for אמת (truth), and it's correctly highlighting the search word (however, the last entry doesn't have a transcription):
I think I feel comfortable closing, if you do?
I think it is working too — thanks for your careful testing. I think we're good to finally close this one. Hooray! 🎉
testing notes
test using the public document search on the test site: https://test-geniza.cdh.princeton.edu/en/documents/
dev notes
index_data
(exclude line numbers and labels)