Make Word Lists work with new Latin data

emylonas commented 3 years ago

The Word List part of the IIP website is generated by code that is in this repository (iip-production). There is a wordlist.html template in the templates directory: https://github.com/Brown-University-Library/iip-production/tree/main/iip_smr_web_app/templates/wordlist

There are other bits of wordlist code in /iip_smr_web_app/libs/wordlist/wordlist.py. For some reason, this file uses LATIN_CSV_NEW_URL throughout except on line 208, where it uses LATIN_CSV_URL, which I think points to an older file. This may be fine, but the discrepancy is worth checking.
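One quick way to check the discrepancy is to list every line that mentions either setting. This is only a sketch: the two setting names come from this issue, but `find_setting_references` is a hypothetical helper and not part of the repository.

```python
import re

# The two setting names are taken from this issue; the helper itself is illustrative.
SETTING_NAMES = ("LATIN_CSV_NEW_URL", "LATIN_CSV_URL")


def find_setting_references(source_text, names=SETTING_NAMES):
    """Return (line_number, name) pairs for each whole-word use of a setting name."""
    # \b keeps a name from matching inside a longer identifier.
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    hits = []
    for line_number, line in enumerate(source_text.splitlines(), start=1):
        for match in pattern.finditer(line):
            hits.append((line_number, match.group(1)))
    return hits
```

To run it against the real file, read wordlist.py into a string (for example with `pathlib.Path(...).read_text()`) and pass that string in. Any line that reports LATIN_CSV_URL is a candidate for review.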

The CSV source file is at https://github.com/Brown-University-Library/iip-production/tree/main/iip_smr_web_app/templates/wordlist and is referenced through the global variables LATIN_CSV_NEW_URL or LATIN_CSV_URL, which are set in /iip_smr_web_app/settings_app.py

The new Latin CSV file, which has the lemmatization provided by the machine learning group in France, is attached: test_lemmatized.csv

This is formatted so it's very similar to the original CSV files but might need to have the columns adjusted a bit. I can help with that.

Do you need any more details? Mina documented the process in the https://github.com/Brown-University-Library/iip-word-lists repository. That code is used to derive the CSV file from the IIP inscription files, but it is not aware of the word segmentation work that has been done to make the whole parsing process easier. There are 4 steps that handle segmentation and various types of lemmatization and POS tagging. The step 4 CSV output is the file that is used as input to the wordlist display code, so you don't need that repository and code base. However, Mina may have explained some of the steps for generating the display in the README file.

atbradley commented 3 years ago

Is there a larger file? test_lemmatized.csv only has 30 rows.
What is each column in this?

emylonas commented 3 years ago

This is the full CSV file: CSV output June 16.xlsx

atbradley commented 3 years ago

@birkin Should I add a LATIN_CSV_NEWER_URL or do you think it's safe to replace one of these?

birkin commented 3 years ago

@atbradley since I don't know if it's safe, I'd say go the 'newer' route.

atbradley commented 3 years ago

I rearranged the columns in the new file to match the current one, uploaded it, and changed the settings file on dlib so it's looking at the new .csv. You can see the results at https://dlibwwwcit.services.brown.edu/iip/wordlist/latin/.
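For reference, the column rearrangement can be sketched with the standard `csv` module. This is a rough illustration and not the exact procedure used: `reorder_columns` is a hypothetical helper, and the real column names have to be taken from the header row of the existing CSV.

```python
import csv
import io


def reorder_columns(csv_text, target_columns, fill=""):
    """Rewrite CSV text so its columns follow target_columns.

    Columns missing from the input are written with the fill value;
    input columns not listed in target_columns are dropped.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=target_columns,
        restval=fill,            # value written for columns the input lacks
        extrasaction="ignore",   # silently skip columns we do not want
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()
```

Writing an empty value for an absent column keeps the file in the expected layout, but it does not supply the missing data.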

I think at least part of the problem is that the new file is missing a couple of columns compared with the existing one:

POS Tagger Word, which I think is the original word fed into the lemmatizer/tagger. The csv includes the original XML element the word came from. Can we recreate the word by just regexing the XML tags away?
Part of Speech (Secondary Info), which contains noun cases and something else for verbs.
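On the question of recreating the word from the XML: rather than a regex, a parser can pull out the text content directly. The sketch below assumes each CSV cell holds a well-formed XML fragment with no undeclared namespace prefixes or entities, which I have not verified against the real data. `word_from_xml` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def word_from_xml(fragment):
    """Return the text content of an XML fragment with all markup removed."""
    # Wrap in a dummy root so a fragment with several top-level nodes still parses.
    root = ET.fromstring("<root>" + fragment + "</root>")
    # itertext() yields every text and tail string in document order.
    return "".join(root.itertext()).strip()
```

This keeps the text inside every child element, so whether expansions or supplied text should count as part of the word is an editorial decision that would need to be made separately.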

emylonas commented 3 years ago

This works well, except that when you expand a lemma and see all the instances of that word, the link back to each inscription is wrong. It contains only the four initial letters, not the four digits and optional letter that follow. I think there is a truncation error, because it's not a problem in the data.

It looks as if the HTML element is being generated here: https://github.com/Brown-University-Library/iip-production/blob/main/iip_smr_web_app/templates/wordlist/latin_wordlist.html l. 105. I can't tell where the value of inscription_url is coming from. The csv source file has something like caes0123.xml but the resulting URL has caes which then causes the URL to fail.
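To illustrate what the link needs, here is a minimal sketch of deriving the full inscription ID from the filename in the CSV. The ID shape (four letters, four digits, optional trailing letter) is taken from the description above. `inscription_id_from_filename` is a hypothetical helper, and I am not claiming this is where the actual bug is.

```python
import re
from pathlib import PurePosixPath

# Assumed ID shape: four lowercase letters, four digits, optional trailing letter,
# for example caes0123 or caes0123a.
INSCRIPTION_ID = re.compile(r"^[a-z]{4}\d{4}[a-z]?$")


def inscription_id_from_filename(filename):
    """Return the inscription ID for a filename such as 'caes0123.xml'."""
    # .stem drops only the final extension and leaves the digits in place.
    stem = PurePosixPath(filename).stem
    if not INSCRIPTION_ID.match(stem):
        raise ValueError("unexpected inscription filename: " + repr(filename))
    return stem
```

A pattern that matches only letters, or a slice that keeps the first four characters, would produce the truncated value caes, so those are the kinds of expressions worth looking for in the code that builds inscription_url.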

The wordlist is here. This happens when you click on any lemma and then try to follow the inscription link in light blue.

Brown-University-Library / OLD-ARCHIVED_iip-production

Make Word Lists work with new Latin data #136