Open emylonas opened 3 years ago
test_lemmatized.csv only has 30 rows. This is the actual CSV file: CSV output June 16.xlsx
@birkin Should I add a LATIN_CSV_NEWER_URL, or do you think it's safe to replace one of these?
@atbradley since I don't know if it's safe, I'd say go the 'newer' route.
I rearranged the columns in the new file to match the current one, uploaded it, and changed the settings file on dlib so it's looking at the new .csv. You can see the results at https://dlibwwwcit.services.brown.edu/iip/wordlist/latin/.
I think at least part of the problem is that the new file is missing a couple of columns compared with the existing one:
- POS Tagger Word, which I think is the original word fed into the lemmatizer/tagger. The CSV includes the original XML element the word came from. Can we recreate the word by just regexing the XML tags away?
- Part of Speech (Secondary Info), which contains noun cases and something else for verbs.

This works well, except that when you expand a lemma and see all the instances of that word, the link back to each inscription is wrong. It contains only the four initial letters, but not the following four digits and possible letter. I think there is a truncation error, because it's not a problem in the data.
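On the question above about recreating the word by regexing the XML tags away, here is a minimal sketch of what that could look like. The function name strip_tags and the sample elements are hypothetical; I have not checked them against the actual XML column in the CSV. It only removes tags and tidies whitespace, so it does not decode entities or apply any editorial rules about which child elements should count as part of the word.

```python
import re

# Matches any XML tag: opening, closing, or self-closing.
TAG_RE = re.compile(r"<[^>]+>")

def strip_tags(xml_fragment: str) -> str:
    """Remove XML tags from a fragment and collapse runs of whitespace."""
    text = TAG_RE.sub("", xml_fragment)
    return " ".join(text.split())
```

If the stripped text turns out to differ from what the tagger was originally fed (for example where abbreviations are expanded), a regex alone won't be enough and the column would need to come from the lemmatization output itself.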
It looks as if the HTML element is being generated here: https://github.com/Brown-University-Library/iip-production/blob/main/iip_smr_web_app/templates/wordlist/latin_wordlist.html, line 105.
I can't tell where the value of inscription_url is coming from. The CSV source file has something like caes0123.xml, but the resulting URL has only caes, which then causes the URL to fail.
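To make the expected behaviour concrete, here is a small sketch of deriving the inscription ID from the filename in the CSV and checking that it is complete. I don't know where inscription_url is actually built, so the names inscription_id and looks_complete are hypothetical and not taken from iip-production. The ID pattern (four letters, four digits, optional letter) is only what is described above.

```python
import re
from pathlib import PurePosixPath

# Four lowercase letters, four digits, and an optional trailing letter.
ID_RE = re.compile(r"[a-z]{4}\d{4}[a-z]?")

def inscription_id(filename: str) -> str:
    """Return the filename without its extension, e.g. 'caes0123.xml' -> 'caes0123'."""
    return PurePosixPath(filename).stem

def looks_complete(candidate: str) -> bool:
    """True if the whole string matches the expected inscription ID shape."""
    return ID_RE.fullmatch(candidate) is not None
```

A check like looks_complete could be run over the IDs the view produces to confirm whether the truncation happens before or after the template is rendered.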
The wordlist is here. This happens if you click on any lemma and then try to follow the inscription link in light blue.
The Word List part of the IIP website is generated by code that is in this repository (iip-production). There is a wordlist.html template in the templates directory: https://github.com/Brown-University-Library/iip-production/tree/main/iip_smr_web_app/templates/wordlist
There are other bits of wordlist code in /iip_smr_web_app/libs/wordlist/wordlist.py. For some reason, this file uses LATIN_CSV_NEW_URL throughout except on line 208, where it uses LATIN_CSV_URL, which I think points to an older file. This may be fine, but the discrepancy is worth checking.
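One quick way to check that discrepancy is to list the line numbers where each setting name appears in the source. This is just a sketch: the helper name setting_references is hypothetical, and it works on whatever text you pass in (for instance the contents of wordlist.py read from disk).

```python
import re

def setting_references(source: str, names):
    """Map each setting name to the 1-based line numbers where it appears as a whole word."""
    patterns = {name: re.compile(rf"\b{re.escape(name)}\b") for name in names}
    hits = {name: [] for name in names}
    for lineno, line in enumerate(source.splitlines(), start=1):
        for name, pattern in patterns.items():
            if pattern.search(line):
                hits[name].append(lineno)
    return hits
```

Calling it with ["LATIN_CSV_NEW_URL", "LATIN_CSV_URL"] should show at a glance whether the older setting is used in only one place.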
The CSV source file is at https://github.com/Brown-University-Library/iip-production/tree/main/iip_smr_web_app/templates/wordlist. It's referenced by the global variable LATIN_CSV_NEW_URL or LATIN_CSV_URL, which are set in /iip_smr_web_app/settings_app.py.
The new Latin CSV file, which has the lemmatization provided by the machine learning group in France, is attached: test_lemmatized.csv
This is formatted so it's very similar to the original CSV files but might need to have the columns adjusted a bit. I can help with that.
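As a rough illustration of the column adjustment, the sketch below reports which columns from the current file are absent in the new one and rewrites the new file in the current column order, leaving absent columns blank. The function names are hypothetical, and apart from "POS Tagger Word" and "Part of Speech (Secondary Info)", any column names used when trying it out are made up.

```python
import csv
import io

def missing_columns(current_header, new_header):
    """Columns present in the current file but absent from the new one, in current order."""
    present = set(new_header)
    return [name for name in current_header if name not in present]

def reorder(new_csv_text: str, target_header) -> str:
    """Rewrite CSV text so its columns follow target_header.

    Columns missing from the input are written as empty strings;
    columns not in target_header are dropped.
    """
    reader = csv.DictReader(io.StringIO(new_csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=list(target_header),
        restval="",
        extrasaction="ignore",
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()
```

Blank cells only keep the display code from failing on a missing column; they don't replace the data those columns are supposed to carry.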
Do you need any more details? Mina documented the process in the https://github.com/Brown-University-Library/iip-word-lists repository. That code is used to derive the CSV file from the IIP inscription files, but it is not aware of the word segmentation work that has been done to make the whole parsing process easier. There are 4 steps that handle segmentation and various types of lemmatization and POS tagging. The step 4 CSV output is the file that is used as input to the wordlist display code, so you don't need that repository and code base. However, Mina may have explained some of the steps for generating the display in the README file.