Open ayya-vimala opened 6 months ago
Vladimir should have a PR on the Tibetan metadata to remove the wrong entries from the files. I think for now we have to update the regexes to make sure that the language is identified correctly. The dataloader relies on this: it has no access to the menu files when loading the matches, so we need to be able to infer the language from the filename/segmentnr alone, without relying on external data sources. I will look into the get_cat_from_segmentnr code here.
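To illustrate what inferring the language from the filename/segmentnr alone could look like, here is a minimal sketch. Everything in it is an assumption made for illustration: the function name infer_language, the "filename:position" segment-number layout, and both regex patterns are hypothetical and are not taken from the BuddhaNexus code. The real prefixes would have to come from the data files.

```python
import re
from typing import Optional

# HYPOTHETICAL patterns, for illustration only. More specific patterns are
# listed first so that a broad fallback cannot claim their filenames.
LANGUAGE_PATTERNS = [
    ("tib", re.compile(r"^TZ\d+")),  # assumed: names shaped like "TZ36-CHI-001"
    ("pli", re.compile(r"^[a-z]")),  # assumed fallback: lowercase-initial names
]


def infer_language(segmentnr: str) -> Optional[str]:
    """Guess the language code from a segment number, using no external data.

    Assumes the segment number has the form "<filename>:<position>".
    Returns None when no pattern matches, rather than guessing.
    """
    filename = segmentnr.split(":", 1)[0]
    for lang, pattern in LANGUAGE_PATTERNS:
        if pattern.match(filename):
            return lang
    return None


print(infer_language("TZ36-CHI-001:12"))
print(infer_language("dn1:1.1"))
```

Returning None for an unmatched name, instead of defaulting to one language, would at least make misidentified files visible rather than silently filing them under the wrong language.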
To check after BE upload
When running
https://buddhanexus2.kc-tbts.uni-hamburg.de/api/menus/files/?language=pli
it returns all Pali files as requested, but also all the Tibetan TDZ files. The original as mentioned in tib-files.json is:
However, this gets returned as a Pali file with the wrong category:
It seems to me that the failure to retrieve the correct category comes from the category not being derivable from the segment numbers in the api/utils/ function get_cat_from_segmentnr. It makes me wonder whether this function is useful at all, or whether we could simply take the category from the filenames listed in the data files in the first place.

Also, these files are not listed in data/tib-collections.json, and data/tib-categories.json lists the wrong filenames. There the filenames look like "TZ36-CHI-001". The latter file also contains some wrongly formatted filenames, such as "TZ53~-'I-001".
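A quick way to surface wrongly formatted filenames of this kind could be a small check like the one below. It is only a sketch: the helper name find_malformed is hypothetical, and the rule it applies (a filename consists of ASCII letters and digits in groups separated by single hyphens) is an assumption based on the two examples above, not a documented naming convention. It takes a plain list of names, since the structure of the JSON files is not assumed here.

```python
import re
from typing import Iterable, List

# ASSUMED rule: groups of ASCII letters/digits joined by single hyphens.
WELL_FORMED = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def find_malformed(filenames: Iterable[str]) -> List[str]:
    """Return the filenames that do not match the assumed naming rule."""
    return [name for name in filenames if not WELL_FORMED.fullmatch(name)]


# The two filenames quoted in this issue.
print(find_malformed(["TZ36-CHI-001", "TZ53~-'I-001"]))
```

This only catches stray characters. It would not detect a filename that is well formed but simply does not exist in tib-files.json, which would need a comparison against that file.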