Tibetan TDZ files are being loaded as Pali

ayya-vimala commented 6 months ago

Running https://buddhanexus2.kc-tbts.uni-hamburg.de/api/menus/files/?language=pli returns all Pali files as requested, but also all the Tibetan TDZ files.

The original entry in tib-files.json is:

    {
        "category": "TZ36", 
        "textname": "Terdzo-CHI-001", 
        "filename": "TDZ-Terdzo-CHI-001", 
        "link": "https://rtz.tsadra.org/index.php/Terdzo-CHI-001", 
        "displayName": "zab mo gsang ba yongs 'dus las:_mkha' 'gro sprul sku snying thig gi tshogs mchod las byang rin chen snye ma ", 
        "filenr": 6174
    },

However, this gets returned as a Pali file with the wrong category:

    {
        "displayName": "zab mo gsang ba yongs 'dus las:_mkha' 'gro sprul sku snying thig gi tshogs mchod las byang rin chen snye ma ",
        "textname": "Terdzo-CHI-001",
        "filename": "TDZ-Terdzo-CHI-001",
        "category": "TD",
        "available_lang": null,
        "search_field": "zab mo gsang ba yongs 'dus las:_mkha' 'gro sprul sku snying thig gi tshogs mchod las byang rin chen snye ma  zab mo gsang ba yongs 'dus las:_mkha' 'gro sprul sku snying thig gi tshogs mchod las byang rin chen snye ma  Terdzo-CHI-001"
    },

It seems to me that the failure to retrieve the correct category comes from the category not being derivable from the segment numbers in the get_cat_from_segmentnr function in api/utils/. This makes me wonder whether the function is useful at all, or whether we could simply get the category from the filenames as listed in the data files in the first place.
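To illustrate the alternative, here is a minimal sketch of looking up the category per filename from the data file entries instead of deriving it from the segment number. It assumes entries shaped like the tib-files.json excerpt above; the helper name build_category_index is hypothetical and not part of the existing code.

```python
import json


def build_category_index(entries):
    """Map each filename to the category recorded in the data file."""
    return {entry["filename"]: entry["category"] for entry in entries}


# Sample entry taken from the tib-files.json excerpt above (abridged).
sample_entries = json.loads(
    """
    [
        {
            "category": "TZ36",
            "textname": "Terdzo-CHI-001",
            "filename": "TDZ-Terdzo-CHI-001",
            "filenr": 6174
        }
    ]
    """
)

category_index = build_category_index(sample_entries)
```

With such an index, the lookup for "TDZ-Terdzo-CHI-001" would give "TZ36" as recorded in the data file, rather than the "TD" the API currently returns.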

Also, these files are not listed in data/tib-collections.json, and data/tib-categories.json lists the wrong filenames for them: there the filenames look like "TZ36-CHI-001". The latter file also contains some wrongly formatted filenames, such as "TZ53~-'I-001".
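A quick check along these lines could flag the wrongly formatted filenames. This is only a sketch: the allowed pattern (alphanumeric parts joined by single hyphens) is my assumption about what a well-formed filename looks like, and find_suspect_filenames is a hypothetical helper, not existing code.

```python
import re

# Assumption: a well-formed filename consists of alphanumeric parts
# separated by single hyphens, e.g. "TZ36-CHI-001".
WELL_FORMED = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def find_suspect_filenames(filenames):
    """Return the filenames that do not match the assumed pattern."""
    return [name for name in filenames if not WELL_FORMED.fullmatch(name)]


# The two filenames quoted above from data/tib-categories.json.
suspects = find_suspect_filenames(["TZ36-CHI-001", "TZ53~-'I-001"])
```

If apostrophes or other characters turn out to be legitimate in some filenames, the pattern would need to be widened accordingly.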

sebastian-nehrdich commented 6 months ago

Vladimir should have a PR on the Tibetan metadata to remove the wrong entries from the files. For now, I think we have to update the regexes so that the language is identified correctly. The dataloader relies on this: it has no access to the menu files when loading the matches, so we need to be able to infer the language from the filename/segmentnr alone, without relying on external data sources. I will look into the get_cat_from_segmentnr code here.
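As a rough sketch of what regex-based inference from the filename/segmentnr alone could look like (this is not the current implementation): only the TDZ prefix comes from this issue, while the language code "tib", the "<filename>:<position>" segment number layout, and the names LANGUAGE_RULES and infer_language are assumptions for illustration.

```python
import re

# Ordered (pattern, language) rules; the first match wins.
# Only the TDZ rule is grounded in this issue. Rules for the other
# languages would have to be added here.
LANGUAGE_RULES = [
    (re.compile(r"^TDZ-"), "tib"),
]


def infer_language(identifier, default=None):
    """Infer the language from a filename or segment number alone."""
    # Assumption: a segment number has the form "<filename>:<position>".
    filename = identifier.split(":", 1)[0]
    for pattern, language in LANGUAGE_RULES:
        if pattern.match(filename):
            return language
    return default


language = infer_language("TDZ-Terdzo-CHI-001")
```

Keeping the rules in an ordered list means a specific prefix such as TDZ can be placed ahead of any broader pattern that would otherwise claim the same files.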

ayya-vimala commented 2 weeks ago

To check after BE upload

BuddhaNexus / buddhanexus

Tibetan TDZ files are being loaded as Pali #225