Pali data calculations errors

ayya-vimala commented 8 months ago

Pali data calculations have to be redone due to two errors:

1. The sn, an, and dhp files created problems due to their numbers. This has been corrected in the repository: https://github.com/BuddhaNexus/segmented-pali/tree/master/inputfiles_cut_segments_on_space
2. The Pali parallels calculations have cut off part of the numbers, so the calculations do not match the actual segments. For instance, atk-s0101a:1271_0 up to atk-s0101a:1271_0 are all rendered as atk-s0101a:1271, so the correct segment can no longer be found. I've been trying to find out whether the error is in the json.gz files or whether the numbers are cut off somewhere during data loading, but my computer is too slow to open the json.gz files, so it's hard for me to check.

angirov commented 8 months ago

Thank you. I'll look into it soon

ayya-vimala commented 8 months ago

I found a way to look at the source json.gz files, and it looks like the _[0-9]+ segment number extensions have been removed, which is why those segments can't be found.
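
For anyone who wants to repeat this check on a slow machine, here is a minimal sketch that streams a json.gz file and counts segment ids with and without the _[0-9]+ extension, without parsing the JSON. The id pattern is an assumption based on the example atk-s0101a:1271_0 above, and `count_suffixes` is a hypothetical helper, not part of the BuddhaNexus code. A file stored as one very long line would still be read in one piece.

```python
import gzip
import re

# Assumed id shape, based on "atk-s0101a:1271_0":
# file name, colon, number, optional "_N" extension.
SEGMENT_ID = re.compile(r"[\w-]+:\d+(_\d+)?")


def count_suffixes(path):
    """Return (ids with a _N extension, ids without one) found in a .json.gz file."""
    with_suffix = 0
    without_suffix = 0
    # Read the compressed file as text, one line at a time.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            for match in SEGMENT_ID.finditer(line):
                if match.group(1):
                    with_suffix += 1
                else:
                    without_suffix += 1
    return with_suffix, without_suffix
```

If the first count is zero for a Pali file, the extensions were most likely already removed before the file was written.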

sebastian-nehrdich commented 8 months ago

@angirov What @ayya-vimala describes in the last comment could indeed be a problem in our code that generates the segmentnrs in the dvarapandita project. For Chinese we remove _[0-9]+, since that is folio-specific information which we don't want for Chinese, but apparently we need it for Pali. This code needs to be adjusted to make sure that we don't run into this problem on Pali files.
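
A minimal sketch of the kind of adjustment described above, assuming the segment number and a language code are both available where the extension is removed. The function name `normalize_segmentnr`, the language code `"pli"`, and the set `KEEP_SUFFIX_LANGS` are hypothetical and not taken from the dvarapandita code.

```python
import re

# Trailing "_<digits>" extension, as described in this thread.
FOLIO_SUFFIX = re.compile(r"_[0-9]+$")

# Hypothetical language codes for which the extension must be kept.
KEEP_SUFFIX_LANGS = {"pli"}


def normalize_segmentnr(segmentnr, lang):
    """Remove the _[0-9]+ extension unless the language needs it."""
    if lang in KEEP_SUFFIX_LANGS:
        # Pali: the extension distinguishes segments, so keep it.
        return segmentnr
    # Other languages (e.g. Chinese): drop the folio-specific extension.
    return FOLIO_SUFFIX.sub("", segmentnr)
```

With this, `normalize_segmentnr("atk-s0101a:1271_0", "pli")` leaves the id unchanged, while an id passed with any other language code loses its trailing extension.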

ayya-vimala commented 2 months ago

I think this is solved now.

sebastian-nehrdich commented 3 weeks ago

can we review this? I don't know if this still applies

ayya-vimala commented 3 weeks ago

I think this issue is solved.

BuddhaNexus / buddhanexus

Pali data calculations errors #232