CLTS parsing issue causes transcription information to break when creating CLDF

cldf-clts / pyclts

CLTS parsing issue causes transcription information to break when creating CLDF #47

Closed: jcgood closed this issue 1 year ago

jcgood commented 2 years ago

When converting data to CLDF, loading the BIPA transcription system in CLTS raised an error, in this case due to duplicated graphemes and the ultra-long duration feature not being associated with the models(?) of certain sounds (see Issue #45). This broke the process that collects the information written to the TRANSCRIPTION.md file, and all words were reported as unsegmentable, even though they were all segmented in the CLDF. I patched this for my own work by changing some exceptions to warnings, which seemed safe because the problematic characters did not occur in the data I was using. (I didn't test what happens when they do occur.)
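The workaround described above (downgrading load-time exceptions to warnings) can be illustrated with a minimal, self-contained sketch. This is not pyclts code: `FeatureError`, `KNOWN_DURATIONS`, `parse_row`, `load_sounds` and the `strict` flag are all hypothetical names made up for the illustration, and the feature set is not the real BIPA one.

```python
import warnings


class FeatureError(ValueError):
    """Hypothetical stand-in for the error raised on an unknown feature."""


# Hypothetical set of accepted duration values (NOT the real BIPA feature set).
KNOWN_DURATIONS = {"long", "mid-long", "ultra-short"}


def parse_row(lineno, grapheme, duration):
    """Parse one hypothetical sound definition; raise on an unknown feature."""
    if duration not in KNOWN_DURATIONS:
        raise FeatureError(
            f"Unrecognized features (duration: {duration}, line {lineno})"
        )
    return grapheme, duration


def load_sounds(rows, strict=True):
    """Load sound definitions from (grapheme, duration) pairs.

    With strict=True the first bad row aborts loading. With strict=False
    bad rows are skipped with a warning and the remaining rows stay usable.
    """
    sounds = {}
    for lineno, (grapheme, duration) in enumerate(rows, start=1):
        try:
            key, value = parse_row(lineno, grapheme, duration)
        except FeatureError as err:
            if strict:
                raise
            warnings.warn(f"skipping sound: {err}")
            continue
        if key in sounds:
            message = f"duplicate grapheme {key!r} on line {lineno}"
            if strict:
                raise ValueError(message)
            warnings.warn(message)
            continue
        sounds[key] = value
    return sounds
```

In non-strict mode a dataset that never uses the skipped sounds is unaffected. A dataset that does use them would still hit missing entries later, which is the untested case mentioned above.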

sliedes commented 2 years ago

The parsing issue also breaks, for example, the CLTS.iter_soundclass() method:

>>> list(c.iter_soundclass())
[...]
ValueError: Unrecognized features (duration: ultra-long, line 129))

xrotwang commented 1 year ago

This problem only appears when current pyclts is used with HEAD of cldf-clts/clts, i.e. HEAD of cldf-clts/clts contains data that current pyclts cannot handle yet.

The correct way to handle this is to make sure you use compatible versions, and in particular released versions of the CLTS data, e.g.

cldfbench lexibank.makecldf ... --clts-version v2.2.0

with pyclts>=3.0.

You may also check out an appropriate tag in your clone of the data (with git checkout) before using it with pyclts.
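As a rough illustration of the version constraint mentioned above (pyclts>=3.0), the following sketch checks the installed pyclts version before running a conversion. The helper names `version_tuple`, `is_at_least` and `installed_pyclts_version` are hypothetical and not part of pyclts. The comparison is deliberately simple and ignores pre-release suffixes, so "3.0rc1" is treated like "3.0".

```python
from importlib import metadata


def version_tuple(version):
    """Turn a version string such as '3.1.0' or 'v2.2.0' into a tuple of ints."""
    parts = []
    for piece in version.lstrip("v").split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break  # stop at suffixes such as 'rc1'
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_at_least(version, minimum):
    """Return True if `version` is greater than or equal to `minimum`."""
    a, b = version_tuple(version), version_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '3' and '3.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_pyclts_version():
    """Return the installed pyclts version string, or None if not installed."""
    try:
        return metadata.version("pyclts")
    except metadata.PackageNotFoundError:
        return None
```

A caller could, for instance, refuse to continue when `installed_pyclts_version()` returns None or when `is_at_least(found, "3.0")` is false. Which data release a given pyclts version supports still has to be taken from the release notes.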

xrotwang commented 1 year ago

@jcgood It also turned out that the data at cldf-clts/clts was inconsistent. Current pyclts should now work with HEAD of cldf-clts/clts, and we will make sure to merge PRs for the data only after figuring out compatibility issues and, if necessary, releasing an update of pyclts.