Closed jcgood closed 1 year ago
The parsing issue also breaks, for example, the CLTS.iter_soundclass() method:
>>> list(c.iter_soundclass())
[...]
ValueError: Unrecognized features (duration: ultra-long, line 129))
This problem only appears when using current pyclts with HEAD of cldf-clts/clts. I.e. HEAD of cldf-clts/clts contains data that current pyclts cannot deal with as of now.
The correct way to handle this is to make sure you use compatible versions, and in particular released versions of the CLTS data, e.g.
cldfbench lexibank.makecldf ... --clts-version v2.2.0
with pyclts>=3.0.
You may also git checkout your data clone to an appropriate tag before using it with pyclts.
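The two suggestions above (a recent enough pyclts plus a data clone pinned to a release tag) could be scripted roughly as follows. This is only a sketch: the helper names are made up for illustration, the v2.2.0 / 3.0 pairing is the one example given above rather than a full compatibility table, and the version parsing is deliberately naive.

```python
import subprocess
from importlib import metadata


def version_tuple(version):
    """Turn '3.1.2' into (3, 1, 2); stops at the first non-numeric part
    and pads to three components so that '3' and '3.0' compare equal."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


def checkout_command(repo_dir, tag):
    """Build the git command that moves a clone to the given tag."""
    return ["git", "-C", str(repo_dir), "checkout", tag]


def pin_clts_data(repo_dir, tag="v2.2.0", minimum_pyclts="3.0"):
    """Hypothetical helper: check pyclts, then check out the data tag.

    The default tag and minimum version are taken from the example in
    this thread; adjust them for other releases.
    """
    installed = metadata.version("pyclts")  # raises if pyclts is not installed
    if not meets_minimum(installed, minimum_pyclts):
        raise RuntimeError(
            f"pyclts {installed} is older than required {minimum_pyclts}"
        )
    subprocess.run(checkout_command(repo_dir, tag), check=True)
```

Calling pin_clts_data("path/to/clts") before loading the data would then fail early on a version mismatch instead of partway through parsing.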
@jcgood It also turned out that the data at cldf-clts/clts was inconsistent. So now current pyclts should work with HEAD of cldf-clts/clts, and we will make sure to only merge PRs for the data after figuring out compatibility issues and possibly releasing an update of pyclts.
When converting data to CLDF, an error while loading the BIPA transcription system in CLTS (in this case caused by duplicated graphemes and by ultra-long not being associated with the models(?) of certain sounds, see Issue #45) broke the process that collects the information for writing out the TRANSCRIPTION.md file, and all words were treated as unsegmentable (even though they were all segmented in the CLDF). I patched this for my own work by changing some exceptions to warnings, especially since the problematic characters did not occur in the data I was using. (I didn't test what happens if they do.)
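For anyone wanting to try the same workaround, the general pattern of downgrading exceptions to warnings looks something like the sketch below. It is not the actual pyclts code: parse_features and load_rows are hypothetical stand-ins, and the feature strings are only there to mimic the error message quoted in this thread.

```python
import warnings


def parse_features(line_no, features, known):
    """Hypothetical strict parser: rejects any feature not in `known`."""
    unknown = [f for f in features if f not in known]
    if unknown:
        raise ValueError(
            "Unrecognized features ({}, line {})".format(", ".join(unknown), line_no)
        )
    return features


def load_rows(rows, known, strict=True):
    """Load (line_no, features) rows; with strict=False, bad rows are
    skipped with a warning instead of aborting the whole load."""
    loaded = []
    for line_no, features in rows:
        try:
            loaded.append(parse_features(line_no, features, known))
        except ValueError as err:
            if strict:
                raise
            warnings.warn(str(err))
    return loaded
```

With strict=False the rows that parse cleanly are still loaded, which is enough when the rejected sounds never occur in the dataset being converted; if they do occur, those sounds would simply be missing from the loaded system.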