Closed by LinguList 3 years ago.
In `phoible` etc., there are other sounds with `(?)` as well.
@LinguList It hurts being publicly accused, over and over, of not paying attention or of doing something wrong due to carelessness or idleness, such as with the profiles. If I had said I had mapped everything, you could have given me an iota of trust and checked the files I had submitted, or at least considered some other explanation (maybe I forgot to commit something, silly things like that). Please, can we start by stating the problems ("the files have `(!)` in them") instead of immediately heralding these narratives to the public ("you did not check what you converted", which implies that all I did was run a command, when in fact I not only fixed the unmapped graphemes but also made sure to painstakingly review each of the over 21,000 mappings across all sources)?
If you check the files I had prepared before you asked for the SYMBOLS column (the one my runs of `clts map` were not adding, because you did not merge the code for that until after you publicly complained it was missing), for example https://github.com/cldf-clts/clts/blob/248302a5f3f4d042806936d2aec6034c684d23c8/sources/phoible/graphemes.tsv , you will see that all files were fully mapped, with no `(!)` or `(?)` marks.
What happened is that your mapping code, by relying on the recent `is_valid_sound()`, reverted my mappings in cases where I used valid graphemes in the sense of "graphemes that are correctly parsed by `clts.bipa`". To use the first example in `phoible`, you can see from the diffs that the mapping for `ts͇` was reverted to `(?)ts͇` (here), although it is an accepted grapheme:

```python
>>> clts.bipa['ts͇']
<pyclts.models.Consonant: voiceless alveolar affricate consonant>
```
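To make the distinction concrete, here is a minimal sketch of how a mapping step can demote a grapheme that parses but fails a stricter validity check. This is not the actual pyclts code: `parses` and `is_valid_sound` below are stand-in stubs invented for illustration, mimicking the reverted-mapping behavior described above.

```python
def parses(grapheme, known=frozenset({'ts', 'p', 'a'})):
    # Stand-in for clts.bipa[grapheme]: pretend only these graphemes parse.
    return grapheme in known

def is_valid_sound(grapheme, canonical=frozenset({'p', 'a'})):
    # Stand-in for the stricter validity check: only canonical graphemes pass.
    return grapheme in canonical

def map_grapheme(grapheme):
    """Return the mapping, prefixing '(?)' when the sound parses but
    does not pass the validity check (mirroring the reverted mappings)."""
    if not parses(grapheme):
        return '<NA>'
    if not is_valid_sound(grapheme):
        return '(?)' + grapheme
    return grapheme

print(map_grapheme('ts'))  # parses, but fails the stricter check
print(map_grapheme('p'))   # parses and is valid
print(map_grapheme('x'))   # does not parse at all
```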
I assumed manual mappings (i.e., those without `(!)` etc.) would always be kept, and I admit I was wrong about that. The intention of your mapping code is clear when reading it in detail. Maybe I misunderstood you, and you don't want aliases even when they match the source grapheme. That is fine, and it is not my place to agree or disagree with your decision.
I will manually fix all files and push them, with no PR (it would be too big after regenerating the package and the app).
I agree that any speculation about why something is the way it is has no place in public GitHub comments.
Trying to explain the workflow in many emails and GitHub comments over the last two weeks has made me a bit impatient, also because this was eating up a lot of my time, and I have always felt quite left alone when it comes to maintaining the CLTS data. For me, "did not check" was a neutral comment, but I see that it can be read differently. It shows that the workflow is still not clear, however, so should I try to update the workflow procedure in this regard?
To emphasize the importance of the `valid_sound` check: this is the check we ultimately use when converting data from `sources/DATASET/graphemes.tsv` to `pkg/transcriptiondata/DATASET.tsv`, so if a sound that BIPA parses is not valid (which may happen), it won't pass, and won't make it into `pkg/transcriptiondata/DATASET.tsv`. So this check is quite important, and that was the reason why I inserted it into the `map` command (since it is used afterwards anyway).
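The filtering step described here can be sketched roughly as follows. This is an illustrative sketch, not the real conversion code: the `is_valid_sound` stub and the in-memory TSV are invented for the example, and the real check lives in pyclts.

```python
import csv
import io

# A tiny in-memory stand-in for a sources/DATASET/graphemes.tsv file;
# \u0347 is the combining "alveolar" diacritic used in graphemes like ts͇.
TSV = "GRAPHEME\tBIPA\nts\u0347\tts\u0347\nxx\t(?)xx\n"

def is_valid_sound(bipa):
    # Stub: treat mappings still carrying '(?)' or '(!)' marks as invalid.
    return not (bipa.startswith('(?)') or bipa.startswith('(!)'))

rows = list(csv.DictReader(io.StringIO(TSV), delimiter='\t'))
# Only rows passing the validity check would reach pkg/transcriptiondata/.
kept = [r for r in rows if is_valid_sound(r['BIPA'])]
print([r['GRAPHEME'] for r in kept])
```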
So should I try and make the workflow more transparent, or is the information in CONTRIBUTING.md enough?
Yes, thank you, I think it would help, as that is something I had not understood. I now realize this is probably the intended purpose of the `(?)` mark added here: precisely to mark graphemes that are parsed but which do not pass the `valid_sound` check.
I will be pushing the updated mappings soon.
There are not many updates needed, just a quick check of the question marks. Re-reading what I wrote in CONTRIBUTING.md, I think it should in fact suffice, and I'd suggest that you update it later, once we have settled on the workflow, to add your perspective, as my explanations do not seem to be that easy to understand. BTW, I also recommend reading the parsing procedure at pyclts...
I think we also have to be careful to keep the number of moving pieces manageable. With CLTS, pyclts, and pylexibank changing at the same time, things (and debugging in particular) get confusing. So maybe we should make a CLTS/pyclts pre-release as soon as possible?
I definitely support this. For an official release, I'd like to add three more transcription datasets, but we won't touch (I HOPE!) pyclts any more by then. So yes: we should make a pre-release as soon as possible. I just updated three datasets and also updated CONTRIBUTING.md to avoid misunderstandings (I assume this is clearer now, @tresoldi; if not, please update the information file yourself, or point to the cases which are unclear).
The clts data should be ready after what I am preparing; then I should be able to push in one hour at most. `allenbai`, `beidasinitic`, and `bdpa` would still need some consistency work (like adding the SYMBOLS column), but can be used as they are. @LinguList, do you prefer me to take care of them as well? I would not touch the mappings, but can prepare the files accordingly.
As for pyclts, I think there are no plans of further changes before the next release.
If you git-pull, you will see that I already took care of allenbai, beidasinitic, etc.
I have updated all files, regenerated pkg and app, and pushed with https://github.com/cldf-clts/clts/commit/68c74941fe444b2314e58e680d41efe7aabe7fd1
I checked, and all mappings in all files are either `<NA>` or pass `is_valid_sound()`. I suppose once this issue is closed we could release the new pyclts/clts-data?
As a pre-release, yes! We then need three more datasets for version 2.0.
So I'd propose `v2.0.rc1` as the version tag.
@tresoldi you updated package and app with HEAD of pyclts, correct? If so, I'd say we release pyclts first, and then run `clts check` with the released pyclts before releasing CLTS. Ok? And who is going to do that?
Oh, and ideally this procedure should be documented in a `RELEASING.md`.
@xrotwang yes, I don't think it would even work with any previous version. Releasing `pyclts` first sounds very good (I think pyclts, the library, should still work with the previous clts-data).
However, there is no `clts check`; maybe you mean `clts test`? It is currently raising some warnings...
```
(env_clts) tresoldi@shh.mpg.de@dlt5802808l:~/lexibank/clts$ clts test
INFO cldf-clts/clts at .
WARNING lˠ Sound not generated
WARNING áá
WARNING ã́ã́
WARNING éé
WARNING ẽ́ẽ́
WARNING íí
WARNING ĩ́ĩ́
WARNING óó
WARNING ṍṍ
WARNING à
WARNING á
WARNING â
WARNING ã̀
WARNING ã́
WARNING ᴀ̃/ã̱ Not matching codepoints
WARNING è
WARNING é
WARNING ê
WARNING ẽ̀
WARNING ẽ́
WARNING ì
WARNING í
WARNING î
WARNING ĩ́
WARNING ï Sound not an alias
WARNING ò
WARNING ó
WARNING ô
WARNING õ̀
WARNING ṍ
WARNING õ̂
WARNING ú
WARNING ṹ
WARNING ə́
WARNING ə̃́
WARNING ɿ
WARNING ɿ̃
WARNING ʅ
WARNING ʅ̃
WARNING ʮ
```
I don't really know how to interpret these warnings. @LinguList ideas? I'll make the pyclts pre-release, then.
"Not matching codepoints" means that the codepoints specified in a given transcription dataset do not match those that have been generated by CLTS.
I am not sure about the first, `lˠ`, but it is probably related to us using it instead of the IPA character for consistency. `ᴀ̃/ã̱` is probably mapped somewhere and missing; I will check. `ï` is surely a `+` where there should be none, in an alias column. I'll check the others.
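When chasing down "not matching codepoints" warnings, it can help to dump the codepoints behind each grapheme, since visually identical strings can be built from different codepoint sequences. The `codepoints` helper below is a hypothetical stdlib-only utility, not part of pyclts:

```python
import unicodedata

def codepoints(s):
    """List each character of s as 'U+XXXX NAME' for inspection."""
    return ['U+%04X %s' % (ord(c), unicodedata.name(c, 'UNKNOWN')) for c in s]

# The same-looking grapheme built from different codepoints:
print(codepoints('a\u0303'))  # decomposed: 'a' + combining tilde
print(codepoints('\u00e3'))   # precomposed: 'ã'
```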
This whole test batch is based on a set of pre-compiled sounds against which we decided to compare CLTS. The fact that we now receive these warnings means one should obviously check those sounds against the data.
The file is `test_data.tsv`, in `pyclts/data/`. One should update this with the new version. We need a PR for this, but I am away now, so it will have to wait until Monday next week.
That's ok, may give me time to write a couple more tests.
Very good. In fact, what I would also like to do, @xrotwang, is add one more column to all transcriptiondata files in `pkg/transcriptiondata/`: the SYMBOLS column, which is now by default added to all mapped files in `sources/DATASET/graphemes.tsv`.
This column is very good for debugging, as we have some hard-to-read symbols, and I think it can be healthy to expose the exact order also in the CLLD app.
@tresoldi, you did not check the files that you converted, as they have `(!)` marks, which mean that the mapping that was provided by an expert is not a valid sound: https://github.com/cldf-clts/pyclts/blob/ff806c08572bf57817e528d9d960b26c90ecb90f/src/pyclts/commands/map.py#L39-L40
So you really need to get used to manually checking the mapping after you did it, and this is the reason why we have all these checks of the data.