cldf-clts / clts

Cross-Linguistic Transcription Systems
https://clts.clld.org

Delete (!) graphemes from the automated mapping in the sources #78

Closed LinguList closed 3 years ago

LinguList commented 3 years ago

@tresoldi, you did not check the files that you converted, as they have (!) marks, which means that the mapping that was provided BY an expert is not a valid sound:

https://github.com/cldf-clts/pyclts/blob/ff806c08572bf57817e528d9d960b26c90ecb90f/src/pyclts/commands/map.py#L39-L40

So you really need to get used to manually checking the mapping after you have done it; this is the reason why we have all these checks on the data.

LinguList commented 3 years ago

In phoible etc., there are other sounds with (?) as well.

tresoldi commented 3 years ago

@LinguList It hurts being publicly accused, over and over, of not paying attention or of doing something wrong out of carelessness or idleness, as with the profiles. If I said I had mapped everything, you should have given me an iota of trust and checked the files I had submitted, or at least considered some other explanation (maybe I forgot to commit something, silly things like that). Please, can we start by stating the problems ("the files have (!) in them") and not immediately herald these narratives to the public ("you did not check what you converted", which implies that all I did was run a command, when in fact I not only fixed the unmapped graphemes but also made sure to painstakingly review each of the over 21,000 mappings across all sources)?

If you check the files I had prepared before you asked for the SYMBOLS column (the one my runs of clts map were not adding because you did not merge the code for that until after you publicly complained it was missing), for example https://github.com/cldf-clts/clts/blob/248302a5f3f4d042806936d2aec6034c684d23c8/sources/phoible/graphemes.tsv , you will see that all files were fully mapped, with no (!) or (?) marks.

What happened is that your mapping code, because it relies on the recent is_valid_sound(), reverted my mappings in cases where I used valid graphemes in the sense of "graphemes that are correctly parsed by clts.bipa". To use the first example in phoible, you can see from the diffs that the mapping for ts͇ was reverted to (?)ts͇ (here), although it is an accepted grapheme:

>>> clts.bipa['ts͇']
<pyclts.models.Consonant: voiceless alveolar affricate consonant>

I assumed manual mappings (i.e., those without (!) etc.) would always be kept, and I admit I was wrong about that. Reading your mapping code in detail, its intention is clear. Maybe I misunderstood you, and you don't want aliases even when they match the source grapheme. That is fine, and it is not my place to agree or disagree with your decision.

I will manually fix all files and push them, with no PR (it would be too big after regenerating the package and the app).

xrotwang commented 3 years ago

I agree that any speculation about why something is the way it is does not have a place in public github comments.

LinguList commented 3 years ago

Having tried to explain the workflow in many emails and github comments over the last two weeks has made me a bit impatient, also because this was eating up a lot of my time, and I have always felt quite left alone when it comes to maintaining the CLTS data. For me, the "did not check" was a neutral comment, but I see that it can be read differently. It shows that the workflow is still not clear, however, so should I try to update the workflow procedure in this regard?

LinguList commented 3 years ago

To emphasize the importance of the valid_sound check: this check is the one we ultimately use when converting data from sources/DATASET/graphemes.tsv to pkg/transcriptiondata/DATASET.tsv, so if a sound that BIPA parses is not valid (which may happen), it won't pass and won't make it into pkg/transcriptiondata/DATASET.tsv. So this check is quite important, and it is the reason why I inserted it into the map command (since it is used afterwards anyway).
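
To illustrate the difference (just a sketch, not the actual is_valid_sound() code; the path argument and the lookup by feature name are assumptions about how the API is used here):

# A grapheme that BIPA parses can still fail the validity check: regenerating
# the sound from its feature name may yield a different grapheme (e.g. for
# aliases or non-canonical diacritics), and such sounds get flagged.
from pyclts import CLTS

bipa = CLTS('.').bipa               # '.' = a local cldf-clts/clts checkout (assumption)
sound = bipa['ts͇']                  # parses to a voiceless alveolar affricate consonant
regenerated = bipa[sound.name]      # look the sound up again by its feature name
print(sound, regenerated)
print(str(sound) == str(regenerated))   # a mismatch here is what gets flagged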

So should I try and make the workflow more transparent, or is the information in CONTRIBUTING.md enough?

tresoldi commented 3 years ago

Yes, thank you, I think it would help, as that is something I had not understood. I now realize this is probably the intended purpose of the (?) mark added here: precisely to mark graphemes that are parsed but do not pass the valid_sound check.

I will be pushing the updated mappings soon.

LinguList commented 3 years ago

There are not many updates needed, just a quick check of the question marks. Re-reading what I wrote in CONTRIBUTING.md, I think this should in fact suffice, and I'd suggest that you update it later, once we have settled on the workflow, to add your perspective, as my explanations do not seem to be that easy to understand. BTW, I also recommend reading the parsing procedure in pyclts...

xrotwang commented 3 years ago

I think we also have to be careful to keep the number of moving pieces manageable. With CLTS, pyclts and pylexibank changing at the same time, things (and debugging in particular) get confusing. So maybe we should make a CLTS/pyclts pre-release as soon as possible?

LinguList commented 3 years ago

I definitely support this. For an official release, I'd like to add three more transcription datasets, but we won't touch (I HOPE!) pyclts any more by then. So yes: we should make a pre-release as soon as possible. I just updated three datasets and also updated CONTRIBUTING.md, to avoid misunderstandings (I assume this is clearer now, @tresoldi; if not, please update the information file yourself, or point to the cases which are unclear).

tresoldi commented 3 years ago

The clts data should be ready after what I am preparing, which I should be able to push in one hour at most. allenbai, beidasinitic, and bdpa would still need some consistency work (like adding the SYMBOLS column), but can be used as they are. @LinguList, would you prefer me to take care of them as well? I would not touch the mapping, but can prepare the files accordingly.

As for pyclts, I think there are no plans of further changes before the next release.

LinguList commented 3 years ago

If you git-pull, you see that I already took care of allenbai, beidasinitic, etc.

tresoldi commented 3 years ago

I have updated all files, regenerated pkg and app, and pushed with https://github.com/cldf-clts/clts/commit/68c74941fe444b2314e58e680d41efe7aabe7fd1

I checked, and all mappings in all files are either <NA> or pass is_valid_sound(). I suppose once this issue is closed we could release the new pyclts/clts-data?
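
In case it is useful, a quick way to re-run that kind of check locally is something like the following (a sketch using only the standard library; the column name BIPA and the sources/*/graphemes.tsv layout are assumptions here, so adjust if they differ):

import csv
import pathlib

# Print any mapping that still carries an automated (!) or (?) mark.
for path in sorted(pathlib.Path('sources').glob('*/graphemes.tsv')):
    with path.open(encoding='utf-8') as f:
        for row in csv.DictReader(f, delimiter='\t'):
            bipa = (row.get('BIPA') or '').strip()
            if bipa.startswith('(!)') or bipa.startswith('(?)'):
                print(path, row.get('GRAPHEME'), bipa)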

LinguList commented 3 years ago

As a pre-release, yes! We then need three more datasets for version 2.0.

xrotwang commented 3 years ago

So I'd propose v2.0.rc1 as version tag.

xrotwang commented 3 years ago

@tresoldi you updated the package and app with the HEAD of pyclts, correct? If so, I'd say we release pyclts first, and then run

clts check

with the released pyclts before releasing CLTS. Ok? And who is going to do that?

xrotwang commented 3 years ago

Oh and ideally, this procedure should be documented in a RELEASING.md

tresoldi commented 3 years ago

@xrotwang yes, I don't think it would even work with any previous version. Releasing pyclts first sounds very good (I think pyclts, the library, should still work with the previous clts-data).

However, there is no clts check, maybe you mean clts test? It is currently raising some warnings...

(env_clts) tresoldi@shh.mpg.de@dlt5802808l:~/lexibank/clts$ clts test
INFO    cldf-clts/clts at .
WARNING lˠ      Sound not generated
WARNING áá
WARNING ã́ã́
WARNING éé
WARNING ẽ́ẽ́
WARNING íí
WARNING ĩ́ĩ́
WARNING óó
WARNING ṍṍ
WARNING à
WARNING á
WARNING â
WARNING ã̀
WARNING ã́
WARNING ᴀ̃/ã̱     Not matching codepoints
WARNING è
WARNING é
WARNING ê
WARNING ẽ̀
WARNING ẽ́
WARNING ì
WARNING í
WARNING î
WARNING ĩ́
WARNING ï       Sound not an alias
WARNING ò
WARNING ó
WARNING ô
WARNING õ̀
WARNING ṍ
WARNING õ̂
WARNING ú
WARNING ṹ
WARNING ə́
WARNING ə̃́
WARNING ɿ
WARNING ɿ̃
WARNING ʅ
WARNING ʅ̃
WARNING ʮ
xrotwang commented 3 years ago

I don't really know how to interpret these warnings. @LinguList ideas? I'll make the pyclts pre-release, then.

LinguList commented 3 years ago

"Not matching codepoints" means that the codepoints specified in a given transcription dataset do not match those that have been generated by CLTS.
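
For debugging such a warning, listing the codepoints side by side usually makes the mismatch obvious; a minimal sketch with the standard library, using the pair from the output above:

import unicodedata

def codepoints(s):
    # one "U+XXXX NAME" entry per codepoint, in order
    return ', '.join('U+%04X %s' % (ord(c), unicodedata.name(c, 'UNKNOWN')) for c in s)

# the pair flagged as "Not matching codepoints" in the clts test output above
for grapheme in ('ᴀ̃', 'ã̱'):
    print(grapheme, '->', codepoints(grapheme))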

tresoldi commented 3 years ago

I am not sure about the first one, lˠ, but it is probably related to us using it instead of the IPA character for consistency. ᴀ̃/ã̱ is probably mapped somewhere and missing, I will check. ï is surely a + where there should be none, in an alias column. I'll check the others.

LinguList commented 3 years ago

This whole test bunch is based on a set of pre-compiled sounds which we decided to compare CLTS against. The fact that we now receive these warnings means one should obviously check those sounds against the data.

LinguList commented 3 years ago

The file is 'test_data.tsv', in pyclts/data/. One should update this with the new version. We need a PR for this, but I am away now, so this will have to wait until Monday next week.

xrotwang commented 3 years ago

That's ok, it may give me time to write a couple more tests.

LinguList commented 3 years ago

Very good. In fact, what I would also like to do, @xrotwang, is add one more column to all transcriptiondata files in pkg/transcriptiondata/, the column "SYMBOLS", which is now by default added to all mapped files in sources/DATASET/graphemes.tsv.

This column is very good for debugging, as we have some hard-to-read symbols, and I think it would be healthy to expose the exact order of symbols in the CLLD app as well.