cldf-clts / clts-legacy

Cross-Linguistic Transcription Systems
Apache License 2.0

Report of Sound Comparisons CLTS errors #120

Closed LuPaschen closed 5 years ago

LuPaschen commented 5 years ago

Please report errors from Sound Comparisons here for processing. They will be moved to https://github.com/lingdb/Sound-Comparisons/issues/484 or https://github.com/cldf/clts/issues/121

LuPaschen commented 5 years ago

For some reason, the errors.md file in the repository gives a shorter list of errors than the error file created when I run the script on the same data on my laptop. I am attaching the error report here so you can compare. The main difference seems to be that in the attached file, all vowel sequences (even straightforward ones such as 'aɪ') are reported, whereas in the errors.md in the repository, V sequences are not included.
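For illustration, the kind of vowel-sequence check under discussion could be sketched as follows. This is a hypothetical toy version: the vowel inventory and the function name are illustrative only, not taken from the actual script.

```python
# Hypothetical sketch: flag vowel sequences (potential diphthongs) in a
# list of segmented IPA tokens. The vowel set below is a small
# illustrative subset, not the CLTS inventory.
IPA_VOWELS = set("aeiouɪʊɛɔæɐəɨʉ")

def vowel_sequences(segments):
    """Return maximal runs of two or more adjacent vowel segments."""
    runs, current = [], []
    for seg in segments:
        # crude check: look only at the base character, ignoring diacritics
        if seg and seg[0] in IPA_VOWELS:
            current.append(seg)
        else:
            if len(current) > 1:
                runs.append("".join(current))
            current = []
    if len(current) > 1:
        runs.append("".join(current))
    return runs

print(vowel_sequences(["h", "a", "ɪ"]))  # a straightforward diphthong: ['aɪ']
print(vowel_sequences(["k", "a", "t"]))  # no vowel sequence: []
```

Whether such sequences are then reported as errors or merged (as with the vowel_merge feature mentioned below) is exactly the behaviour that seems to differ between the two runs.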

tresoldi commented 5 years ago

I probably ran it before implementing the vowel_merge feature, or maybe I was tweaking the command-line parameters. It is also possible that our source files are different, as I was using the one Paul had sent me by email.

In short, I would trust your results more. :wink:

LinguList commented 5 years ago

> For some reason, the errors.md file in the repository gives a shorter list of errors than the error file created when I run the script on the same data on my laptop. I am attaching the error report here so you can compare. The main difference seems to be that in the attached file, all vowel sequences (even straightforward ones such as 'aɪ') are reported, whereas in the errors.md in the repository, V sequences are not included.

This is, by the way, again not related to CLTS, right? This is related to sound comparison...

xrotwang commented 5 years ago

Maybe - to prevent this kind of reproducibility problem - we should also have the data in a repository? There's already https://github.com/clld/soundcomparisons-data/ so maybe this would be a good place?

tresoldi commented 5 years ago

Indeed. And given that you can specify the TSV file from the command line, there is no need to change the code for that.
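Assuming the script exposes an interface along these lines (the argument names here are hypothetical, not the script's actual flags), passing the data file on the command line means a versioned file in a data repository can be used without any code change:

```python
# Hypothetical sketch of a command-line interface where the input TSV
# is an argument, so pointing the script at a file from a data
# repository requires no code change.
import argparse

def make_parser():
    parser = argparse.ArgumentParser(description="Run checks on a TSV file.")
    parser.add_argument("tsv", help="path to the input TSV file")
    parser.add_argument("--errors", default="errors.md",
                        help="where to write the error report")
    return parser

args = make_parser().parse_args(["soundcomparisons.tsv"])
print(args.tsv)     # soundcomparisons.tsv
print(args.errors)  # errors.md
```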

LuPaschen commented 5 years ago

> This is, by the way, again not related to CLTS, right? This is related to sound comparison...

Well, we do need to know if there is a general issue with the treatment of diphthongs or if something went wrong with running the script on our side...

LinguList commented 5 years ago

> Well, we do need to know if there is a general issue with the treatment of diphthongs or if something went wrong with running the script on our side...

From your message this was not clear, so please specify the detailed "failures" or discuss them on the soundcomparisons side.

xrotwang commented 5 years ago

To reiterate the reproducibility argument: which script are we talking about? Should it be included in this repo?

LinguList commented 5 years ago

this one. The SC team asked some time ago that @tresoldi help them run the lingpy analyses, as they want to have the alignments, so the script runs some checks on the data. In fact, we're waiting for @bibiko to actually look at the script and fix errors on the SC side first, as there are many problems with the transcriptions. The general questions resulting from using this script are, in my opinion, usually not general enough to be discussed here in full, specifically since we lose the context (we are talking about a script that has nothing to do with CLTS).

So in this thread, only suggestions regarding potentially missing feature values in CLTS should be collected, and I'd hope that @cormacanderson finds time to evaluate them. I also think that the IPA in SC is often too impressionistic, reflecting features that are barely found in the languages if one follows a broader phonetic transcription. I'd even bet that the features are at times due to misperception, or to individual speaker variation being perceived as true phonetic variation, but that's another issue.

LuPaschen commented 5 years ago

@LinguList If you can point to concrete examples where transcriptions in soundcomparisons are wrong, misperceived, or where the use of specific symbols or diacritics is otherwise unwarranted, we're happy to hear your input. From what I can tell, the majority of transcriptions were made by experts on the respective languages/dialects, and every phonetic detail is potentially valuable for linguistic comparison. Why should we limit ourselves to a fixed set of phonetic symbols when there are so many phonetic features that could potentially be of interest, especially considering that the number of varieties in soundcomp is constantly growing? But this is probably not the right place to discuss that.

Coming back to the question of the diphthongs, I've added some of the rejected tokens to the original post. I hope you agree that they should not be treated as errors.

cormacanderson commented 5 years ago

As the main crossover person between these two projects, i.e. SndCmp and CLTS, I feel it my responsibility to intervene here.

I don't see it as being as simple as the SndCmp team wanting the Edictor alignments. CLTS was assembled on the basis of data from a number of sources, and it is quite clear that we didn't get everything. To improve the coverage of CLTS we need further data, and SndCmp, with its narrow phonetic transcriptions, is ideal for this. Therefore, it's not one-directional. This is a win-win endeavour – SndCmp gets access to Edictor alignments, CLTS gets tested against an ideal dataset.

Pretty much all of what Ludger outlines there is straightforward IPA. CLTS should handle it, and if it doesn't, then we have to fix it. I will sit down with @tresoldi to try to add these diacritics to CLTS, along with others we have already identified. There is a further stimulus to do this given the news we have re the paper.

As for the concern re data quality, @LinguList, I'm glad that you take this seriously. However, I've seen the data quality in CLICS, in bits of Lexibank, and in the Concepticon, all with graver data quality problems than SndCmp in my book. It puzzles me that you focus on SndCmp in this regard. I also think that your suggestion of misperception, or of mistaking individual speaker variation for "true phonetic variation" (you mean "geographical variation", I presume, as even intra-speaker variation is still true phonetic variation), is unfair to the people who did those transcriptions, who in most cases are really expert phoneticians who know these families very well. There will surely be individual errors of this sort, but I don't think they are systematic. I second @LuPaschen in asking you to provide examples where you think there are mistakes – this is a collaborative project after all, and we want people to help us find errors.

As for identifying features that don't occur in a broader transcription, well, I think that's the point. We don't have other large resources like this of good narrow transcriptions that are machine readable (many of them are just languishing in the dialect atlases!). There are lots of things that we can't even begin to look at with broad transcriptions, but can with these, and all manner of questions about sound change that we can ask using narrow transcriptions but not broader ones. I'll happily share examples with you in person if you wish.

LinguList commented 5 years ago

Okay, I trust you, @cormacanderson, to get things done so that they can be implemented in CLTS. My job is not to question whether symbols are needed, but to make sure the code runs, as nobody was thinking of making this codable before. We'll launch the first version of CLTS at the end of November, when @xrotwang is here, and all data needs to be adjusted and ready by then.

We'll talk again about data consistency once we have everything mapped consistently to CLTS and have also corrected the problems in the SC data. Then we can count how many different segments we find there, discuss how many are needed in theory, and consider to what degree transcribing only limited language data leads to transcriptions that different transcribers would render differently. For the task of automatic language comparison, the data cannot be used as it stands: it is too sparse and too narrow, so no regular sound correspondences can be identified. I tested this on Germanic and Romance and can share the results. That means, in fact, that if you want to look into sound change, the major argument, regularity, is lost due to over-narrow transcriptions. But we can discuss this again once all alignments are readily assembled.

cormacanderson commented 5 years ago

Okay, good. I will coordinate also with @tresoldi on this.

As for the automatic language comparison, I will happily look at what you tested, although I have some ideas on things that can still be done with a relatively sparse dataset that simply cannot be done with broad transcriptions. Different questions require different types of data. In any case, I disagree with your argument that regularity is lost due to narrow transcriptions: if a pattern disappears with a higher-resolution picture, was it really a pattern in the first place? As you say, though, we can discuss this in person later on.

cormacanderson commented 5 years ago

Adding these updates to https://github.com/cldf/clts/issues/121. I will leave this thread open for reports of likely CLTS problems found in the SndCmp studies and will rename it to reflect that. After they are reported here, I will assess them and move them to https://github.com/cldf/clts/issues/121 when necessary.

tresoldi commented 5 years ago

Closing this issue as per https://github.com/cldf/clts/pull/123; once CLTS is released, we can rerun the analysis and check what is still pending (most of the problems should be fixed now).