cldf-clts / clts-legacy

Cross-Linguistic Transcription Systems
Apache License 2.0

Extended data from Wikipedia. #103

Closed tresoldi closed 6 years ago

tresoldi commented 6 years ago

As discussed at https://github.com/cldf/clts/issues/102

This extends the Wikipedia data with ~150 sounds and orders them.

It is important to note that some sounds seem to be missing from BIPA. I've manually compiled their names using similar sound names as reference, but this should be checked. The sounds are (with the grapheme used on Wikipedia and my manually composed name):

tresoldi commented 6 years ago

Oops, there's a conflict; I'll resolve it.

tresoldi commented 6 years ago

I should have pulled before starting to work; I see wikipedia.tsv was changed by @LinguList on master in preparation for the paper.

There is no rush to fix this pull request, and it is probably better to wait for the paper to be evaluated; just tell me what you prefer me to do. My suggestion would be to extend the wiki.tsv file on master with any information (sounds) from this list.

LinguList commented 6 years ago

Yes, there's no rush; we have our numbers for the paper, and you did an especially great job of keeping up with the app and all the discussions remotely. Looking forward to pursuing our work!

tresoldi commented 6 years ago

Updated the new Wikipedia transcription data file. Some sounds are missing from BIPA; I've marked them in the NOTE field.

LinguList commented 6 years ago

I realize two things:

1. I should've told you that wikipedia.tsv should first be put into the sources/ folder, from which it can be linked automatically (or manually) to the transcriptiondata folder using the clts td command (I use that to make the semi-automatic linking easier, and one can test what is already covered).
2. The missing sounds call for an addition to the source code and for adding new features (linguo-labial); the voiced bilabial tap consonant should be added manually to BIPA, and the retroflex implosive series as well.

I think we can open separate issues for the missing sounds, so there is no need to add them to bipa now. By putting this file into sources now and deleting the sounds missing in bipa, we will have an automatic list of NA sounds that are not in bipa, which we can then systematically (or unsystematically) add in the future.
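The idea of getting "an automatic list of NA sounds" can be illustrated with a small sketch. This is not what the clts td command actually does; it only mimics the principle of marking every source row that cannot be linked to BIPA as NA. The column names (GRAPHEME, NAME, BIPA), the sample rows, and the KNOWN_BIPA set are all hypothetical stand-ins.

```python
import csv
import io

# Hypothetical stand-in for the graphemes BIPA already covers.
KNOWN_BIPA = {"p", "b", "t"}

# Hypothetical excerpt of a file in sources/; column names are assumptions.
SOURCE_TSV = (
    "GRAPHEME\tNAME\n"
    "p\tvoiceless bilabial stop consonant\n"
    "ᶑ\tvoiced retroflex implosive consonant\n"
    "b\tvoiced bilabial stop consonant\n"
)


def link_to_bipa(tsv_text, known):
    """Split rows into (linked, missing); every row gains a BIPA column."""
    linked, missing = [], []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if row["GRAPHEME"] in known:
            row["BIPA"] = row["GRAPHEME"]
            linked.append(row)
        else:
            # Sounds without a BIPA counterpart are marked NA for later review.
            row["BIPA"] = "NA"
            missing.append(row)
    return linked, missing


linked, missing = link_to_bipa(SOURCE_TSV, KNOWN_BIPA)
for row in missing:
    print(row["GRAPHEME"], row["NAME"], row["BIPA"], sep="\t")
```

The rows collected in `missing` are the candidates for separate issues, one per sound or per series.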

tresoldi commented 6 years ago

That makes sense; I assumed the NAs were coming from some private, temporary script of yours. I'll make the necessary updates to bipa and put the file in sources/.

On 3 Feb 2018 at 1:43 PM, "Johann-Mattis List" notifications@github.com wrote:

I realize two things:

1. I should've told you that wikipedia.tsv should first be put into the sources/ folder, from which it can be linked automatically (or manually) to the transcriptiondata folder using the clts td command (I use that to make the semi-automatic linking easier, and one can test what is already covered).
2. The missing sounds call for an addition to the source code and for adding new features (linguo-labial); the voiced bilabial tap consonant should be added manually to BIPA, and the retroflex implosive series as well.

I think we can open separate issues for the missing sounds, so there is no need to add them to bipa now. By putting this file into sources now and deleting the sounds missing in bipa, we will have an automatic list of NA sounds that are not in bipa, which we can then systematically (or unsystematically) add in the future.


LinguList commented 6 years ago

I should've added more documentation on how this is done. Something we can put on our todo-list for our official release of the package.

tresoldi commented 6 years ago

Took me a whole week for something that I could have done in five minutes... The current PR generates the data in transcriptiondata from the one in sources.

I haven't touched BIPA in terms of the missing sounds for two reasons. First, it seems more appropriate to do that in a separate PR; second, I'm still not entirely sure what should be changed (a look at __main__.py suggests that dealing with data/features.tsv would be enough, besides calling clts again to regenerate what is needed).

As it does not seem to be urgent, maybe we should work on documentation and some more tests first?

LinguList commented 6 years ago

I'll look into this on Wednesday. I have an urgent paper deadline before then...

LinguList commented 6 years ago

We can always do without adding the missing sounds, but the order of the columns in the file is now mixed up. If you look at the TSV header (or it's just too early on a Sunday morning for me), we have BIPA -> URL -> FEATURES (the same as the URL without the prefix, at least in the current version) -> Grapheme, but the order is mixed, and the features are not necessarily the same as the URL later on (although one could make all features consistent by just deleting the spaces).
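Both checks mentioned here, the column order and the agreement between FEATURES and the URL, can be sketched in a few lines. This is only an illustration under assumptions: the exact column names, the URL prefix (a placeholder, not the real one), and the sample row are hypothetical, and the real file may differ.

```python
# Column order as described in the comment above; exact names are assumed.
EXPECTED = ["BIPA", "URL", "FEATURES", "GRAPHEME"]

# Placeholder prefix, not the real one used in the data.
PREFIX = "https://example.org/wiki/"


def reorder(header, rows, expected=EXPECTED):
    """Return the rows with their cells rearranged into the expected order."""
    index = [header.index(name) for name in expected]
    return [[row[i] for i in index] for row in rows]


def features_match_url(features, url, prefix=PREFIX):
    """Compare FEATURES, with spaces deleted, to the URL without its prefix."""
    slug = url[len(prefix):] if url.startswith(prefix) else url
    return features.replace(" ", "") == slug


# Hypothetical row whose columns are in a mixed order.
header = ["GRAPHEME", "BIPA", "FEATURES", "URL"]
rows = [["b", "b", "Voiced_bilabial _stop", PREFIX + "Voiced_bilabial_stop"]]

for bipa, url, features, grapheme in reorder(header, rows):
    print(grapheme, features_match_url(features, url))
```

Rows for which `features_match_url` returns `False` after the reordering are the ones that need a manual look.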

LinguList commented 6 years ago

I haven't touched BIPA in terms of the missing sounds for two reasons. First, it seems more appropriate to do that in a separate PR; second, I'm still not entirely sure of what should be changed (taking a look at __main__.py suggests that dealing with data/features.tsv would be enough, besides calling again clts to regenerate what is needed).

data/features.tsv is one part to be changed, but if new sounds are introduced, the algorithm of course also requires that the respective sounds are added. If it's a diacritic, it can be handled by changing diacritics.tsv in bipa, but if it's more complex, it needs to be added to bipa/consonants.tsv, etc.

Regarding plain tests, the thing is more or less covered (@xrotwang would probably say "less"). But what you mean by documentation is what we call "contributing" in concepticon: how can you propose changes? It is probably easiest to run through this once I find time to add a new sound and describe what needs to be done before a new PR. But unfortunately, I am quite busy until the middle of next week, as I have an urgent paper deadline on unrelated topics...

tresoldi commented 6 years ago

First of all, no rush to answer, I imagine that besides your paper you probably also have the April workshop to organize.

This is a bit embarrassing, I can't get it right. On the matter:

I should be able to make it right now. I will modify the sources/wiki on master by only adding the missing sounds (i.e., no completely new information), run clts to check for the missing sounds, and add them to IPA. Sorry for so much trouble for such a little change, my mind is also occupied by a lot of other things. ;)

LinguList commented 6 years ago

No problem. We are under no pressure with this. Aiming for around April (or even the beginning of May) to publish online is extremely realistic, even if my plan of beautifying the clld css and adding some fancy colors turns out to be feasible ;) Even in the current version, the app could go online, but your input is extremely valuable to fill some gaps. Let's all take our time and reserve our stamina for April, when we can really start.

LinguList commented 6 years ago

Closing this now; please reopen in due time if the data is ready for this.