PhyloStar opened 6 years ago
I have the datasets, but do we have the permissions to upload these on GitHub? @erathorn?
If yes, which ones? Or all of them?
Look for licenses, ccby will allow you to share.
@LinguList, I just realised these are a subset of the datasets that we used for the svmcc project. Can I safely assume that we can upload them in the same way here?
Sure!
@erathorn, I will add them to the repo as soon as you give me the green light... unless there are other considerations I am unaware of :)
If we do not have any problem with the licenses, go ahead.
Added with f39c92d4db1fe46420c78f8ad5e25a6bdacec061. However, as I took them from the svmcc project, some of them are in IPA (i.e. their original transcriptions). Should I: (1) replace them with the already-converted ASJP versions found in Taraka's repo; or (2) add code to the repo to do the conversion automatically (i.e. users will be able to throw both IPA and ASJP datasets at our scripts)?
If you want to make the comparison with LingPy's performance, as in your draft, you would need to feed LingPy the IPA segments, AND make sure to discard all languages with bad mutual coverage (see https://github.com/digling/edictor-tutorial/raw/master/list-2017-edictor-tutorial.pdf, a recent tutorial where mutual coverage is explained in full; it is implemented in LingPy and was a problem when dealing with the SVM approach, so let me know if you need more info). LingPy accepts ASJP letters if you do the schema conversion, but of course you then break down the alphabet multiple times and will reduce sounds more than usual.
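As a toy illustration of the mutual-coverage idea mentioned above (not LingPy's actual implementation; the mini wordlist and function name are made up): mutual coverage of a language pair is the number of concepts for which both languages have entries, and languages with low coverage against the rest would be discarded.

```python
from itertools import combinations

def mutual_coverage(wordlist):
    """For every pair of languages, count the concepts
    attested in both (their 'mutual coverage')."""
    concepts = {}  # language -> set of concepts with entries
    for language, concept in wordlist:
        concepts.setdefault(language, set()).add(concept)
    return {
        (a, b): len(concepts[a] & concepts[b])
        for a, b in combinations(sorted(concepts), 2)
    }

# Hypothetical mini wordlist: (language, concept) pairs.
data = [
    ("German", "hand"), ("German", "foot"), ("German", "eye"),
    ("Dutch", "hand"), ("Dutch", "eye"),
    ("English", "foot"),
]
cov = mutual_coverage(data)
print(cov[("Dutch", "German")])   # 2 shared concepts
print(cov[("Dutch", "English")])  # 0 shared concepts
```

A real wordlist would of course have hundreds of concepts per language; the point is only that pairs with near-zero overlap give the clustering algorithms nothing to work with.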
It should also be clear that your code will be most useful if people can present their data in IPA as well (albeit that it should be segmented). If not, you'll be forcing people to convert to ASJP first, which is very inconvenient and may scare people off...
The example dataset in my repo has a segmented IPA column. So I do allow users to choose between alphabets: ASJP, SCA, and Dolgo.
I think this discussion raises another question regarding which alphabet performs well across the datasets. I think this is a different question, not directly relevant to the paper.
Another thought, not relevant to the paper: is it possible to derive the weights for the sound transition graph in LexStat automatically from the data? One needs a criterion to optimize in such a case.
Another thing I noticed about the OnlinePMI implementation in this repo is that the results presented in the draft paper are based on a PMI matrix trained on a separate training dataset. The trained PMI matrix is then tested on different datasets. This is quite similar to Gerhard's method of training a PMI matrix.
In contrast, the PMI implementation I put in my repo does not require a separate training dataset and does not assume any initial set of probable cognates. It uses all the word pairs for synonyms in the dataset and calculates the PMI score for a segment pair through partial counts. A partial count is not 1 but a number between 0 and 1: the word pair's similarity score. Word pairs with high similarity contribute higher partial counts, whereas less similar word pairs contribute lower ones. I call this soft PMI because of the partial counts involved in the calculation of PMI.
For instance, a word pair that is not at all similar has a very low similarity score (almost zero) and hence contributes a low count to the PMI formula; highly similar word pairs contribute high counts. The important thing about this method is that it does not assume any cutoff for obtaining a probable cognate list, which Gerhard's PMI method requires. Moreover, the method is quite fast: it takes about an hour for the ABVD dataset (400 languages).
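The partial-count idea can be sketched as follows (a toy illustration only, not the actual OnlinePMI code; the alignments and similarity scores are made up): each aligned segment pair is counted with a weight equal to the word pair's similarity score, rather than with a count of 1.

```python
import math
from collections import defaultdict

def soft_pmi(aligned_pairs):
    """PMI over aligned segment pairs with partial counts:
    each occurrence is weighted by the word pair's similarity
    in [0, 1], so dissimilar pairs contribute little."""
    pair_counts = defaultdict(float)
    seg_counts = defaultdict(float)
    total = 0.0
    for alignment, similarity in aligned_pairs:
        for a, b in alignment:
            pair_counts[(a, b)] += similarity
            seg_counts[a] += similarity
            seg_counts[b] += similarity
            total += similarity
    pmi = {}
    for (a, b), count in pair_counts.items():
        p_ab = count / total
        p_a = seg_counts[a] / (2 * total)
        p_b = seg_counts[b] / (2 * total)
        pmi[(a, b)] = math.log(p_ab / (p_a * p_b))
    return pmi

# Hypothetical input: alignments plus word-pair similarity scores.
pairs = [
    ([("h", "h"), ("a", "a")], 0.9),  # highly similar pair
    ([("n", "n"), ("a", "a")], 0.8),  # highly similar pair
    ([("h", "n"), ("a", "a")], 0.1),  # dissimilar pair
]
scores = soft_pmi(pairs)
print(scores[("h", "h")] > scores[("h", "n")])  # True
```

In practice one would iterate: align with the current PMI matrix, recompute similarities, and update the partial counts until the matrix stabilizes.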
I wonder if it is worth including the modifications of PMI in the paper draft. If so, the paper will undergo another round of review; on the other hand, if we only make the changes suggested by the reviewers, the paper will go to print sooner. In the second case, the soft PMI method can be submitted elsewhere. I am looking forward to any suggestions in this regard.
> I think this discussion raises another question regarding which alphabet performs well across the datasets. I think this is a different question, not directly relevant to the paper.
I now know for sure that ASJP sound classes may help in certain datasets, and LexStat will have better results, but this was never quite followed up in my work, as I considered the gain unnecessary, although it might be quite substantial. This does not apply to the paper, but with our new CLTS package (under construction) we'll be able to yield features for almost any regular IPA sound in some dataset, so we can then start thinking about creating our own sound classes of different degrees of coarseness or fine-grainedness. I'd be curious to exchange ideas on this, as I don't really know how best to test it, but I have the intuition that we should search for the best way to cover the distinguishability of sounds with a minimal number of characters.
> Another thought, not relevant to the paper: is it possible to derive the weights for the sound transition graph in LexStat automatically from the data? One needs a criterion to optimize in such a case.
In principle, one can feed LexStat both initial alignments (based on some other sound graph) and even cognate sets to be used for the attested distribution; and if one creates a sound-class model on the fly that contains a learned scorer, this could also be plugged in. But I have no ideas on how to infer transition graphs automatically.
The first idea, about finding the smallest set of symbols, is important: the program should be able to come up automatically with the minimal set of symbols that are important for cognate recognition. I have no immediate ideas about how to achieve this.
Regarding the second one: can we use PMI values as weights for the sound transition graph? My understanding from PMI training is that PMI values are good for closely related languages that do not show multiple sound transitions. The weights for the edges in the sound transition graph can come from the PMI matrix, which can then go into the LexStat algorithm. I do not know of any way to find the weights of the transition graph automatically; it might require optimizing some language distance function like ASJP's LDND or Gerhard's method.
I have one simple idea that I have been wanting to test for a long time; I think it does not apply to the PMI algorithm, but could be plugged in later. E.g., to select from three available sound-class models, we would prefer the smallest model which still keeps different strings in the same language distinct, right? So this is an easy thing to do by simply creating a hash table of Dolgo, SCA, etc., and seeing how many mergers we have per language. E.g., if you have "ku" vs. "gu" and use SCA, these will merge to "KY", so in this case ASJP is the better choice, etc. But I think this could even be expanded by having an algorithm that clusters sounds into classes (maybe a learning algorithm) while avoiding mergers. This is an additional problem, but extremely interesting, right?
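The merger-counting idea can be sketched in a few lines (a toy illustration: the two sound-class maps below merely stand in for the real Dolgo/SCA/ASJP models shipped with LingPy):

```python
# Toy sound-class maps standing in for SCA/ASJP conversion;
# the real models live in lingpy and are far more complete.
MODELS = {
    "sca-like":  {"k": "K", "g": "K", "u": "Y", "a": "A"},
    "asjp-like": {"k": "k", "g": "g", "u": "u", "a": "a"},
}

def count_mergers(words_per_language, model):
    """Count, per language, how many distinct words collapse
    onto the same sound-class string under the given model."""
    mergers = 0
    for words in words_per_language.values():
        converted = ["".join(model[c] for c in w) for w in words]
        mergers += len(converted) - len(set(converted))
    return mergers

# Hypothetical data: 'ku' vs. 'gu' are distinct words in language A.
data = {"A": ["ku", "gu"], "B": ["ka", "ga"]}
for name, model in MODELS.items():
    print(name, count_mergers(data, model))
# The model with fewer mergers (here the ASJP-like one) keeps
# more strings distinct and would be preferred.
```

This is exactly the hash-table check described above: convert every word, hash the results per language, and count collisions.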
> It should also be clear that your code will be most useful if people can present their data in IPA as well (albeit that it should be segmented). If not, you'll be forcing people to convert to ASJP first, which is very inconvenient and may scare people off...
I agree. So we need something to handle the IPA -> ASJP conversion. @PhyloStar, I did not see any such code in your repo, your datasets are already converted and I assume that you used LingPy for that. Are we going to pull it in as a dependency or do another copy-paste?
> Another thing I noticed about the OnlinePMI implementation in this repo is that the results presented in the draft paper are based on a PMI matrix trained on a separate training dataset. The trained PMI matrix is then tested on different datasets. This is quite similar to Gerhard's method of training a PMI matrix.
If I understand this correctly, I believe you are wrong: the PMI matrix is created using only the data of the particular dataset the algorithm is run on, i.e. there is no additional input of a previously trained matrix. So I guess we are already doing that. It would not hurt to make it explicit in the paper, of course :)
I guess what @PhyloStar is referring to is that we should have the option to provide the algorithm with one dataset for training and another for testing, e.g. train on ASJP, test on ABVD.
@erathorn Yes, you are right. This is what I meant.
@pavelsof In the paper we always use ASJP data to train the PMI matrix; we never train the PMI matrix on the same dataset we test on. For instance, the code in the OnlinePMI repo uses the IELex dataset to train the PMI matrix and then clusters the words.
@LinguList I think this is an important and interesting problem: how to cluster the sound classes? Do you think it is possible to decompose a phoneme into its articulatory features and then cluster them? One could use a weighted Hamming distance to cluster the sound classes; learning the weights is the question. It again requires some criterion based on word similarity which has to be optimized. One way could be to jointly maximize the similarity between synonyms and minimize the similarity between non-synonyms in a Swadesh list. There may be much better criteria to optimize that are more sensible for a historical linguist.
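A weighted Hamming distance over feature vectors is straightforward to sketch (the feature vectors and weights below are made up for illustration; real ones would come from an articulatory decomposition of the IPA chart):

```python
def weighted_hamming(f1, f2, weights):
    """Weighted Hamming distance between two feature vectors:
    sum the weight of every feature on which the phonemes differ."""
    return sum(w for a, b, w in zip(f1, f2, weights) if a != b)

# Hypothetical feature vectors (say: voice, place, manner) and
# made-up weights; learning these weights is the open question.
features = {
    "p": (0, 0, 0),
    "b": (1, 0, 0),  # differs from p in voicing only
    "t": (0, 1, 0),
    "s": (0, 1, 1),
}
weights = (0.5, 1.0, 1.0)

print(weighted_hamming(features["p"], features["b"], weights))  # 0.5
print(weighted_hamming(features["p"], features["s"], weights))  # 2.0
```

Clustering would then run any standard agglomerative procedure on the resulting distance matrix, with the weights tuned against the chosen word-similarity criterion.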
Yes, I am working on a system that already covers more than 70% (1,200) of the sounds in PHOIBLE. This does not offer binary features, but the features are binarisable. But you know what I think would be even easier: simply cluster the sounds into classes, randomly, and use something like your CRP routine to get good clusters on a given dataset, the task being: with all N distinct sounds in the given dataset, make sure that, per language, words which are differently pronounced stay differently pronounced. If you want to look into this, I can provide you with feature data and also give initial information on how I'd compute the features, and we could compare the two approaches against each other. My gut feeling tells me that the blind approach I propose (distinctivity is all that counts) could reveal some interesting natural classes of sounds, even if there's no additional information. If you are in on this, @PhyloStar, let's maybe make a smallish repo; I'll start adding my lingpy code there and will also add some test sets coded in feature bundles.
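The "distinctivity is all that counts" idea can be sketched with a simple greedy stand-in for the CRP routine mentioned above (a toy illustration; the data is made up, and since distinctness is the only criterion, the classes found may be phonetically unnatural on so little data):

```python
def causes_merger(words_per_language, mapping):
    """True if the class mapping collapses two distinct words
    of the same language onto the same class string."""
    for words in words_per_language.values():
        converted = ["".join(mapping[c] for c in w) for w in words]
        if len(converted) != len(set(converted)):
            return True
    return False

def greedy_classes(words_per_language, sounds):
    """Greedily merge sound classes, rejecting any merge that would
    make two differently pronounced words in a language identical."""
    mapping = {s: s for s in sounds}  # start: one class per sound
    for i, a in enumerate(sounds):
        for b in sounds[i + 1:]:
            if mapping[a] == mapping[b]:
                continue  # already in the same class
            old = mapping[b]
            trial = {s: (mapping[a] if c == old else c)
                     for s, c in mapping.items()}
            if not causes_merger(words_per_language, trial):
                mapping = trial  # accept the merge
    return mapping

# Hypothetical data: 'ku' vs. 'gu' are distinct in language A,
# so 'k' and 'g' must end up in different classes.
data = {"A": ["ku", "gu"], "B": ["ku", "ka"]}
mapping = greedy_classes(data, ["k", "g", "u", "a"])
print(mapping["k"] != mapping["g"])  # True
```

A CRP-style sampler would explore merges stochastically instead of greedily, but the acceptance test (no per-language mergers) would be the same.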
[in the other thread]
@erathorn, @PhyloStar, thank you for the clarification, apparently I understood precisely the opposite. It is not difficult to add the possibility to train the matrix on one dataset and then use it on another, I can implement it if you say so.
More importantly though, it is still unclear to me what to do about the IPA datasets. The originals are in IPA and both algorithms can only consume ASJP, so the conversion has to happen somewhere. Either dynamically at each invocation or we upload them already converted. I personally favour the former because of the reasons pointed out by Mattis.
[in the main thread]
If you plan on making a public repo, please do provide the link. It is quite interesting to me even though I cannot really provide any input :)
Will do, and we can also admit you to a private one.
As to the conversion, this is actually simple if the data is tokenized (and you need tokenized IPA anyway). Have a look at LingPy's class2tokens function in lingpy.sequence.sound_classes. It is a five-liner, if I remember properly, and by providing source and target you can easily go back from aligned data to unaligned data. For ASJP conversion it is more difficult, and the current solutions are all a bit ad hoc. In this case it might still be most convenient to use LingPy as a dependency, given that we plan on extending its "pluggability" anyway, and we'll gladly announce through our channels that your code provides an enhancement to the basic LingPy cognate stuff.
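For a sense of what the IPA-to-ASJP step looks like, here is a toy sketch (this is not LingPy's implementation; the real conversion, e.g. LingPy's sound-class machinery with the ASJP model, covers the full inventory and handles diacritics, and the tiny mapping below is made up for illustration):

```python
# Toy IPA -> ASJP mapping for illustration only; the real
# conversion tables are much larger and handle diacritics.
IPA_TO_ASJP = {
    "tʰ": "t", "t": "t", "ɔ": "o", "x": "x",
    "ə": "3", "r": "r",
}

def tokens_to_asjp(tokens, mapping=IPA_TO_ASJP, unknown="?"):
    """Convert a list of segmented IPA tokens to ASJP symbols,
    falling back to a placeholder for unmapped segments."""
    return [mapping.get(token, unknown) for token in tokens]

# Assumes the input is already segmented (tokenized) IPA.
print(tokens_to_asjp(["tʰ", "ɔ", "x", "t", "ə", "r"]))
# ['t', 'o', 'x', 't', '3', 'r']
```

The fallback symbol makes unmapped segments easy to spot, which matters because silently dropping them would skew the PMI counts.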
@pavelsof There is already a function in utils using LingPy to convert to ASJP. In any case, here is the code that generates the tsv files in the uniform_data folder: https://github.com/PhyloStar/OnlinePMI/blob/master/uniform_data.py Just change or add as you see fit.
@LinguList I am in. I will reply in the other thread.
@LinguList @PhyloStar I would like to join your discussion as well. I can also provide some IPA feature decomposition into ternary features as a first testing ground. Also, this whole discussion reminds me a bit of compressibility, as well as some learning papers in the field of cultural evolution.
@pavelsof Why not use the LingPy tokenizer, or just have a small mapping of IPA to ASJP symbols? We should have something like this lying around.
@erathorn, to what degree can you cover a large set of diverse and bad characters with features? Have you tested how many characters you can represent, either in terms of PHOIBLE or on any of the test datasets? Our "cross-linguistic transcription system" might be interesting for you if you are interested in working on features in general. We're in the stage of writing up the paper, the code is getting closer to version 1.0, and I'll gladly share our ideas.
And as to compression: you are right, we could think of it like this. We started an email thread on this and can include you there.
@LinguList I am not sure what you mean by bad characters. I simply took all the IPA symbols in my dataset, sat down with the IPA chart, and converted them into said form. I have not checked PHOIBLE coverage; after a first glimpse, I am afraid it might not be too high, since I am currently disregarding all diacritics. I would be interested in your approach, sure.
@pavelsof Can you add a few example datasets to the repo?