D-PLACE / dplace-data

The data repository for the D-PLACE Project (Database of Places, Language, Culture and Environment)
https://d-place.org
Creative Commons Attribution 4.0 International

Uvea east is wall1257, not west2516 #293

Closed HedvigS closed 3 years ago

HedvigS commented 3 years ago

I'm not sure what's going on here, but I think "UveaEast" should be assigned to wall1257 not west2516.

HedvigS commented 3 years ago

I used the taxon and ABVD ID matches Simon sent me elsewhere to update all glottocodes in this file based on the language table in lexibank/ABVD. If you just want to make the East Uvean change, I can revert the commits for that.

HedvigS commented 3 years ago

This resulted in more changes than I had anticipated. See this diff log for the other changes, such as ibat1238 -> ivat1242 for taxon Babuyan.

xrotwang commented 3 years ago

Some changes match to language level languoids, when before dialects were matched. So arguably, we'd lose precision. @SimonGreenhill can you comment on this?

kirbykat commented 3 years ago

Hi @HedvigS, I'd prefer not to do a mass re-matching of societies to languages based on a list I am not familiar with, unless the list is based on someone's work to compare original ethnographic sources to possible dialect/language matches. Happy to consider each suggested change, though. And yes, some D-PLACE societies are matched to dialect-level glotto IDs, to help distinguish them from sister societies that speak the same language.

kirbykat commented 3 years ago

So, can we revert for now, and can you send me more info on the source you used for this? (It is very possible there are mistakes - language matches in some areas have been more heavily scrutinized than others)

xrotwang commented 3 years ago

@kirbykat nothing has been merged into master yet, so we can just discuss the changes with this PR and figure out what to merge eventually.

HedvigS commented 3 years ago

@kirbykat Yeah we can revert anything you want in this branch. It's not merged into master yet.

I used a list of the taxa matched to ABVD IDs from Simon, and then matched those to glottocodes based on the lexibank version of the ABVD dataset and the language table there (i.e. this file) to create these new matchings of taxa to glottocodes. I've not used society IDs at all. It went taxon -> ABVD ID -> glottocode.
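
The chain described above can be sketched roughly as follows. This is only an illustration: the ABVD IDs are made up, the dictionary and function names are hypothetical, and the real data would come from the list Simon sent and the lexibank/ABVD language table.

```python
# Hypothetical sample data. The ABVD IDs "1" and "2" are invented for
# illustration; the glottocodes are the ones mentioned in this thread.
taxon_to_abvd = {"UveaEast": "1", "FijianBau": "2", "SomeOtherTaxon": "3"}
abvd_to_glottocode = {"1": "wall1257", "2": "fiji1243"}


def taxon_glottocodes(taxon_to_abvd, abvd_to_glottocode):
    """Compose taxon -> ABVD ID -> glottocode.

    Taxa whose ABVD ID has no entry in the language table map to None,
    so they can be reviewed by hand instead of being silently dropped.
    """
    return {
        taxon: abvd_to_glottocode.get(abvd_id)
        for taxon, abvd_id in taxon_to_abvd.items()
    }


mapping = taxon_glottocodes(taxon_to_abvd, abvd_to_glottocode)
unmatched = [taxon for taxon, code in mapping.items() if code is None]
```

Note that society IDs play no part in this lookup.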

The matching between taxons and glottocodes in the file at lexibank/ABVD looks better to me than this file at d-place data (for example it matches East Uvean to wall1257). But, there are several other matches that I can't evaluate. If they are incorrect, the underlying data for the language table at lexibank/ABVD should probably also be changed.

As you both noted, several times it looks like the lexibank/ABVD language table goes for language-level glottocodes rather than dialects (for example matching the taxon "FijianBau" to the language-level glottocode fiji1243 instead of the dialect glottocode bauu1243). I don't know why that is. If it's not a good idea, it should probably also not be the case at lexibank/ABVD. I agree that losing precision does not sound good.

HedvigS commented 3 years ago

I should say, the way I as a user understand this file of Gray et al. 2009 taxa here at d-place data is primarily as metadata on the original Gray et al. 2009 paper, not necessarily as the best matchings of D-PLACE society IDs to glottocodes. Personally I'm not using the society IDs at all. I'm only interested in assigning correct glottocodes to the tree tips that the Gray et al. paper describes, in the way that is the most faithful to the underlying ABVD data and the original paper.

kirbykat commented 3 years ago

@HedvigS, @xrotwang - ah, ok, sorry I didn't understand the changes were being made to the file that links the nexus tree for Austronesian to glottocodes (which we then use to link to xd_ids/soc_ids). The mismatches are probably an artifact of the iterations we have been through in terms of how we link tree taxa to D-PLACE societies -- if a society linked to a dialect was an accurate match for a particular tip, then it may be that the dialect rather than language was assigned to the tip at some point, because we didn't have a script that automated checking different 'levels' of glotto ID for matches. Also, (see below) in some cases manual matching provided more + better matches between societies and tree.

Here are some notes (last updated March 2017, but probably actually from much earlier - say, 2013!) on how trees were linked to D-PLACE societies. It is possible that @xrotwang and @SimonGreenhill have changed the system since then. In any case, @SimonGreenhill is definitely better placed to comment on the accuracy of the tree --> society matches.

Steps taken to link trees to D-PLACE societies:

  1. Link each tip on linguistic tree to (1) an iso 639-3 code, and (2) a glottolog language or dialect (doculect). Choose the most detailed glottolog assignation that is appropriate.

For example, the Austronesian tree includes two tree tips for the language “Marshallese”: Marshallese, and Marshalese ED. Rather than link both tips to the language-level glottolog id “mars1254”, the first was linked to the glottolog dialect “rali1241” (Rälik/Western) and the second to “rata1243” (Ratak/Eastern), both dialects of “mars1254” (iso “mah”) in the Glottolog classification.

In turn each of these dialects is linked to specific societies in D-PLACE: “rali1241” to society “Bikinians” and “rata1243” to societies “Majuro” and “Marshallese”.

  2. Produce a “mapping” file (CSV) where first column is taxa names from tree file, second column is iso code, third column is “finest level” glottolog id.

  3. Use this mapping file to map glottolog ids to societies.

Using the above example, this would map societies “Majuro” and “Marshallese” to “rata1243”; and “Bikinians” to “rali1241”.

  4. Next, for all “unmatched” tree tips, use the mapping file to map iso codes to societies.

Note the importance of doing step 3 before step 4: if step 4 had been done first, all three Marshallese societies would have been mapped to each of the two Marshallese tips in the Austronesian tree.

  5. In cases where a new match is found, check to see whether all societies should be linked to the tip, or just a subset of the societies (based on their dialects)

For example, in Step 1, the tree tip “Kiribati” was linked to glottolog id XXXX and iso code “gil”. There are two societies in D-PLACE that speak variants of this language: one has been linked to dialect “bana1287”, and the other to “nuii1237”. The automatic matching process of Step 3 will not link these societies to this tip. However, they will be linked via their iso code in Step 4.

We don’t have additional information on the origin of the linguistic data used to build the tree (i.e., if it is based on speakers from one of the two dialects). Therefore, we conclude that both dialects (and their associated societies) should be linked to the tree tip “Kiribati”. The glottolog id “xxxx” in the CSV file of Step 1 is replaced with “bana1287, nuii1237” for future use.

  6. Still to address: In a small number of cases, glottolog will not contain dialects for a given language, BUT the phylogeny includes distinct tips for different variants of a language, AND, those variants will be represented by societies in D-PLACE. In these cases, we would ideally have a mechanism by which to “force” the pairing of particular societies with a particular tree tip. This was originally done by including a column with xd_ids in the CSV file of Step 1.
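
The glottocode-first, iso-second order in the notes above (glottolog ids first, then iso codes for unmatched tips) could be sketched like this. It is a minimal illustration and not the actual D-PLACE script: the function name and data layout are assumptions, the two Kiribati society names are hypothetical (the notes do not name them), and "XXXX" is the placeholder the notes use for the Kiribati glottolog id.

```python
# Tree tips: name -> (finest-level glottolog id, iso 639-3 code).
tips = {
    "Marshallese": ("rali1241", "mah"),
    "Marshalese ED": ("rata1243", "mah"),
    "Kiribati": ("XXXX", "gil"),  # placeholder id, as in the notes
}

# Societies: name -> (glottolog id, iso 639-3 code).
societies = {
    "Bikinians": ("rali1241", "mah"),
    "Majuro": ("rata1243", "mah"),
    "Marshallese": ("rata1243", "mah"),
    "Kiribati society A": ("bana1287", "gil"),  # hypothetical name
    "Kiribati society B": ("nuii1237", "gil"),  # hypothetical name
}


def link_tips(tips, societies):
    """Link each tip to societies by glottolog id, falling back to iso code."""
    links = {}
    for tip, (glottocode, iso) in tips.items():
        # First pass: exact match on the glottolog id.
        matched = sorted(s for s, (g, _) in societies.items() if g == glottocode)
        if not matched:
            # Second pass, only for tips still unmatched: match on iso code.
            matched = sorted(s for s, (_, i) in societies.items() if i == iso)
        links[tip] = matched
    return links


links = link_tips(tips, societies)
```

With this sample data, each Marshallese tip is linked only to the societies of its own dialect, while the "Kiribati" tip picks up both societies through the iso fallback. Those fallback matches are the ones that would then need the manual check for whether all or only a subset of the societies belong to the tip.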

HedvigS commented 3 years ago

Thanks @kirbykat. I've meddled in this file before and made some updates, since this is a resource I use and know things about. Since I don't use the society IDs, I don't really have anything to add there. I'm guessing that if some of these updates go through, they should also be updated.

Something seems to have gone awry somewhere, because for example for the Uveans the iso code in the file Simon sent me separately matched correctly. I don't know why the glottocodes were not the same. Either way, you're right, let's wait for @SimonGreenhill's review of the situation.

I sometimes petition Glottolog to create dialect glottocodes when it seems warranted, I'm guessing that could be done for D-PLACE data when it's relevant as well.

kirbykat commented 3 years ago

Quick follow-up comment: the examples in Step 3 and Step 4 of my notes above show why dialect-level matching of tips to glotto IDs can be more accurate. But, the "UveaEast" example that started the thread does seem to be an error!

SimonGreenhill commented 3 years ago

ABVD and D-PLACE should match, with the highest precision. ABVD should be more up to date.

SimonGreenhill commented 3 years ago

Send me a list of the affected ones and I'll fix them.

HedvigS commented 3 years ago

Here's a list: taxa_matching.csv. The DIFF col tells you whether the glottocodes match between D-PLACE/phylogenies and lexibank/ABVD. I've not touched the society IDs whatsoever.
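
A comparison of this kind could be produced along these lines. This is a rough sketch, not the script behind taxa_matching.csv: the row keys (`taxon`, `dplace`, `abvd`, `match`) and the function name are assumptions, "TaxonX" with code "abcd1234" is a made-up unchanged entry, and the other glottocode pairs are the ones mentioned in this thread.

```python
# Glottocodes per taxon in the two sources. "TaxonX"/"abcd1234" is invented
# to show an unchanged row; the other pairs are quoted from this thread.
dplace = {
    "UveaEast": "west2516",
    "Babuyan": "ibat1238",
    "FijianBau": "bauu1243",
    "TaxonX": "abcd1234",
}
abvd = {
    "UveaEast": "wall1257",
    "Babuyan": "ivat1242",
    "FijianBau": "fiji1243",
    "TaxonX": "abcd1234",
}


def compare_glottocodes(dplace, abvd):
    """Return one row per taxon with both glottocodes and a match flag."""
    rows = []
    for taxon in sorted(dplace.keys() | abvd.keys()):
        old, new = dplace.get(taxon), abvd.get(taxon)
        rows.append(
            {"taxon": taxon, "dplace": old, "abvd": new, "match": old == new}
        )
    return rows


rows = compare_glottocodes(dplace, abvd)
changed = [row["taxon"] for row in rows if not row["match"]]
```

A flag like this only says that the two sources disagree. Whether a difference is a correction (as for UveaEast) or a loss of dialect-level specificity (as for FijianBau) still has to be judged case by case.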

You can also look at the DIFF log here.

Some of the changes make sense, others seem to be lexibank/ABVD preferring the language level glottocode. I agree with Kate that it'd be better to retain specificity when possible.

HedvigS commented 3 years ago

(You could also use something similar to pygrambank sourcelookup and check whether the references used for a particular language are indeed matched to that glottocode in Glottolog. That's beyond this, but that would be another thoroughness check.)

SimonGreenhill commented 3 years ago

Hmm. All but one of these is correct in the ABVD. I'm not sure how the mapping here is wrong. I'll close this and make a PR based on the data from the ABVD.

HedvigS commented 3 years ago

Thanks @SimonGreenhill !

HedvigS commented 3 years ago

Just to check, this issue/PR is not only closed, but the original matter is solved with this commit, right? I can go back to using the taxa file in the dir for the Gray et al. tree here at dplace-data, correct?

SimonGreenhill commented 3 years ago

Yes, but please check that I fixed everything.

HedvigS commented 3 years ago

I've gone through the new taxa file and compared it with the diff log in this PR.

I've noticed the following:

I did some spotchecks between the new taxa file here and the one at lexibank/ABVD. It all seemed good :)!

SimonGreenhill commented 3 years ago

Ahh, I left FutunaWest as uncoded as it was a duplicate list (and has now been replaced on ABVD with a better version). However, I'll tag it now anyway for completeness.

I'll update Malo, specificity is good, but will leave Baliledo until something better comes along.

HedvigS commented 3 years ago

@SimonGreenhill Thanks!