clld / glottolog3

glottolog2 re-implemented as CLLD app

Re: node and tip names in tree_glottolog_newick.txt #144

Closed by HedvigS 2 years ago

HedvigS commented 2 years ago

On the website, under the Download tab, there's a Newick file with all the Glottolog trees. I needed to check something quickly, so I grabbed it, extracted one of the trees, and was about to try something out. However, I ran into trouble in R with the way ape::read.tree() handles the node and tip labels.

> trees <- ape::read.tree(file = "tree_glottolog_newick.txt")
> mayan <- trees[204][[1]]
> 
> mayan$tip.label
 [1] "Aguacateco-l-"              "Ixil-l-"                    "Mam-l-"                     "Tektiteko-l-"               "Kaqchikel-l-"              
 [6] "Tzutujil-l-"                "Achi-l-"                    "Kiche-l-"                   "Sacapulteco-l-"             "Sipacapense-l-"            
[11] "Kekchí-l-"                  "Poqomam-l-"                 "Poqomchi-l-"                "Uspanteco-l-"               "Chol-l-"                   
[16] "BuenaVistaChontal"          "MiramarChontal"             "TamultédelasSábanasChontal" "Cholti-l-"                  "Chortí-l-"                 
[21] "EpigraphicMayan-l-"         "ChanalCancuc"               "Tenango"                    "Tzotzil-l-"                 "Chuj-l-"                   
[26] "Tojolabal-l-"               "Akateko-l-"                 "Popti-l-"                   "Qanjobal-l-"                "Motozintleco"              
[31] "Tuzanteco"                  "Itzá-l-"                    "MopánMaya-l-"               "Lacanjá"                    "Najá"                      
[36] "YucatecMaya-l-"             "Chicomuceltec-l-"           "Huastec-l-"                
> 

For full clarity, the way this part of the newick file looks is:

((((('Aguacateco [agua1252][agu]-l-':1,'Ixil [ixil1251][ixl]-l-':1)'Ixilan [ixil1250]':1,('Mam [mamm1241][mam]-l-':1,'Tektiteko [tekt1235][ttc]-l-':1)'Mamean [mame1240]':1)'Greater Mamean [grea1277]':1,((('Kaqchikel [kaqc1270][cak]-l-':1,'Tz''utujil [tzut1248][tzj]-l-':1)'Cakchiquel-Tzutujil [cakc1244]':1,('Achi [achi1256][acr]-l-':1,'K''iche'' [kich1262][quc]-l-':1)'Quiche-Achi [quic1275]':1,'Sacapulteco [saca1238][quv]-l-':1,'Sipacapense [sipa1247][qum]-l-':1)'Core Quichean [core1251]':1,'Kekchí [kekc1242][kek]-l-':1,('Poqomam [poqo1253][poc]-l-':1,'Poqomchi'' [poqo1254][poh]-l-':1)'Poqom [poco1241]':1,'Uspanteco [uspa1245][usp]-l-':1)'Greater Quichean [grea1276]':1)'Quichean-Mamean [quic1274]':1,(((('Chol [chol1282][ctu]-l-':1,('Buena Vista Chontal [buen1245]':1,'Miramar Chontal [mira1253]':1,'Tamulté de las Sábanas Chontal [tamu1247]':1)'Tabasco Chontal [taba1266][chf]-l-':1)'Chol-Chontal [chol1281]':1,('Cholti [chol1283]-l-':1,'Chortí [chor1273][caa]-l-':1)'Chorti-Cholti [chor1272]':1,'Epigraphic Mayan [epig1241][emy]-l-':1)'Cholan [chol1287]':1,(('Chanal Cancuc [chan1320]':1,'Tenango [tena1239]':1)'Tzeltal [tzel1254][tzh]-l-':1,'Tzotzil [tzot1259][tzo]-l-':1)'Tzeltalan [tzel1253]':1)'Cholan-Tzeltalan [chol1286]':1,(('Chuj [chuj1250][cac]-l-':1,'Tojolabal [tojo1241][toj]-l-':1)'Chujean [chuj1249]':1,(('Akateko [west2635][knj]-l-':1,'Popti'' [popt1235][jac]-l-':1,'Q''anjob''al [qanj1241][kjb]-l-':1)'Kanjobal-Jacaltec [kanj1263]':1,('Motozintleco [moto1243]':1,'Tuzanteco [tuza1238]':1)'Mocho [moch1257][mhc]-l-':1)'Kanjobalan [kanj1262]':1)'Kanjobalan-Chujean [kanj1261]':1)'Western Mayan [west2865]':1,(('Itzá [itza1241][itz]-l-':1,'Mopán Maya [mopa1243][mop]-l-':1)'Mopan-Itza [mopa1242]':1,(('Lacanjá [laca1244]':1,'Najá [naja1242]':1)'Lacandon [laca1243][lac]-l-':1,'Yucatec Maya [yuca1254][yua]-l-':1)'Yucatec-Lacandon [yuca1253]':1)'Yucatecan [yuca1252]':1)'Core Mayan [core1254]':1,('Chicomuceltec [chic1271][cob]-l-':1,'Huastec [huas1242][hus]-l-':1)'Huastecan 
Mayan [huas1241]':1)'Mayan [maya1287]':1;

So, something is going awry here with spaces and I'm digging into it. This isn't really your problem, but I was wondering if there is any chance that in future this file could just contain the plain glottocode and nothing else for nodes and tips (no spaces, brackets etc)?

phytools::read.newick() is doing better, so I'll probably be able to solve my immediate problem. However, I was thinking that this may trip others up, so to make it easier, could the file just have the bare glottocodes?
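
A possible workaround in the meantime, sketched in Python: rewrite every quoted label to its bare glottocode before handing the string to a Newick parser. This assumes each label is single-quoted, escapes apostrophes by doubling them, and carries the glottocode as a bracketed group of four alphanumerics followed by four digits. The function name `bare_glottocodes` is hypothetical, not part of any Glottolog tooling.

```python
import re

# Assumption: labels are single-quoted; '' inside a label is an escaped apostrophe.
LABEL = re.compile(r"'((?:[^']|'')*)'")
# Assumption: a glottocode is four alphanumerics plus four digits, in brackets.
GLOTTOCODE = re.compile(r"\[([a-z0-9]{4}\d{4})\]")


def bare_glottocodes(newick):
    """Replace every quoted label with the glottocode it contains."""
    def repl(match):
        code = GLOTTOCODE.search(match.group(1))
        # Labels without a recognisable glottocode are left untouched.
        return code.group(1) if code else match.group(0)
    return LABEL.sub(repl, newick)


# A small fragment taken from the Mayan tree quoted above.
snippet = ("('Kaqchikel [kaqc1270][cak]-l-':1,"
           "'Tz''utujil [tzut1248][tzj]-l-':1)"
           "'Cakchiquel-Tzutujil [cakc1244]':1;")
print(bare_glottocodes(snippet))
# (kaqc1270:1,tzut1248:1)cakc1244:1;
```

The rewritten string has no spaces, brackets, or quotes left in the labels, so it should be less parser-dependent than the original.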

xrotwang commented 2 years ago

I don't think this file will change in the future. If anything, it will disappear, because the newick trees are available from https://github.com/glottolog/glottolog-cldf as well (with Newick containing only the Glottocode as languoid label). So the file here is only kept for backwards compatibility and changing it in any way would defeat the purpose.

HedvigS commented 2 years ago

Ok, fair enough.

Do you mean the content in "subclassification" in the values table? Sure, I can go looking there instead. I get it with the backwards compatibility.

By the way, are the parameters getting a Description in the parameters table?

xrotwang commented 2 years ago

Well, there is one, AFAICT, see link above https://github.com/glottolog/glottolog-cldf/blob/master/cldf/parameters.csv#L5

SimonGreenhill commented 2 years ago

(it's not the spaces that are the issue, it's that [.*?] are seen as comments by APE's parser)
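
To illustrate the point, here is a rough Python imitation (not ape's actual code) of what appears to happen to a single label: bracketed spans are dropped as Newick comments and, judging from the tip labels printed earlier in the thread, whitespace and quote characters are dropped as well. The function name `label_as_read` is hypothetical.

```python
import re


def label_as_read(quoted_label):
    """Imitate the observed result for one label (an assumption, not ape's code)."""
    no_comments = re.sub(r"\[.*?\]", "", quoted_label)  # [...] treated as a comment
    no_space = re.sub(r"\s+", "", no_comments)          # whitespace removed
    return no_space.replace("'", "")                    # quote characters removed


for label in ("'Aguacateco [agua1252][agu]-l-'",
              "'K''iche'' [kich1262][quc]-l-'",
              "'Buena Vista Chontal [buen1245]'"):
    print(label_as_read(label))
# Aguacateco-l-
# Kiche-l-
# BuenaVistaChontal
```

These match the corresponding entries in the mayan$tip.label output, which is consistent with the brackets, rather than the spaces alone, being what removes the glottocodes.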

HedvigS commented 2 years ago

> Well, there is one, AFAICT, see link above https://github.com/glottolog/glottolog-cldf/blob/master/cldf/parameters.csv#L5

Huh, funny.. in my clone that I just pulled that table looks like this:

| ID | Name | Description | type |
| --- | --- | --- | --- |
| level | Level | | categorical |
| category | Category | | categorical |
| classification | Classification | | |
| subclassification | Subclassification | | |
| aes | Agglomerated Endangerment Status | | sequential |
| med | Most Extensive Description | | sequential |

I'll go do some git investigations.

HedvigS commented 2 years ago

> (it's not the spaces that are the issue, it's that [.*?] are seen as comments by APE's parser)

Oh good to know!!

HedvigS commented 2 years ago

I figured it out: I was using a clone of a fork of glottolog-cldf. I do that sometimes in case there are changes I want to try out. The files at glottolog/glottolog-cldf are, of course, as Robert said above.

Okay, all good!