Open mtholder opened 9 years ago
Just a follow up example of a hard case. It is not a pretty state of affairs. OTT ID 572918 has the troublesome name: "Oerskovia sp. 7(2011)"
That node is a child of OTT ID 125746 in the taxonomy and the synthetic tree.
We have methods that return the subtree in newick for the taxonomy and the synthetic tree, but none of them returns that name correctly quoted in newick. taxonomy/subtree takes a label_format argument for how to label the tips.
Using "id" works - but does not return the name (obviously)
Using "original_name" returns illegal newick.
Using "name" returns a name in which (because it is quoted) the spaces have been changed to literal underscores
Using "name_and_id" returns a pair of tokens for the name: 'Oerskovia_sp._7(2011)'_ott572918 I think that this is illegal (definitely is in NEXUS, but I think that it is in newick, too)
Using tree_of_life/subtree returns a munged name: 'Oerskovia_sp_7_2011_ott572918' that is quoted, but has underscores and lacks the punctuation.
Details below:
original_name
curl -X POST http://devapi.opentreeoflife.org/v2/taxonomy/subtree -H 'Content-type:application/json' -d '{"ott_id":125746, "label_format":"original_name" }'
returns:
{
"subtree" : "(Oerskovia turbata,Oerskovia sp. MP7,Oerskovia sp. 3146-i3a2,Oerskovia sp. MP4d,Oerskovia sp. 3146-i3b,Oerskovia sp. MP6d,Oerskovia sp. Tibet-YD4604-7,Oerskovia sp. Tibet-YD4604-5,Oerskovia sp. Eab19,Oerskovia sp. Bra16,Oerskovia sp. Ms17,uncultured Oerskovia sp.,Oerskovia sp. Ms38,Oerskovia sp. Ms37,Oerskovia sp. YIM 100718,Oerskovia sp. CHP-ZH25,Oerskovia sp. K2011,Oerskovia sp. K2012,Oerskovia sp. YIM 100122,Oerskovia sp. SAUK 6039,Oerskovia enterophila,Oerskovia sp. Y1,Oerskovia sp. Lgg15.9,Oerskovia sp. L1911,Oerskovia sp. YIM 100566,Oerskovia paurometabola,Oerskovia sp. SAUK6041,Oerskovia sp. SAUK6045,Oerskovia sp. VTT E-073039,Oerskovia sp. YIM 48801,Oerskovia sp. KBS0722,Oerskovia sp. 7(2011),Oerskovia sp. 463-2,Oerskovia sp. R-32754,Cellulomonas sp. UFZ-B529,Oerskovia sp. CATR-180,Oerskovia sp. B19,Oerskovia sp. B18,Oerskovia sp. B6,Oerskovia sp. B28,Oerskovia ginkgo,Oerskovia sp. 27(2011),Oerskovia sp. 26(2011),(Oerskovia turbata NBRC 15015)Oerskovia turbata,Oerskovia sp. SAUK 6042,Oerskovia sp. YIM 69644,Oerskovia sp. SAUK6219,Oerskovia sp. SAUK6230,Oerskovia jenensis,Oerskovia sp. S10(2012),Oerskovia sp. LCB39,Oerskovia sp. I_Gauze_W_12_3,Oerskovia sp. IHB B 3473,Oerskovia sp. B17,Oerskovia sp. PG1-2/67,Oerskovia sp. R-45820)Oerskovia;"
}
which is an illegal newick, because names with punctuation are not quoted. I suppose that one could argue that this is the correct behavior for this argument, but the fact that some names have parentheses implies to me that we should not support this.
Using name:
curl -X POST http://devapi.opentreeoflife.org/v2/taxonomy/subtree -H 'Content-type:application/json' -d '{"ott_id":125746, "label_format":"name"}'
returns
{
"subtree" : "(Oerskovia_turbata,Oerskovia_sp._MP7,Oerskovia_sp._3146-i3a2,Oerskovia_sp._MP4d,Oerskovia_sp._3146-i3b,Oerskovia_sp._MP6d,Oerskovia_sp._Tibet-YD4604-7,Oerskovia_sp._Tibet-YD4604-5,Oerskovia_sp._Eab19,Oerskovia_sp._Bra16,Oerskovia_sp._Ms17,uncultured_Oerskovia_sp.,Oerskovia_sp._Ms38,Oerskovia_sp._Ms37,Oerskovia_sp._YIM_100718,Oerskovia_sp._CHP-ZH25,Oerskovia_sp._K2011,Oerskovia_sp._K2012,Oerskovia_sp._YIM_100122,Oerskovia_sp._SAUK_6039,Oerskovia_enterophila,Oerskovia_sp._Y1,Oerskovia_sp._Lgg15.9,Oerskovia_sp._L1911,Oerskovia_sp._YIM_100566,Oerskovia_paurometabola,Oerskovia_sp._SAUK6041,Oerskovia_sp._SAUK6045,Oerskovia_sp._VTT_E-073039,Oerskovia_sp._YIM_48801,Oerskovia_sp._KBS0722,'Oerskovia_sp._7(2011)',Oerskovia_sp._463-2,Oerskovia_sp._R-32754,Cellulomonas_sp._UFZ-B529,Oerskovia_sp._CATR-180,Oerskovia_sp._B19,Oerskovia_sp._B18,Oerskovia_sp._B6,Oerskovia_sp._B28,Oerskovia_ginkgo,'Oerskovia_sp._27(2011)','Oerskovia_sp._26(2011)',(Oerskovia_turbata_NBRC_15015)Oerskovia_turbata,Oerskovia_sp._SAUK_6042,Oerskovia_sp._YIM_69644,Oerskovia_sp._SAUK6219,Oerskovia_sp._SAUK6230,Oerskovia_jenensis,'Oerskovia_sp._S10(2012)',Oerskovia_sp._LCB39,Oerskovia_sp._I_Gauze_W_12_3,Oerskovia_sp._IHB_B_3473,Oerskovia_sp._B17,'Oerskovia_sp._PG1-2/67',Oerskovia_sp._R-45820)Oerskovia;"
}
which is legal, but has some names with a literal _ in them (because they are quoted) and others with _ being translated to spaces (because they are not quoted).
Using name_and_id
curl -X POST http://devapi.opentreeoflife.org/v2/taxonomy/subtree -H 'Content-type:application/json' -d '{"ott_id":125746, "label_format":"name_and_id"}'
returns
{
"subtree" : "(Oerskovia_turbata_ott5255224,Oerskovia_sp._MP7_ott5371302,Oerskovia_sp._3146-i3a2_ott5371301,Oerskovia_sp._MP4d_ott5371300,Oerskovia_sp._3146-i3b_ott5371299,Oerskovia_sp._MP6d_ott5371298,Oerskovia_sp._Tibet-YD4604-7_ott5371297,Oerskovia_sp._Tibet-YD4604-5_ott5371296,Oerskovia_sp._Eab19_ott5161638,Oerskovia_sp._Bra16_ott5161637,Oerskovia_sp._Ms17_ott5161636,uncultured_Oerskovia_sp._ott5161635,Oerskovia_sp._Ms38_ott5161633,Oerskovia_sp._Ms37_ott5161632,Oerskovia_sp._YIM_100718_ott1081896,Oerskovia_sp._CHP-ZH25_ott1007916,Oerskovia_sp._K2011_ott866688,Oerskovia_sp._K2012_ott866687,Oerskovia_sp._YIM_100122_ott864224,Oerskovia_sp._SAUK_6039_ott856992,Oerskovia_enterophila_ott816580,Oerskovia_sp._Y1_ott732905,Oerskovia_sp._Lgg15.9_ott784893,Oerskovia_sp._L1911_ott677213,Oerskovia_sp._YIM_100566_ott714355,Oerskovia_paurometabola_ott697480,Oerskovia_sp._SAUK6041_ott606294,Oerskovia_sp._SAUK6045_ott606297,Oerskovia_sp._VTT_E-073039_ott654863,Oerskovia_sp._YIM_48801_ott647002,Oerskovia_sp._KBS0722_ott565557,'Oerskovia_sp._7(2011)'_ott572918,Oerskovia_sp._463-2_ott485674,Oerskovia_sp._R-32754_ott432589,Cellulomonas_sp._UFZ-B529_ott369560,Oerskovia_sp._CATR-180_ott375337,Oerskovia_sp._B19_ott385056,Oerskovia_sp._B18_ott385057,Oerskovia_sp._B6_ott385043,Oerskovia_sp._B28_ott385055,Oerskovia_ginkgo_ott385196,'Oerskovia_sp._27(2011)'_ott351018,'Oerskovia_sp._26(2011)'_ott351017,(Oerskovia_turbata_NBRC_15015_ott4770673)Oerskovia_turbata_ott301645,Oerskovia_sp._SAUK_6042_ott282206,Oerskovia_sp._YIM_69644_ott224479,Oerskovia_sp._SAUK6219_ott190501,Oerskovia_sp._SAUK6230_ott190503,Oerskovia_jenensis_ott174409,'Oerskovia_sp._S10(2012)'_ott149450,Oerskovia_sp._LCB39_ott138899,Oerskovia_sp._I_Gauze_W_12_3_ott136674,Oerskovia_sp._IHB_B_3473_ott142547,Oerskovia_sp._B17_ott121144,'Oerskovia_sp._PG1-2/67'_ott106812,Oerskovia_sp._R-45820_ott87860)Oerskovia_ott125746;"
}
which is illegal (I think) because some labels are now multiple tokens. For example: 'Oerskovia_sp._7(2011)'_ott572918
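The two-token problem can be checked mechanically. Here is a minimal sketch in Python, assuming the usual Newick convention that a label is either one single-quoted string or one run of unquoted characters. `label_tokens` is a hypothetical helper written for this comment, not part of any Open Tree code.

```python
import re

# One token is either a single-quoted string (with '' as an escaped quote)
# or a run of characters that have no structural meaning in Newick.
_TOKEN = re.compile(r"'(?:[^']|'')*'|[^\s()\[\]':;,]+")


def label_tokens(text):
    """Split the text of one label position into Newick tokens."""
    return _TOKEN.findall(text)


# The name_and_id output from taxonomy/subtree splits into two tokens:
print(label_tokens("'Oerskovia_sp._7(2011)'_ott572918"))
# The tree_of_life/subtree output is a single quoted token:
print(label_tokens("'Oerskovia_sp_7_2011_ott572918'"))
```

A label position that yields more than one token is what a strict parser would be expected to reject.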
And using the tree_of_life service:
curl -X POST http://devapi.opentreeoflife.org/v2/tree_of_life/subtree -H 'Content-type:application/json' -d '{"ott_id":125746, "label_format":"name_and_id" }'
returns
{
"newick" : "((Oerskovia_turbata_NBRC_15015_ott4770673)Oerskovia_turbata_ott301645,Oerskovia_sp_MP7_ott5371302,Oerskovia_sp_B6_ott385043,'Oerskovia_sp_26_2011_ott351017',Oerskovia_sp_CHP-ZH25_ott1007916,Oerskovia_sp_CATR-180_ott375337,Oerskovia_sp_YIM_100122_ott864224,Oerskovia_sp_B18_ott385057,Oerskovia_sp_KBS0722_ott565557,Oerskovia_sp_SAUK6230_ott190503,Oerskovia_sp_3146-i3a2_ott5371301,Oerskovia_sp_YIM_69644_ott224479,Oerskovia_sp_K2012_ott866687,Oerskovia_sp_YIM_100566_ott714355,Oerskovia_sp_MP4d_ott5371300,Oerskovia_sp_Ms17_ott5161636,Oerskovia_sp_3146-i3b_ott5371299,'Oerskovia_sp_PG1-2_67_ott106812',Oerskovia_sp_MP6d_ott5371298,Oerskovia_sp_K2011_ott866688,Oerskovia_sp_B19_ott385056,Oerskovia_sp_L1911_ott677213,Oerskovia_sp_R-32754_ott432589,Oerskovia_sp_Ms38_ott5161633,'Oerskovia_sp_27_2011_ott351018',Oerskovia_sp_SAUK6045_ott606297,Oerskovia_sp_SAUK6219_ott190501,Oerskovia_sp_Ms37_ott5161632,Oerskovia_sp_R-45820_ott87860,Oerskovia_sp_Lgg15_9_ott784893,Oerskovia_sp_B17_ott121144,Oerskovia_sp_Tibet-YD4604-7_ott5371297,Oerskovia_sp_LCB39_ott138899,Oerskovia_sp_YIM_100718_ott1081896,Oerskovia_sp_Bra16_ott5161637,Oerskovia_sp_463-2_ott485674,Oerskovia_sp_Y1_ott732905,Oerskovia_sp_B28_ott385055,'Oerskovia_sp_7_2011_ott572918',Oerskovia_sp_SAUK6041_ott606294,Oerskovia_sp_SAUK_6039_ott856992,Oerskovia_sp_Tibet-YD4604-5_ott5371296,Oerskovia_sp_Eab19_ott5161638,Oerskovia_sp_VTT_E-073039_ott654863,'Oerskovia_sp_S10_2012_ott149450',Oerskovia_sp_SAUK_6042_ott282206,Oerskovia_sp_IHB_B_3473_ott142547,Oerskovia_sp_YIM_48801_ott647002,Oerskovia_turbata_ott5255224,Cellulomonas_sp_UFZ-B529_ott369560,Oerskovia_enterophila_ott816580,Oerskovia_ginkgo_ott385196,Oerskovia_jenensis_ott174409,Oerskovia_paurometabola_ott697480,Oerskovia_sp_I_Gauze_W_12_3_ott136674,uncultured_Oerskovia_sp_ott5161635)Oerskovia_ott125746;",
"tree_id" : "otol.draft.22"
}
which includes name munging to give a single quoted token: 'Oerskovia_sp_7_2011_ott572918'
Yes, we were going to include a proper newick writer in the jade OT-base classes, so we could correctly process newick names, and use the same code across treemachine and taxomachine. Joseph and I haven't wanted to mess with the newick names until that is done, and it hasn't been done yet...
It will be good to have this reference of the current issues when it comes time to write the corrected name writer.
(In reply to Mark T. Holder's post of Wednesday, December 10, 2014, quoted in full above.)
It looks like (for at least the draftversion2.tre newick) the decision about whether to quote is made before the substitution of _ for punctuation. So some labels like "Fibrobacteres/Acidobacteria group" get converted to 'Fibrobacteres_Acidobacteria_group' with single quotes (even though we don't quote tokens with _ as the only odd character in other contexts).
mtholder: for some reason, "/" got put in the "newick-illegal" list. I'll fix that.
This should probably be fixed, for the sake of invertibility... or is invertibility a lost cause?
invertibility is not a lost cause in terms of regenerating the real OTT name from the newick/nexus.
We can encode any string in newick or nexus.
There are often 2 legal syntaxes in those formats for any string (a quoted form and an unquoted form). One can't reliably predict which of the two forms will be used (without looking at the code). So you can't go: newick -> internal representation -> newick and guarantee the exact same form.
But you can go: any string -> newick -> any string exactly.
And you can go: newick -> internal representation -> equivalent newick representation
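To make the "any string -> newick -> any string" round trip concrete, here is a minimal sketch in Python. It assumes the common Newick conventions: an unquoted underscore reads back as a space, a quoted label is wrapped in single quotes, and an embedded single quote is doubled. The function names are hypothetical and this is not the code used by treemachine or taxomachine.

```python
def _forces_quotes(ch):
    # Structural characters, quotes and underscores cannot appear unquoted
    # without changing meaning; neither can whitespace other than a space.
    return ch in "()[]':;,_" or (ch.isspace() and ch != " ")


def to_newick_label(name):
    """Encode an arbitrary string as exactly one Newick label token."""
    if name == "" or any(_forces_quotes(ch) for ch in name):
        return "'" + name.replace("'", "''") + "'"
    # Only plain spaces need care here: use the unquoted underscore form.
    return name.replace(" ", "_")


def from_newick_label(token):
    """Decode one Newick label token back to the original string."""
    if len(token) >= 2 and token[0] == "'" and token[-1] == "'":
        return token[1:-1].replace("''", "'")
    return token.replace("_", " ")


for name in ["Oerskovia sp. 7(2011)", "Oerskovia turbata",
             "Oerskovia sp. I_Gauze_W_12_3", "Oerskovia sp. PG1-2/67"]:
    token = to_newick_label(name)
    print(name, "->", token)
    assert from_newick_label(token) == name
```

The writer here always picks one of the two legal forms, which is why the string survives exactly even though a different writer could legally emit the other form.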
Off the back of OpenTree v5 and the new naming scheme, I suggest that it might be useful if taxon names in the downloadable newick file did not contain braces and commas (and possibly not colons either). This makes it easy to parse the newick file using regular expressions and the like, without having to parse the actual tree structure, or parse the quoting of labels. That makes it a lot faster and less memory intensive to mess with the tree. Of course, this may make it impossible to maintain consistent labels between tree machine and taxomachine, so I can foresee objections.
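As an illustration of the kind of lightweight processing this would allow, here is a sketch in Python that pulls OTT ids out of a newick string with regular expressions alone. It assumes that no label contains parentheses, commas, colons or semicolons, and that labels end in `_ott<digits>`. The helper name `ott_ids` is made up for this example.

```python
import re

# A label is whatever follows "(", "," or ")" up to the next structural
# character. Branch lengths follow ":" and are therefore skipped.
_LABEL = re.compile(r"(?<=[(,)])[^(),:;]+")
_OTT_ID = re.compile(r"_ott(\d+)$")


def ott_ids(newick):
    """Return the OTT ids of all labelled nodes, in order of appearance."""
    ids = []
    for label in _LABEL.findall(newick):
        # Tolerate labels that were quoted only because of underscores.
        match = _OTT_ID.search(label.strip().strip("'"))
        if match:
            ids.append(int(match.group(1)))
    return ids


tree = ("((Oerskovia_turbata_NBRC_15015_ott4770673)Oerskovia_turbata_ott301645,"
        "Oerskovia_sp_MP7_ott5371302)Oerskovia_ott125746;")
print(ott_ids(tree))  # [4770673, 301645, 5371302, 125746]
```

With a name such as "Oerskovia sp. 7(2011)" left unmunged inside a quoted label, the same pattern would split the label at the parentheses, which is the motivation for the request.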
There's an issue for invertibility: https://github.com/OpenTreeOfLife/germinator/issues/76
OK @hyanwong I have posted a version of the tree with simplified names at http://phylo.bio.ku.edu/ot/opentree5.0_simplified_names.tre.gz and a log of the edits at http://phylo.bio.ku.edu/ot/munging_log.txt
We'll still need to work this step into the pipeline and figure out a standard name for this output. @bredelings and I both came up with tools to do this. His otc-relabel-tree ex_2_tree1.tre --replace "/[;[\]()]/ /" invocation (in the code repo in https://github.com/mtholder/otcetera ) is probably going to be the version that we end up using. But I made the tree above with https://github.com/mtholder/otcetera/blob/master/tools/mungenames.cpp
I should have mentioned that I replaced a few other characters that other users might want to avoid
@mtholder thanks. Yes, putting the step into the pipeline would be useful, and documenting it. @jar398 I guess it is the braces that are most likely to cause problems. An ugly solution to retain invertibility would be to e.g. replace () with <> (gt & lt signs only appear in 10 taxa on the tree) or {} (no current taxa contain curly braces). But that seems rather hacky.
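For what it is worth, the {} variant of that substitution could look like the following Python sketch. It is only invertible under the assumption stated above, that no taxon name already contains curly braces, so the sketch refuses any name that does. The function names are hypothetical.

```python
_HIDE = str.maketrans("()", "{}")
_RESTORE = str.maketrans("{}", "()")


def hide_parens(name):
    """Replace ( and ) with { and } so the name has no Newick parentheses."""
    if "{" in name or "}" in name:
        # The mapping would no longer be reversible for this name.
        raise ValueError("name already contains curly braces: %r" % name)
    return name.translate(_HIDE)


def restore_parens(label):
    """Invert hide_parens."""
    return label.translate(_RESTORE)


munged = hide_parens("Oerskovia sp. 7(2011)")
print(munged)                  # Oerskovia sp. 7{2011}
print(restore_parens(munged))  # Oerskovia sp. 7(2011)
```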
In http://phylo.bio.ku.edu/ot/munging_log.txt there are 510 names that contain commas: these are nearly all where the taxon name contains the authority or year of description. I don't know if this is something that you want to be included in the taxon name or not.
I'd still vote for not munging the names, but if we do continue this we should explain it to users.
I think that the relevant code is: https://github.com/OpenTreeOfLife/ot-base/blob/master/src/main/java/org/opentree/utils/GeneralUtils.java
I think that the explanation now is that we create "TAXONNAME_ottOTTID" as the label, then use normal newick escaping rules except:
A. all colons are converted to _
B. all spaces go to _ before making the quoting decision.
C. _ characters are ignored in the quoting decision.
But I'm not sure if the getNewick in https://github.com/OpenTreeOfLife/treemachine/blob/master/src/main/java/jade/tree/JadeNode.java then does some character replacement
Email thread: https://groups.google.com/forum/#!topic/opentreeoflife/4_5DYH5deS0