gregcaporaso closed 9 years ago
@JTFouquier, we have another question. It looks like the tip names in the tree don't directly map to the sequence identifiers in UNITE. For example:
In [29]: from skbio.tree import TreeNode
In [30]: t = TreeNode.from_file('./ghosttree_UNITEv6_30.12.2014S_dynamic_100clusters_052515.nwk')
In [31]: tip_names = list([e.name for e in t.tips()])
In [32]: from skbio.parse.sequences import load
In [33]: s = load('/Users/caporaso/temp/sh_qiime_release_10.09.2014/sh_refs_qiime_ver6_dynamic_10.09.2014.fasta')
In [34]: seq_ids = [e['SequenceID'] for e in s]
In [35]: tip_names[:10]
Out[35]:
['SH448199.06FU UDB013352 reps singleton',
'SH236252.06FU GU233364 reps',
'SH434395.06FU GU233362 reps singleton',
'SH190649.06FU GU174296 reps',
'SH436635.06FU GU233358 reps singleton',
'SH451958.06FU GU233328 refs singleton',
'SH000180.06FU DQ486684 reps singleton',
'SH000181.06FU DQ486685 reps singleton',
'SH239792.06FU GU233327 refs',
'SH012068.06FU DQ486683 reps singleton']
In [36]: seq_ids[:10]
Out[36]:
['SH189775.06FU_JQ347180_reps',
'SH189776.06FU_U59145_refs',
'SH189777.06FU_AM084756_reps',
'SH189778.06FU_FM172814_reps',
'SH189779.06FU_FN539058_reps',
'SH189780.06FU_AB481260_refs',
'SH189781.06FU_HQ211694_reps',
'SH189782.06FU_JF937581_reps',
'SH189783.06FU_AB745431_reps',
'SH189784.06FU_JF937586_reps']
Are we looking at this correctly? How were you doing the mapping for the analyses in the paper? Ideally the ghost-tree tip names will map directly to UNITE ids, so users don't have to modify any files.
UPDATE:
We see that the names are actually correct in the newick file, so we're trying to figure out how to keep the parser from turning underscores into spaces.
In [37]: !head -c 500 ./ghosttree_UNITEv6_30.12.2014S_dynamic_100clusters_052515.nwk
((SH448199.06FU_UDB013352_reps_singleton:0.360825,(SH236252.06FU_GU233364_reps:0.14678,((SH434395.06FU_GU233362_reps_singleton:0.20729,(SH190649.06FU_GU174296_reps:0.04808,SH436635.06FU_GU233358_reps_singleton:0.0753)0.994:0.09625)0.643:0.04101,((SH451958.06FU_GU233328_refs_singleton:0.04435,(SH000180.06FU_DQ486684_reps_singleton:0.02609,SH000181.06FU_DQ486685_reps_singleton:0.0138)0.999:0.05579)1.000:0.12039,(SH239792.06FU_GU233327_refs:0.11442,SH012068.06FU_DQ486683_reps_singleton:0.09292)0.99
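For anyone hitting the same mismatch: in the Newick format, underscores in unquoted labels conventionally stand for spaces, which is why the parsed tip names differ from the text in the file. Below is a minimal workaround sketch, assuming the only difference between a parsed tip name and its UNITE id is that substitution. The helper names (`restore_underscores`, `match_tips_to_ids`) are hypothetical, and the `seq_ids` list is made up for illustration rather than read from the UNITE FASTA. Later scikit-bio releases document a `convert_underscores` option for the Newick reader, which would avoid the conversion at parse time; whether your installed version has it is something to check.

```python
def restore_underscores(tip_name):
    """Undo the space-for-underscore substitution applied to unquoted Newick labels."""
    return tip_name.replace(' ', '_')


def match_tips_to_ids(tip_names, seq_ids):
    """Split restored tip names into those found in seq_ids and those not found."""
    known = set(seq_ids)
    restored = [restore_underscores(name) for name in tip_names]
    matched = [name for name in restored if name in known]
    unmatched = [name for name in restored if name not in known]
    return matched, unmatched


# Tip names as the parser returned them in the session above.
tip_names = [
    'SH448199.06FU UDB013352 reps singleton',
    'SH236252.06FU GU233364 reps',
]
# Hypothetical id list for illustration only (not taken from the FASTA file).
seq_ids = [
    'SH236252.06FU_GU233364_reps',
    'SH189775.06FU_JQ347180_reps',
]

matched, unmatched = match_tips_to_ids(tip_names, seq_ids)
print(len(matched), 'matched;', len(unmatched), 'unmatched')
```

Inspecting the `unmatched` list is a quick way to see whether any tips still fail to map once the underscores are restored.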
@gregcaporaso, @johnchase and Karen, thank you for looking into this. :) I can't look at the PR in detail right now, but regarding the 80% and 100% clusters, the result makes sense to me: the 80% tree should have more tips. Larger groups, as in the 80% clustering, will most likely yield a consensus "identified" genus, whereas with the 100% clusters many of the unidentified sequences are discarded. That was why we decided to do the OTU reclustering step in the first place.
Regarding your update about the spaces and underscores: is this my parser or skbio's parser? The naming convention has changed in UNITE and I didn't catch this behavior. Thank you!
@JTFouquier, I'm working on this with @johnchase and @kschwarzberg. We still have a couple of questions about this.
[ ] When counting the tips in the different ghost-trees, we notice that there are more tips in your 80% re-clustered tree than in your 100% re-clustered tree:
Are we interpreting that correctly, and if so, is there a problem with these files? (There should be more 100% OTUs than 80% OTUs.)
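A rough way to reproduce the tip counts without a tree library is sketched below. It assumes unquoted tip labels, as in the newick excerpt above, and would miscount quoted labels that contain commas or parentheses; it also does not handle a tree that consists of a single tip. The function name `tip_labels` and the two toy trees are hypothetical. Iterating over `t.tips()` from the skbio session is the more robust route.

```python
import re

# A tip label directly follows '(' or ','. Internal-node labels (for example
# support values) follow ')', so this pattern does not pick them up.
_TIP = re.compile(r"[(,]\s*([^\s(),:;][^(),:;]*)")


def tip_labels(newick):
    """Return the unquoted tip labels found in a Newick string."""
    return [label.strip() for label in _TIP.findall(newick)]


# Toy trees for illustration only (not the ghost-tree files).
tree_a = "((A_1:0.1,B_2:0.2)0.99:0.3,C_3:0.4);"
tree_b = "(A_1:0.1,B_2:0.2);"

for name, newick in [("tree_a", tree_a), ("tree_b", tree_b)]:
    print(name, len(tip_labels(newick)))
```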