improved documentation of which reference fasta files to use for which ghost-trees

gregcaporaso commented 9 years ago

@JTFouquier, I'm working on this with @johnchase and @kschwarzberg. We still have a couple of questions about this.

[ ] Can you confirm that the notes that we added are correct?

[ ] When counting the tips in the different ghost-trees, we notice that there are more tips in your 80% re-clustered tree than in your 100% re-clustered tree:

In [8]: tree97_80 = TreeNode.read("ghosttree_UNITEv6_30.12.2014S_97_80clusters_052515.nwk")

In [9]: tree97_100 = TreeNode.read("ghosttree_UNITEv6_30.12.2014S_97_100clusters_052515.nwk")

In [10]: tree97_80
Out[10]: <TreeNode, name: unnamed, internal node count: 29497, tips count: 29414>

In [11]: tree97_100
Out[11]: <TreeNode, name: unnamed, internal node count: 21204, tips count: 21020>

Are we interpreting that correctly, and if so, is there a problem with these files? (There should be more 100% OTUs than 80% OTUs.)

[ ] Since there are many more reference sequences than tips in the ghost-tree, we probably need to filter the reference sequences before we pick OTUs against them. Do you agree with that? Did you do that for the analyses in the paper?
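
A minimal sketch of the filtering step proposed above, assuming the tip names have already been collected from the tree and that the reference file is plain FASTA. `read_fasta` and `filter_to_tips` are hypothetical helpers, not part of ghost-tree or scikit-bio, and the records below are made up for illustration.

```python
def read_fasta(lines):
    """Yield (sequence_id, sequence) pairs from an iterable of FASTA lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # Assumption: the identifier is the header text up to the first whitespace.
            seq_id, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def filter_to_tips(records, tip_names):
    """Keep only the records whose identifier is also a tip name in the tree."""
    tips = set(tip_names)
    return [(seq_id, seq) for seq_id, seq in records if seq_id in tips]


# Made-up records standing in for the UNITE reference file.
fasta_lines = [
    ">SH000001.06FU_AA000001_reps",
    "ACGTACGT",
    "TTGA",
    ">SH000002.06FU_AA000002_refs",
    "GGGCCC",
    ">SH000003.06FU_AA000003_reps",
    "ATATAT",
]
# Made-up tip names standing in for [e.name for e in t.tips()].
tip_names = ["SH000001.06FU_AA000001_reps", "SH000003.06FU_AA000003_reps"]

kept = filter_to_tips(read_fasta(fasta_lines), tip_names)
for seq_id, seq in kept:
    print(">%s\n%s" % (seq_id, seq))
```

The filtered records could then be written out as the reference file for OTU picking, so that every reference sequence has a matching tip.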

gregcaporaso commented 9 years ago

@JTFouquier, we have another question. It looks like the tip names in the tree don't directly map to the sequence identifiers in UNITE. For example:

In [29]: from skbio.tree import TreeNode

In [30]: t = TreeNode.from_file('./ghosttree_UNITEv6_30.12.2014S_dynamic_100clusters_052515.nwk')

In [31]: tip_names = list([e.name for e in t.tips()])

In [32]: from skbio.parse.sequences import load

In [33]: s = load('/Users/caporaso/temp/sh_qiime_release_10.09.2014/sh_refs_qiime_ver6_dynamic_10.09.2014.fasta')

In [34]: seq_ids = [e['SequenceID'] for e in s]

In [35]: tip_names[:10]
Out[35]:
['SH448199.06FU UDB013352 reps singleton',
 'SH236252.06FU GU233364 reps',
 'SH434395.06FU GU233362 reps singleton',
 'SH190649.06FU GU174296 reps',
 'SH436635.06FU GU233358 reps singleton',
 'SH451958.06FU GU233328 refs singleton',
 'SH000180.06FU DQ486684 reps singleton',
 'SH000181.06FU DQ486685 reps singleton',
 'SH239792.06FU GU233327 refs',
 'SH012068.06FU DQ486683 reps singleton']

In [36]: seq_ids[:10]
Out[36]:
['SH189775.06FU_JQ347180_reps',
 'SH189776.06FU_U59145_refs',
 'SH189777.06FU_AM084756_reps',
 'SH189778.06FU_FM172814_reps',
 'SH189779.06FU_FN539058_reps',
 'SH189780.06FU_AB481260_refs',
 'SH189781.06FU_HQ211694_reps',
 'SH189782.06FU_JF937581_reps',
 'SH189783.06FU_AB745431_reps',
 'SH189784.06FU_JF937586_reps']

Are we looking at this correctly? How were you doing the mapping for the analyses in the paper? Ideally, the ghost-tree tip names would map directly to UNITE IDs so that users don't have to modify any files.

UPDATE:

We see that the tip names are actually correct in the Newick file, so we're trying to work out how to stop the parser from turning underscores into spaces.


In [37]: !head -c 500 ./ghosttree_UNITEv6_30.12.2014S_dynamic_100clusters_052515.nwk
((SH448199.06FU_UDB013352_reps_singleton:0.360825,(SH236252.06FU_GU233364_reps:0.14678,((SH434395.06FU_GU233362_reps_singleton:0.20729,(SH190649.06FU_GU174296_reps:0.04808,SH436635.06FU_GU233358_reps_singleton:0.0753)0.994:0.09625)0.643:0.04101,((SH451958.06FU_GU233328_refs_singleton:0.04435,(SH000180.06FU_DQ486684_reps_singleton:0.02609,SH000181.06FU_DQ486685_reps_singleton:0.0138)0.999:0.05579)1.000:0.12039,(SH239792.06FU_GU233327_refs:0.11442,SH012068.06FU_DQ486683_reps_singleton:0.09292)0.99
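
The Newick format treats an underscore in an unquoted label as a stand-in for a space, which is why the parsed tip names contain spaces even though the file contains underscores. scikit-bio's Newick reader documents a `convert_underscores` option for `TreeNode.read`; whether it is available depends on the installed version, so that is something to check rather than rely on. As a version-independent workaround, here is a small sketch that restores the underscores after parsing. `restore_underscores` is a hypothetical helper, the tip names are copied from the output above, and the `seq_ids` set is a stand-in rather than the real UNITE identifiers.

```python
def restore_underscores(tip_names):
    """Undo the Newick underscore-to-space conversion on parsed tip names.

    Assumption: no genuine identifier contains a literal space.
    """
    return [name.replace(" ", "_") for name in tip_names]


# Two tip names as they came back from the parser in the session above.
parsed_tips = [
    "SH448199.06FU UDB013352 reps singleton",
    "SH236252.06FU GU233364 reps",
]
fixed = restore_underscores(parsed_tips)
print(fixed)

# Stand-in for the identifiers loaded from the reference file (not real data).
seq_ids = {"SH236252.06FU_GU233364_reps"}

# Listing the tips that still have no match is a quick consistency check.
unmatched = [name for name in fixed if name not in seq_ids]
print(unmatched)
```

Running the same check against the full lists of tip names and sequence identifiers would show whether any mismatches remain once the underscores are restored.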

JTFouquier commented 9 years ago

@gregcaporaso, @johnchase and Karen, thank you guys for looking into this. :) I can't look into the PR in detail right now, but regarding the 80% and 100% clusters, this makes sense to me: the 80% tree should have more tips in the final tree. When you make larger groups, as in the 80% clustering, each group will most likely have a consensus "identified" genus, whereas with the 100% clusters you will discard a lot of the unidentified sequences. This is why we decided to do the OTU re-clustering step in the first place.
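
A toy illustration of the reasoning above, not ghost-tree's actual code. It assumes that each cluster takes the most common genus among its identified members, that clusters with no identified member are discarded, and that every sequence in a retained cluster becomes a tip. The sequence names, genera, and cluster groupings are invented.

```python
from collections import Counter

# Invented sequences; None marks a sequence with no identified genus.
genus = {"s1": "Candida", "s2": None, "s3": None, "s4": "Mucor", "s5": None}


def retained_tips(clusters, genus):
    """Map each retained sequence to the consensus genus of its cluster."""
    kept = {}
    for members in clusters:
        identified = [genus[m] for m in members if genus[m] is not None]
        if not identified:
            # No identified member: the whole cluster is discarded.
            continue
        consensus = Counter(identified).most_common(1)[0][0]
        for m in members:
            kept[m] = consensus
    return kept


# At 100% identity every sequence is its own cluster in this toy example.
clusters_100 = [["s1"], ["s2"], ["s3"], ["s4"], ["s5"]]
# At 80% identity, unidentified sequences share a cluster with identified ones.
clusters_80 = [["s1", "s2", "s3"], ["s4", "s5"]]

tips_100 = retained_tips(clusters_100, genus)
tips_80 = retained_tips(clusters_80, genus)
print(len(tips_100), len(tips_80))
```

Under these assumptions the looser clustering keeps all five sequences, while the strict clustering keeps only the two that carry a genus on their own.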

With regard to your update about the spaces and underscores, is this my parser or skbio's parser? The naming convention has changed in UNITE and I didn't catch this behavior. Thank you!

JTFouquier / ghost-tree

improved documentation of which reference fasta files to use for which ghost-trees #46