JTFouquier / ghost-tree

creating hybrid-gene phylogenetic trees for diversity analyses
BSD 3-Clause "New" or "Revised" License

improved documentation of which reference fasta files to use for which ghost-trees #46

Closed by gregcaporaso 9 years ago

gregcaporaso commented 9 years ago

@JTFouquier, I'm working on this with @johnchase and @kschwarzberg. We still have a couple of questions about this.

gregcaporaso commented 9 years ago

@JTFouquier, we have another question. It looks like the tip names in the tree don't directly map to the sequence identifiers in UNITE. For example:

In [29]: from skbio.tree import TreeNode

In [30]: t = TreeNode.from_file('./ghosttree_UNITEv6_30.12.2014S_dynamic_100clusters_052515.nwk')

In [31]: tip_names = list([e.name for e in t.tips()])

In [32]: from skbio.parse.sequences import load

In [33]: s = load('/Users/caporaso/temp/sh_qiime_release_10.09.2014/sh_refs_qiime_ver6_dynamic_10.09.2014.fasta')

In [34]: seq_ids = [e['SequenceID'] for e in s]

In [35]: tip_names[:10]
Out[35]:
['SH448199.06FU UDB013352 reps singleton',
 'SH236252.06FU GU233364 reps',
 'SH434395.06FU GU233362 reps singleton',
 'SH190649.06FU GU174296 reps',
 'SH436635.06FU GU233358 reps singleton',
 'SH451958.06FU GU233328 refs singleton',
 'SH000180.06FU DQ486684 reps singleton',
 'SH000181.06FU DQ486685 reps singleton',
 'SH239792.06FU GU233327 refs',
 'SH012068.06FU DQ486683 reps singleton']

In [36]: seq_ids[:10]
Out[36]:
['SH189775.06FU_JQ347180_reps',
 'SH189776.06FU_U59145_refs',
 'SH189777.06FU_AM084756_reps',
 'SH189778.06FU_FM172814_reps',
 'SH189779.06FU_FN539058_reps',
 'SH189780.06FU_AB481260_refs',
 'SH189781.06FU_HQ211694_reps',
 'SH189782.06FU_JF937581_reps',
 'SH189783.06FU_AB745431_reps',
 'SH189784.06FU_JF937586_reps']

Are we looking at this correctly? How were you doing the mapping for the analyses in the paper? Ideally the ghost-tree tip names will map directly to UNITE ids, so users don't have to modify any files.

UPDATE:

We see that the tip names are actually correct in the newick file, so we're trying to figure out how to keep the parser from turning underscores into spaces.

In [37]: !head -c 500 ./ghosttree_UNITEv6_30.12.2014S_dynamic_100clusters_052515.nwk
((SH448199.06FU_UDB013352_reps_singleton:0.360825,(SH236252.06FU_GU233364_reps:0.14678,((SH434395.06FU_GU233362_reps_singleton:0.20729,(SH190649.06FU_GU174296_reps:0.04808,SH436635.06FU_GU233358_reps_singleton:0.0753)0.994:0.09625)0.643:0.04101,((SH451958.06FU_GU233328_refs_singleton:0.04435,(SH000180.06FU_DQ486684_reps_singleton:0.02609,SH000181.06FU_DQ486685_reps_singleton:0.0138)0.999:0.05579)1.000:0.12039,(SH239792.06FU_GU233327_refs:0.11442,SH012068.06FU_DQ486683_reps_singleton:0.09292)0.99
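
The behavior described in the update matches the usual Newick convention that an underscore in an unquoted label stands for a space. The sketch below illustrates that rule with a tiny regex-based label extractor and shows a workaround that restores the underscores after parsing. The function `tip_labels` and the three-tip tree `toy` are hypothetical and written only for this illustration; this is not the scikit-bio parser. Newer scikit-bio releases document a `convert_underscores` option on the Newick reader, but whether the `TreeNode.from_file` call used above accepts it is an assumption that would need checking against the installed version.

```python
import re


def tip_labels(newick, convert_underscores=True):
    """Return tip labels from a simple, unquoted Newick string.

    Hypothetical helper for illustration only. It does not handle
    quoted labels, comments, or whitespace.
    """
    # A tip label directly follows "(" or "," and runs up to the next
    # structural character. Internal node labels follow ")" and are skipped.
    labels = re.findall(r"[(,]([^(),:;]+)", newick)
    if convert_underscores:
        # Newick convention: unquoted underscores are read as spaces.
        labels = [label.replace("_", " ") for label in labels]
    return labels


# Toy tree assembled from tip names shown in this thread (topology made up).
toy = ("((SH190649.06FU_GU174296_reps:0.04808,"
       "SH436635.06FU_GU233358_reps_singleton:0.0753)0.994:0.09625,"
       "SH236252.06FU_GU233364_reps:0.14678);")

converted = tip_labels(toy)                         # what the session shows
raw = tip_labels(toy, convert_underscores=False)    # what UNITE ids look like

# Workaround after parsing: put the underscores back. This is only safe
# when the original ids contain no literal spaces.
restored = [name.replace(" ", "_") for name in converted]

print(converted[0])
print(restored[0])
```

Restoring underscores after the fact works here because the UNITE identifiers shown above contain no spaces; disabling the conversion in the parser, if the installed version supports it, would avoid the round trip.
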

JTFouquier commented 9 years ago

@gregcaporaso, @johnchase and Karen, thank you for looking into this. :) I can't look at the PR in detail right now, but with regard to the 80% and 100% clusters, this makes sense to me: the 80% tree should have more tips. When you make larger groups, as in the 80% clustering, each group will most likely have a consensus "identified" genus, whereas with the 100% clusters you will discard a lot of the unidentifieds. This is why we decided to do the OTU reclustering step in the first place.
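
The reasoning above can be sketched with a toy model. Everything here is an assumption made for illustration: the sequence names, the genus labels, the cluster memberships, and the rule that a cluster is kept when at least one member has an identified genus (a simplification of a consensus). It is not ghost-tree's implementation.

```python
# Hypothetical genus labels; None marks a sequence with no identified genus.
genus_of = {
    "seq1": "Mucor",
    "seq2": None,
    "seq3": "Mucor",
    "seq4": None,
    "seq5": "Candida",
    "seq6": None,
}


def tips_kept(clusters):
    """Return the sequences that survive as tips under the toy rule."""
    kept = []
    for members in clusters:
        # Toy rule: a cluster is kept only if some member supplies a genus.
        if any(genus_of[m] is not None for m in members):
            kept.extend(members)
    return kept


# 100% clustering: every sequence is its own cluster.
clusters_100 = [[seq] for seq in genus_of]

# 80% clustering: looser threshold, so unidentified sequences can share
# a cluster with identified ones (memberships are made up).
clusters_80 = [["seq1", "seq2", "seq3"], ["seq4", "seq5"], ["seq6"]]

kept_100 = tips_kept(clusters_100)
kept_80 = tips_kept(clusters_80)

print(len(kept_100), len(kept_80))
```

In this toy case the singleton clusters lose every unidentified sequence, while the looser clusters carry most of them along with an identified neighbor, which is the pattern described above.
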

With regard to your update about the spaces and underscores, is this my parser or skbio's parser? The naming convention has changed in UNITE and I didn't catch this behavior. Thank you!