Temporarily add Genbank ids to SILVA and NCBI taxonomy inputs

jar398 commented 9 years ago

For the SILVA/NCBI alignment we can use Genbank ids to determine taxon equations based on Genbank id membership (in particular the subset of Genbank ids that occur in SILVA clusters). (See #139.) To do this the SILVA and Genbank inputs will need to be augmented with Genbank ids. Downstream the ids will need to be pruned so they don't look like taxa in OTT.

pmidford commented 9 years ago

I have a tool to generate a SILVA taxonomy file augmented with clusters at the tips. Two issues: first, this is not the SILVA input file the current process_silva tool uses (it uses the FASTA file without taxonomy). Second, the SILVA taxonomy specifies lineages as sequences of names (sort of like IF, but SILVA doesn't provide any ids for these names). So a bunch of TNRS calls need to happen somewhere. Is this something I should worry about, or can smasher do something with a lineage given as a string of semicolon-delimited names?

jar398 commented 9 years ago

Why not just use the existing SILVA code for turning the lineages into a tree? It works fine. If it's hard to read I can go over it and add some comments/documentation. In fact you should be able to use the whole script; just tear out anything that has to do with NCBI.

pmidford commented 9 years ago

The issue isn't making the tree; we need identifiers for the internal nodes. The obvious options are to just use the name strings or to generate something. What are smasher's requirements for internal node ids? I recall you mentioned they don't need to be numerical. How important is stability in this situation?

jar398 commented 9 years ago

The existing code makes up the names. Look at tax/silva/taxonomy.tsv

jar398 commented 9 years ago

The version of process_silva.py on the 'membership' branch (commit e745d28) fixed the ncbi/genbank file processing code. Also simplified a few other things.

I'd rather not lose the processSilva() double loop; it's rather finely honed. Just the part after "parentid = taxid #for next iteration" needs to be replaced. It should be very simple: just emit a row for the particular cluster id ('accession') with taxid as the parent. There's a little bit of complication when there are multiple clusters per accession, because we're trimming off the within-sequence positions, so it may be necessary to keep a table to prevent duplicates. These can be flushed entirely, or all but one cluster can be emitted, but it won't work if more than one cluster is emitted with a single accession id. Not a big deal since there are only about 10 of these.
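A minimal sketch of the replacement step described above, not the actual code in process_silva.py. The function names `trim_positions` and `emit_cluster_rows`, the dotted `accession.start.stop` id layout, and the row format are all assumptions made for illustration. It takes the "emit all but one" option: the first cluster seen for a trimmed accession is emitted and later ones are set aside.

```python
def trim_positions(cluster_id):
    """Drop the within-sequence positions: 'G12345.1000.1500' -> 'G12345'.

    Assumes the accession is everything before the first dot.
    """
    return cluster_id.split('.', 1)[0]


def emit_cluster_rows(records):
    """Return (rows, skipped) for an iterable of (cluster_id, parent_taxid).

    rows holds one (accession, parent_taxid) pair per trimmed accession.
    skipped holds the cluster ids dropped because their trimmed accession
    had already been emitted.
    """
    seen = set()      # the "table to prevent duplicates"
    rows = []
    skipped = []
    for cluster_id, parent_taxid in records:
        accession = trim_positions(cluster_id)
        if accession in seen:
            skipped.append(cluster_id)
            continue
        seen.add(accession)
        rows.append((accession, parent_taxid))
    return rows, skipped
```

Returning the skipped ids as well makes it easy to check whether the number of duplicates is near the expected ten or so.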

The higher taxa internal ids look like accession/#n, where n is 1 for domain and increases with greater depth in the tree. These generate usable URLs in the SILVA taxonomy browser when prefixed properly. The accession number used for a higher taxon is the lexicographically earliest one encountered according to the sort order used by the script (shorter Genbank ids first, then alphabetic/numerical). But that shouldn't affect anything you're working on.
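To make the id scheme concrete, here is a small sketch under stated assumptions. `accession_sort_key` and `higher_taxon_id` are hypothetical helper names, and treating "alphabetic/numerical" as plain string comparison is a guess; the real script may order ties differently.

```python
def accession_sort_key(accession):
    """Shorter ids sort first; ties are broken by plain string comparison."""
    return (len(accession), accession)


def higher_taxon_id(accessions, depth):
    """Build an id of the form accession/#n for a higher taxon.

    accessions: ids of the sequences placed under the taxon.
    depth: 1 for a domain, increasing with depth in the tree.
    """
    earliest = min(accessions, key=accession_sort_key)
    return "%s/#%d" % (earliest, depth)
```

For example, `higher_taxon_id(["AB001438", "X1234", "A00001"], 1)` picks `X1234` because it is the shortest id, giving `X1234/#1`.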

pmidford commented 9 years ago

OK, I think I've figured out that I was trying to do something rather more complicated than you were asking for. I've flushed all that, but after playing with things for a while, I still can't reproduce the multiple clusters per accession in the way you described (I get thousands or none, but not 10), so I may be misunderstanding this as well. I'll hold off on the pull request until the duplication is cleared up. I've updated the header comments to cover what the script does and doesn't do. The commit is process_silva_clusters.py on the branch of a similar (no suffix) name.

jar398 commented 9 years ago

I can't find an example, but the situation I was talking about was where a single Genbank accession/sequence, which in the cases I saw happened to be for a whole genome, contains multiple SSU sequences, and the SSU sequences are placed in two (could be more, but I only saw two) clusters. So cluster 1 might have reference sequence G12345.1000.1500 and cluster 2 might have reference sequence G12345.9000.9500. (Or the two SSU sequences might just be in two different clusters and not be reference sequences.) The problem is that we're dropping the positions, so both of these look like G12345 to us. I can't find details in my notes, but I remember there being about ten cases like this among the reference sequences. Maybe if you start looking at all the sequences and not just the reference sequences there would be lots more; I don't know, since we didn't have that information at the time.
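One way to check how many such cases exist is to group sequence ids by their trimmed accession and keep the accessions that land in more than one cluster. This is only a sketch: `colliding_accessions` is a hypothetical name, the (cluster name, sequence id) input shape is an assumption, and the accession is assumed to be everything before the first dot.

```python
from collections import defaultdict


def colliding_accessions(cluster_members):
    """Map each trimmed accession to the clusters it appears in, keeping
    only accessions that occur in more than one cluster.

    cluster_members: iterable of (cluster_name, sequence_id) pairs, where
    sequence_id looks like 'G12345.1000.1500'.
    """
    clusters_by_accession = defaultdict(set)
    for cluster_name, sequence_id in cluster_members:
        accession = sequence_id.split('.', 1)[0]
        clusters_by_accession[accession].add(cluster_name)
    return {acc: sorted(names)
            for acc, names in clusters_by_accession.items()
            if len(names) > 1}
```

Running this once over the reference sequences only and once over all cluster members would show whether the count is closer to ten or to thousands.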

OpenTreeOfLife / reference-taxonomy

Temporarily add Genbank ids to SILVA and NCBI taxonomy inputs #140