Request for update to Taxon Table

AlexaBennett commented 3 years ago

Request to update the get-ncbi-data to format the taxon table as it would appear in SILVA. Currently, the scripts populates all ranks regardless of presence in TaxID lineage.

Example of actual:

Feature ID Taxon KR045484.1 kBacteria; pBacteria; cBacteria; oBacteria; fBacteria; guncultured; s__bacterium

Example of desired:

Feature ID Taxon KR045484.1 kBacteria; p; c; o; f; g; s__

Additionally, would it be possible to allow for specification of accession number target range? E.x. 'KR045484.1:3-200'. Or, to specify only to retrieve the taxonomy? This utility would be ideal for those of us with a curated list of sequences from a HMM. It is common to retrieve sequences from genomic assemblies. It is also possible to have multiple copies with genetic variation between the genomic regions.

nbokulich commented 3 years ago

Hi @AlexaBennett ,

Currently, the scripts populates all ranks regardless of presence in TaxID lineage

By default, it does. But there is a rank propagation option: to disable, use --p-no-rank-propagation with the CLI or rank_propagation=False with the API. For example:

>>> import qiime2 as q2, pandas as pd
>>> from qiime2.plugins import rescript as rs
>>> ids = q2.Metadata(pd.DataFrame(['KR045484.1'], index=pd.Index(['KR045484.1'], name='id'), columns=['col']))
>>> s1, t1, = rs.actions.get_ncbi_data(accession_ids=ids)
>>> t1.view(pd.Series)[0]
'k__Bacteria; p__Bacteria; c__Bacteria; o__Bacteria; f__Bacteria; g__uncultured; s__bacterium'
>>> s2, t2, = rs.actions.get_ncbi_data(accession_ids=ids, rank_propagation=False)
>>> t2.view(pd.Series)[0]
'k__Environmental samples; p__; c__; o__; f__; g__uncultured; s__bacterium'
>>> s3, t3, = rs.actions.get_ncbi_data(accession_ids=ids, rank_propagation=False, ranks=['superkingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'])
>>> t3.view(pd.Series)[0]
'sk__Bacteria; p__; c__; o__; f__; g__uncultured; s__bacterium'

The taxonomy does not perfectly match your desired taxonomy because disabling rank propagation yields the raw NCBI taxonomy (which in this case has no kingdom annotation as in example 2, and contains annotations guncultured; sbacterium), but as shown in the third example you can pick and choose your ranks.

Does that meet your need?

Additionally, would it be possible to allow for specification of accession number target range? E.x. 'KR045484.1:3-200'. Or, to specify only to retrieve the taxonomy?

I really like this idea, and it was on our radar at some point. @BenKaehler can you enlighten us as to whether there is a workaround for this and/or what it would take to implement?

specify only to retrieve the taxonomy?

Also a great idea. I don't think there would be an issue with adding an option to ignore the sequences, and e.g., get-silva-data has an equivalent option for skipping the sequences.

We are very open to contributions if you are interested in tackling either of these features @AlexaBennett 😉

AlexaBennett commented 3 years ago

The taxonomy does not perfectly match your desired taxonomy because disabling rank propagation yields the raw NCBI taxonomy (which in this case has no kingdom annotation as in example 2, and contains annotations guncultured; sbacterium), but as shown in the third example you can pick and choose your ranks.

Does that meet your need?

I can make it work. I might have to use superkingdom as in your last example. If I am not mistaken 'superkingdom' is the same as 'domain'? Therefore, it is reasonable to use the later in a prokaryotic study without loss of information. 🤔

We are very open to contributions if you are interested in tackling either of these features @AlexaBennett 😉

Sure! I have an idea of tackle them, but I am a only a 3rd year graduate student coming from a Microbiology background. Would there be anyone I may contact if I have questions throughout the process?

nbokulich commented 3 years ago

If I am not mistaken 'superkingdom' is the same as 'domain'? Therefore, it is reasonable to use the later in a prokaryotic study without loss of information

correct and that sounds reasonable, but I cannot guarantee that this is uniform across the NCBI taxonomy... e.g., that annotations for superkingdom are consistently used across entries, so it would be worth checking after you get the data whether any superkingdom annotations are missing if that's a problem for your application.

Sure! I have an idea of tackle them, but I am a only a 3rd year graduate student coming from a Microbiology background. Would there be anyone I may contact if I have questions throughout the process?

Great! Yes, feel free to ask any questions here that are specific to RESCRIPt/this issue, and for more general development questions you can ask on teh QIIME 2 forum.

To get you started:

This will be a useful resource for starting with QIIME 2 development: https://dev.qiime2.org/latest/
optional trimming based on position will only be available when passing in a list of IDs for explicit download, so the length ranges could be added as a column in that metadata file. To make this more generalizable, it might be worth making this a separate method for trimming sequences based on seq ID... then it would be possible to, e.g., get seqs using a query, then trim based on specific IDs.
For skipping download of sequences, take a look at get-silva-data to get started... the mechanism for skipping would be quite different within the method, but "on the surface" we should make the API align as much as possible.

Keep us updated and let us know if you run into problems. Thanks!

BenKaehler commented 3 years ago

Hi @AlexaBennett, very sorry for the slow response.

Unless I've missed something, it is fairly straightforward to download a range of accessions. I didn't understand your example, but this one appears to work as expected:

qiime rescript get-ncbi-data \
  --p-query 'KR045200:KR045483[accn]' \
  --o-sequences range-seqs.qza \
  --o-taxonomy range-tax.qza \
  --p-no-rank-propagation \
  --verbose

The issue of avoiding downloading the sequences when downloading the taxonomies is a bit trickier.

At the moment the script downloads the sequences and the taxids for the given accessions in the same download.

It is probably possible to download the metadata around the accession ids without downloading the sequences, but I would have to play around with the download formats.

@nbokulich, I'm not sure that this is a good beginner issue. The unit tests on their own would very challenging for a beginner coder.

@AlexaBennett, how badly do you need to be able to download taxa without sequences? As far as I can see the only challenge might be performance, given that you could delete the sequences after you download them. Not saying that it wouldn't be a nice feature to have, just that there is an obvious workaround until we get the opportunity to implement it.

nbokulich commented 3 years ago

it is fairly straightforward to download a range of accessions

Great news, thanks for the example @BenKaehler !

But I think @AlexaBennett 's request is to trim to a specific position from within a genome. Is this already possible @BenKaehler ?

BenKaehler commented 3 years ago

Oh, gotcha, that is covered by the "Unless I've missed something".

Adding that functionality would be a reasonable beginner issue.

AlexaBennett commented 3 years ago

But I think @AlexaBennett 's request is to trim to a specific position from within a genome. Is this already possible @BenKaehler ?

Yes, I am interested in the ability to trim to 1+ positions within the genome.

how badly do you need to be able to download taxa without sequences? As far as I can see the only challenge might be performance, given that you could delete the sequences after you download them. Not saying that it wouldn't be a nice feature to have, just that there is an obvious workaround until we get the opportunity to implement it.

This is the current workaround I am utilizing in the QIIME environment. However, my primary dilemma is a product of having to strip the positions from accession numbers. The resulting taxa file contains only accessions ids and taxa values. It has thus dereplicated accession numbers from organisms with multiple copy numbers. My current method to handle this is the following:

Extract the taxa file form QIIME
Use python to create a dictionary {accessionId: taxa}
Loop over original FASTA file utilizing regex to match the accession ids
Print the comment line from the FASTA with the corresponding taxa value
Import richer taxa file into QIIME

It is probably possible to download the metadata around the accession ids without downloading the sequences, but I would have to play around with the download formats.

The script I am currently developing extracts the TaxId from the eSummary (https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi). That bypasses having to download the sequences. @BenKaehler, could that work in the current script?

bokulich-lab / RESCRIPt

Request for update to Taxon Table #93