Open AlexaBennett opened 3 years ago
Hi @AlexaBennett ,
Currently, the scripts populates all ranks regardless of presence in TaxID lineage
By default, it does. But there is a rank propagation option: to disable, use --p-no-rank-propagation
with the CLI or rank_propagation=False
with the API. For example:
>>> import qiime2 as q2, pandas as pd
>>> from qiime2.plugins import rescript as rs
>>> ids = q2.Metadata(pd.DataFrame(['KR045484.1'], index=pd.Index(['KR045484.1'], name='id'), columns=['col']))
>>> s1, t1, = rs.actions.get_ncbi_data(accession_ids=ids)
>>> t1.view(pd.Series)[0]
'k__Bacteria; p__Bacteria; c__Bacteria; o__Bacteria; f__Bacteria; g__uncultured; s__bacterium'
>>> s2, t2, = rs.actions.get_ncbi_data(accession_ids=ids, rank_propagation=False)
>>> t2.view(pd.Series)[0]
'k__Environmental samples; p__; c__; o__; f__; g__uncultured; s__bacterium'
>>> s3, t3, = rs.actions.get_ncbi_data(accession_ids=ids, rank_propagation=False, ranks=['superkingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'])
>>> t3.view(pd.Series)[0]
'sk__Bacteria; p__; c__; o__; f__; g__uncultured; s__bacterium'
The taxonomy does not perfectly match your desired taxonomy because disabling rank propagation yields the raw NCBI taxonomy (which in this case has no kingdom annotation as in example 2, and contains annotations guncultured; sbacterium), but as shown in the third example you can pick and choose your ranks.
Does that meet your need?
Additionally, would it be possible to allow for specification of accession number target range? E.x. 'KR045484.1:3-200'. Or, to specify only to retrieve the taxonomy?
I really like this idea, and it was on our radar at some point. @BenKaehler can you enlighten us as to whether there is a workaround for this and/or what it would take to implement?
specify only to retrieve the taxonomy?
Also a great idea. I don't think there would be an issue with adding an option to ignore the sequences, and e.g., get-silva-data
has an equivalent option for skipping the sequences.
We are very open to contributions if you are interested in tackling either of these features @AlexaBennett 😉
The taxonomy does not perfectly match your desired taxonomy because disabling rank propagation yields the raw NCBI taxonomy (which in this case has no kingdom annotation as in example 2, and contains annotations guncultured; sbacterium), but as shown in the third example you can pick and choose your ranks.
Does that meet your need?
I can make it work. I might have to use superkingdom as in your last example. If I am not mistaken 'superkingdom' is the same as 'domain'? Therefore, it is reasonable to use the later in a prokaryotic study without loss of information. 🤔
We are very open to contributions if you are interested in tackling either of these features @AlexaBennett 😉
Sure! I have an idea of tackle them, but I am a only a 3rd year graduate student coming from a Microbiology background. Would there be anyone I may contact if I have questions throughout the process?
If I am not mistaken 'superkingdom' is the same as 'domain'? Therefore, it is reasonable to use the later in a prokaryotic study without loss of information
correct and that sounds reasonable, but I cannot guarantee that this is uniform across the NCBI taxonomy... e.g., that annotations for superkingdom are consistently used across entries, so it would be worth checking after you get the data whether any superkingdom annotations are missing if that's a problem for your application.
Sure! I have an idea of tackle them, but I am a only a 3rd year graduate student coming from a Microbiology background. Would there be anyone I may contact if I have questions throughout the process?
Great! Yes, feel free to ask any questions here that are specific to RESCRIPt/this issue, and for more general development questions you can ask on teh QIIME 2 forum.
To get you started:
get-silva-data
to get started... the mechanism for skipping would be quite different within the method, but "on the surface" we should make the API align as much as possible.Keep us updated and let us know if you run into problems. Thanks!
Hi @AlexaBennett, very sorry for the slow response.
Unless I've missed something, it is fairly straightforward to download a range of accessions. I didn't understand your example, but this one appears to work as expected:
qiime rescript get-ncbi-data \
--p-query 'KR045200:KR045483[accn]' \
--o-sequences range-seqs.qza \
--o-taxonomy range-tax.qza \
--p-no-rank-propagation \
--verbose
The issue of avoiding downloading the sequences when downloading the taxonomies is a bit trickier.
At the moment the script downloads the sequences and the taxids for the given accessions in the same download.
It is probably possible to download the metadata around the accession ids without downloading the sequences, but I would have to play around with the download formats.
@nbokulich, I'm not sure that this is a good beginner issue. The unit tests on their own would very challenging for a beginner coder.
@AlexaBennett, how badly do you need to be able to download taxa without sequences? As far as I can see the only challenge might be performance, given that you could delete the sequences after you download them. Not saying that it wouldn't be a nice feature to have, just that there is an obvious workaround until we get the opportunity to implement it.
it is fairly straightforward to download a range of accessions
Great news, thanks for the example @BenKaehler !
But I think @AlexaBennett 's request is to trim to a specific position from within a genome. Is this already possible @BenKaehler ?
Oh, gotcha, that is covered by the "Unless I've missed something".
Adding that functionality would be a reasonable beginner issue.
But I think @AlexaBennett 's request is to trim to a specific position from within a genome. Is this already possible @BenKaehler ?
Yes, I am interested in the ability to trim to 1+ positions within the genome.
how badly do you need to be able to download taxa without sequences? As far as I can see the only challenge might be performance, given that you could delete the sequences after you download them. Not saying that it wouldn't be a nice feature to have, just that there is an obvious workaround until we get the opportunity to implement it.
This is the current workaround I am utilizing in the QIIME environment. However, my primary dilemma is a product of having to strip the positions from accession numbers. The resulting taxa file contains only accessions ids and taxa values. It has thus dereplicated accession numbers from organisms with multiple copy numbers. My current method to handle this is the following:
It is probably possible to download the metadata around the accession ids without downloading the sequences, but I would have to play around with the download formats.
The script I am currently developing extracts the TaxId from the eSummary (https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi). That bypasses having to download the sequences. @BenKaehler, could that work in the current script?
Request to update the get-ncbi-data to format the taxon table as it would appear in SILVA. Currently, the scripts populates all ranks regardless of presence in TaxID lineage.
Example of actual:
Example of desired:
Additionally, would it be possible to allow for specification of accession number target range? E.x. 'KR045484.1:3-200'. Or, to specify only to retrieve the taxonomy? This utility would be ideal for those of us with a curated list of sequences from a HMM. It is common to retrieve sequences from genomic assemblies. It is also possible to have multiple copies with genetic variation between the genomic regions.