bokulich-lab / RESCRIPt

REference Sequence annotation and CuRatIon Pipeline
BSD 3-Clause "New" or "Revised" License

Sequence identifier warnings from `get-ncbi-data`. #158

Open mikerobeson opened 1 year ago

mikerobeson commented 1 year ago

A user initially reported this issue when running the following command:

qiime rescript get-ncbi-data   \
    --p-query "txid4751[ORGN] AND (ITS1 OR ITS2 OR its1 OR its2) NOT environmental sample[Filter] NOT environmental samples[Filter] NOT environmental[Title] NOT uncultured[Title] NOT unclassified[Title] NOT unidentified[Title] NOT unverified[Title]" \
    --p-ranks kingdom phylum class order family genus species \
    --p-rank-propagation \
    --p-n-jobs 4 \
    --o-sequences ITS-ref-seqs-ng.qza \
    --o-taxonomy ITS-ref-tax-ng.qza \
    --verbose

This resulted in the following warnings:

WARNING:2023-05-10 08:31:04,095:MainProcess:Using pdb|8E5T|3 as a sequence identifier, because it did not come down with an accession version.
WARNING:2023-05-10 08:31:04,096:MainProcess:Using pdb|7V08|6 as a sequence identifier, because it did not come down with an accession version.
WARNING:2023-05-10 08:31:04,096:MainProcess:Using pdb|7UQZ|6 as a sequence identifier, because it did not come down with an accession version.
WARNING:2023-05-10 08:31:04,096:MainProcess:Using pdb|7UQB|6 as a sequence identifier, because it did not come down with an accession version.
...
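
To gauge how many records are affected, the identifiers can be pulled out of the verbose output. The following is a minimal sketch, assuming the warning text has exactly the format shown above. The function names `extract_fallback_ids` and `count_by_prefix` are hypothetical helpers, not part of RESCRIPt.

```python
import re
from collections import Counter

# Matches the warning text shown above; the identifier is the token after "Using".
# Assumption: the message wording is stable across RESCRIPt versions.
WARNING_RE = re.compile(
    r"Using (?P<seq_id>\S+) as a sequence identifier, "
    r"because it did not come down with an accession version"
)


def extract_fallback_ids(log_text):
    """Return the identifiers named in the fallback-identifier warnings, in order."""
    return [m.group("seq_id") for m in WARNING_RE.finditer(log_text)]


def count_by_prefix(seq_ids):
    """Count identifiers by the tag before the first '|' (for example 'pdb')."""
    return Counter(seq_id.split("|", 1)[0] for seq_id in seq_ids)
```

Grouping by prefix shows whether every fallback identifier is a `pdb|...` one or whether other identifier types are affected as well.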

I was able to reproduce the issue. After exporting the resulting FASTA file, I observed sequences with headers like those shown above. I also manually ran BLAST on a few of these sequences, and they did appear to contain ITS sequences, though I have not tested this thoroughly. I am not sure why pdb identifiers are used when the returned data might actually contain the requested ITS DNA sequences.
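
For anyone who wants to inspect the affected records, here is a rough sketch that scans an exported FASTA file for such headers. It assumes the three-field `pdb|ENTRY|CHAIN` layout seen in the warnings above. The function name `find_pdb_records` and the file path in the usage comment are hypothetical.

```python
def find_pdb_records(fasta_lines):
    """Yield (entry, chain, header) for FASTA headers whose ID looks like pdb|ENTRY|CHAIN."""
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        header = line[1:].strip()
        if not header:
            continue  # empty header, nothing to parse
        seq_id = header.split()[0]  # the ID is the first whitespace-delimited token
        parts = seq_id.split("|")
        if len(parts) == 3 and parts[0] == "pdb":
            yield parts[1], parts[2], header


# Usage (the path is an assumption; point it at wherever the FASTA was exported):
# with open("exported/dna-sequences.fasta") as fh:
#     for entry, chain, header in find_pdb_records(fh):
#         print(entry, chain)
```

The extracted entry codes (for example `8E5T`) can then be looked up manually to see which structure record each sequence came from.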

The warning message comes specifically from these lines in ncbi.py.

This is probably not a true bug, but the pdb identifiers make it difficult to trace back the origin of these data.