cazzlewazzle89 / GROND

A quality-checked and publicly available database of full-length 16S-ITS-23S rRNA operon sequences

Error with feature-classifier extract-reads: "not a QIIME archive" #4

Open mmanus opened 5 days ago

mmanus commented 5 days ago

Hi there,

I am trying to use the GROND database for an analysis with 16S long read sequences. I am attempting to train the classifier starting with this code:

qiime feature-classifier extract-reads \
    --i-sequences /DIRECTORYPATH/gtdb207nr_nrRep_B.fna \
    --p-f-primer AGRGTTTGATYHTGGCTCAG \
    --p-r-primer CCRAMCTGTCTCACGACG \
    --p-min-length 1000 \
    --o-reads /DIRECTORYPATH/reads_pairC.qza

However, I continue to receive the error "Invalid value for '--i-sequences': /DIRECTORYPATH/gtdb207nr_nrRep_B.fna is not a QIIME archive."

In reading other posts about potential file issues, I have re-downloaded and unzipped the gtdb207nr_nrRep_B.fna.gz multiple times using multiple methods. Unfortunately this hasn't resolved the problem. I'm also unsure if the input file can be .fna or if it needs to be .qza. Do you have any advice about this? Thanks in advance!

cazzlewazzle89 commented 5 days ago

Hi @mmanus

I think the issue is that you first need to turn the database sequences into a QIIME 2 artifact (.qza); extract-reads will not accept the raw .fna file as input.

The command needed will be something like this:

qiime tools import \
    --type 'FeatureData[Sequence]' \
    --input-path gtdb207nr_nrRep_B.fna \
    --output-path gtdb207nr_nrRep_B.qza

Then you can use your command above with that as input:

qiime feature-classifier extract-reads \
    --i-sequences gtdb207nr_nrRep_B.qza \
    --p-f-primer AGRGTTTGATYHTGGCTCAG \
    --p-r-primer CCRAMCTGTCTCACGACG \
    --p-min-length 1000 \
    --o-reads /DIRECTORYPATH/reads_pairC.qza

Give that a whirl and let me know how you get on. I'm planning to do an update of GROND in the next few weeks; I have been slacking and not keeping up to date with GTDB.

I will leave this issue open to serve as a reminder to ping you when I release that update.

Calum

mmanus commented 4 days ago

Thanks, Calum. I should have mentioned that I tried to convert the .fna to .qza but kept getting an error. I reproduced the error using the code that you suggested:

There was a problem importing gtdb207nr_nrRep_B.fna: gtdb207nr_nrRep_B.fna is not a(n) DNAFASTAFormat file: Invalid character 'c' at position 14 on line 6705 (does not match IUPAC characters for this sequence type). Allowed characters are ACGTRYKMSWBDHVN.
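The message reports only the first offending position (line 6705), and other lines may contain lower-case characters too. A minimal sketch for listing every such sequence line, assuming an uncompressed FASTA file; `find_lowercase_lines` is a hypothetical helper name used here for illustration and is not part of QIIME 2:

```shell
# Hypothetical helper (not a QIIME 2 command): print "line_number: content"
# for every non-header line that changes when upper-cased, i.e. every
# sequence line containing at least one lower-case character.
find_lowercase_lines() {
    awk '!/^>/ && $0 != toupper($0) { print NR ": " $0 }' "$1"
}

# Example usage (assumes the file is in the current directory):
# find_lowercase_lines gtdb207nr_nrRep_B.fna | head
```

Comparing each line with its upper-cased form avoids relying on locale-dependent character ranges such as `[a-z]`.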

After reading a bit online, I added the --input-format option to deal with the upper/lower case issue. This resolved my problem and allowed me to proceed with converting to .qza and then re-training.

qiime tools import --type 'FeatureData[Sequence]' --input-format 'MixedCaseDNAFASTAFormat' --input-path gtdb207nr_nrRep_B.fna --output-path gtdb207nr_nrRep_B.qza

Please do let me know about any updates to GROND! Thanks!

cazzlewazzle89 commented 4 days ago

Thanks @mmanus for sorting - no idea why the FASTA is mixed case, I will fix that in the next update. Thanks for flagging.
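Until the FASTA is fixed upstream, an alternative to importing with MixedCaseDNAFASTAFormat is to upper-case the sequence lines before import. This is a minimal sketch, assuming an uncompressed FASTA file; `uppercase_fasta` and the output file name are hypothetical and chosen only for illustration:

```shell
# Hypothetical helper (not a QIIME 2 command): upper-case every sequence
# line and leave header lines (those starting with ">") untouched.
uppercase_fasta() {
    awk '/^>/ { print; next } { print toupper($0) }' "$1"
}

# Example usage (file names are assumptions):
# uppercase_fasta gtdb207nr_nrRep_B.fna > gtdb207nr_nrRep_B.upper.fna
```

The resulting file should then import with the plain `qiime tools import --type 'FeatureData[Sequence]'` command shown earlier in the thread, without the --input-format option, provided mixed case was the only problem.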