splaisan commented 5 years ago

Hi Simone,

I read more papers and found that there is a database (fasta) of full rrn loci that would be very interesting to classify my long amplicon PCR that include ITS and 23S and your tool

Could you please help me creating the two required artefacts from the fasta and annotations I fetched from: https://github.com/alfbenpa/rrn_db (last two files in the repo)

In particular, I would need help to create the taxonomy file, probably using code similar to your Import_database.sh

here the top of the annotation file which shows GB ID which hopefully will comply

Mycobacterium_africanum operon100_1 KK338483.1  4940
Staphylococcus_aureus   operon100_2 JBLJ01000007.1  4939
Staphylococcus_aureus   operon100_3 KI978154.1  4939
Staphylococcus_aureus   operon100_4 JBQV01000010.1  4943
Staphylococcus_aureus   operon100_5 KI975524.1  4806
Staphylococcus_aureus   operon100_6 KI974560.1  4834
....

typical fasta header which I guess needs to change to include the GB ID

>operon100_1
AGAGTTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGGAAAGGTCTCTTCGGAGAT
ACTCGAGTGGCGAACGGGTGAGTAACACGTGGGTGATCTGCCCTGCACTTCGGGATAAGCCTGGGAAACTGGGTCTAATA
...

If you prefer to redirect me to a better place to learn this, no problem.

MaestSi commented 5 years ago

Hi, I just wrote a small script that should to the job. rrn2ncbi.R.txt You can run it with:

source activate MetONTIIME_env
Rscript rrn2ncbi.R.txt operons.100.fa species_annotation operons.100_renamed.fa

After that you can run the Import_database.sh script on the operons.100_renamed.fa file. Let me know if it works! Simone

splaisan commented 5 years ago

fasta header conversion worked like a charm, thanks I am now running the Import script with minot mods, in particular an error when creating the conda env and not finding gcc in (conda create -n entrez_qiime_env python=2.7 gcc). I fixed it by creating the env without gcc then add with (conda install -c anaconda gcc) I think that this was my early issue few weeks ago but I overlooked the screen and did not see the message then. I open a separate ticket for that when I will have succeeded to run Import_database.sh completely. Thanks Simone

MaestSi commented 5 years ago

Hi Stephane, strange, I never saw that issue; by the way, I think that if you cd to MetONTIIME directory and run the Import_database.sh you don't need to recreate the conda environment and download the taxonomy files again.

splaisan commented 5 years ago

I found that a significant number of GB acc are absent in the nucl_gb.accession2taxid but present in the nucl_wgs.accession2taxid dump. because each dump misses quite many I made a merge of both an built on this

count of GB acc not matched

operons.100_gb.log:8012 operons.100_wgs.log:3755 operons.100_merged.log:283

The merge only misses 283 which is probably nicer, I will post the full process in a git page, including your R script if you do not mind.

I am now running MetONTIIME with the rrnDB artefacts and a slight change in the code.

#echo -e "\nqiime feature-classifier classify-consensus-blast"
#time qiime feature-classifier classify-consensus-blast \
#  --i-query rep-seqs.qza  \
#  --i-reference-reads $DB \
#  --i-reference-taxonomy $TAXONOMY \
#  --p-perc-identity 0.77 \
#  --p-query-cov 0.3 \
#  --p-maxaccepts 1 \
#  --o-classification taxonomy.qza"
#echo "# ${cmd}"
#eval $(time ${cmd})

echo -e "\nqiime feature-classifier classify-consensus-vsearch (24 threads)"
time qiime feature-classifier classify-consensus-vsearch \
  --i-query rep-seqs.qza  \
  --i-reference-reads $DB \
  --i-reference-taxonomy $TAXONOMY \
  --p-perc-identity 0.77 \
  --p-query-cov 0.8 \
  --p-maxaccepts 1 \
  --p-strand 'both' \
  --p-min-consensus 0.51 \
  --p-unassignable-label 'Unassigned' \
  --p-threads 24 \
  --o-classification taxonomy.qza

Did you evaluate replacing the veeeeeery slow blast (>20h for last run) with vsearch at that point?

Are my parameters OK (the added ones are default values to maybe play later with them)?

In particular, the --p-query-cov from your original code seems low since we have long reads that should cover the whole amplicon (right?)

MaestSi commented 5 years ago

Sure I don't mind if you include my R script! If you ask me, 20 hours don't look like that much to me, as compared to legacy-blast that I used before, that took approximately one week using all threads available for performing an analysis, on average! The EPI2ME 16S workflow is not that fast too, as far as I remember. I did not evaluate replacing Blast+ with Vsearch, as I wanted to keep everything as similar as possible to the EPI2ME 16S workflow, but the manuscript that I showed you some weeks ago would be a valid reason for switching from Blast+ to Vsearch without having unwanted side effects. For that same reason I did not change the query coverage, which is set at 30% for EPI2ME 16S workflow, but I agree that it seems a very low value. Moreover, since we are talking about the query coverage, namely the minimum percentage of a read that should align to an entry in the reference for being considered a hit, the fact that we are using the whole amplicon or just a portion of the 16S should not matter. To me it looks like your parameters are ok, let me know if you try it out and compare the results! Good that you found a way to include those missing GB entries too! Simone

splaisan commented 5 years ago

Hi Simone,

I am still playing with vsearch (a new version of qiime will embed it more efficiently for what I did but will also require that you alter your code a bit to pass more parameters) but meanwhile did some things with MetONTIIME and various databases using blastn. You can find my demo results in a PDF linked on my page https://github.com/Nucleomics-VIB/InSilico_PCR ('here' link a the bottom of the page).

You can also find some re-formatting of the qzv labels to show less text on the MetONTIIME Qiime plots otherwise quite ugly (https://github.com/splaisan/qiime-tools/blob/master/simplify_qiime-plots.md).

MaestSi commented 5 years ago

Hi Stephane, congrats, very interesting report! Let me know in case new Vsearch parameters have to be added to the script. Best, Simone

MaestSi / MetONTIIME

artefacts for rrn_db #5

count of GB acc not matched