aitgon / vtam

MIT License

Intermittent error when using taxassign with ncbi_nt #11

Open meglecz opened 3 years ago

meglecz commented 3 years ago

I have seen issue #9 [https://github.com/aitgon/vtam/issues/9], and taxassign now works with a small test file (test1.txt, attached), using either ncbi_nt or a custom database (blastn 2.9.0+).

However, when analysing a large file (large_test.txt), taxassign completes with the custom database but not with nt. The log is in the nohup.txt file. The NCBI nt database was downloaded on 2021-04-26, and the taxonomy file was created with 'vtam taxonomy' on the same day.

This issue resembles the one in mantis: [https://sourcesup.renater.fr/plugins/mantis/view_mantis.php?group_id=4876&pluginname=mantis]

nohup.txt test1.txt large_test.txt

aitgon commented 3 years ago

It looks like it does not like the sequence in the title line. "FASTA-Reader: Title ends with at least 20 valid nucleotide characters. Was the sequence accidentally put in the title line?"

meglecz commented 3 years ago

I do not see any error in the input files. The input is a TSV file with the sequences in the last column. VTAM itself produces the FASTA file that is passed to BLAST, so I cannot check it.

aitgon commented 3 years ago

But I think the FASTA file simply uses the variant sequence as both the id and the sequence field. You can try to reproduce the error with this variants.fasta:

seq1 seq1 seq1 seq2 ...

And this command:

blastn -out blast_output.tsv -outfmt "6 qseqid sacc pident evalue qcovhsp staxids" -query variants.fasta -db nt -evalue 1e-05 -qcov_hsp_perc 80 -num_threads 8 -dust yes
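For anyone who wants to build such a file by hand, here is a minimal sketch of the layout described above: each record uses the variant sequence as both the title and the body. It assumes the input is a TSV with a header row and the sequence in the last column, as reported in this thread. The function name `tsv_to_fasta` and the example column names are hypothetical, and this is not VTAM's actual code.

```python
import csv
import io


def tsv_to_fasta(tsv_text, has_header=True):
    """Return FASTA text in which every record uses the sequence as its id.

    Assumption: the sequence is the last column of each TSV row.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    rows = [row for row in reader if row]  # skip empty lines
    if has_header:
        rows = rows[1:]
    lines = []
    for row in rows:
        seq = row[-1].strip()
        if not seq:
            continue
        lines.append(f">{seq}")  # title line holds the sequence itself
        lines.append(seq)        # sequence line
    return "".join(line + "\n" for line in lines)


# Hypothetical two-variant input; the column names are made up for illustration.
example_tsv = "variant_id\tsequence\n1\tACGTACGT\n2\tTTGACCA\n"
fasta_text = tsv_to_fasta(example_tsv)
```

Writing `fasta_text` to variants.fasta gives a file that should trigger the same FASTA-Reader message about nucleotide characters in the title line.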

meglecz commented 3 years ago

Using the sequences in the title lines seems OK; the message ‘FASTA-Reader: Title ends with at least 20 valid nucleotide characters. Was the sequence accidentally put in the title line?’ is just a warning. I get a correct output file when using a small test file with 2 sequences.

blastn -out blast_output_small.tsv -outfmt "6 qseqid sacc pident evalue qcovhsp staxids" -query small.fasta -db /usr/local/ncbi_nt_2021-04-03/nt -evalue 1e-05 -qcov_hsp_perc 80 -num_threads 8 -dust yes
FASTA-Reader: Title ends with at least 20 valid nucleotide characters.  Was the sequence accidentally put in the title line?
FASTA-Reader: Title ends with at least 20 valid nucleotide characters.  Was the sequence accidentally put in the title line?

However, when I try to run the large FASTA file (12625 sequences), the job gets killed without any informative message:

blastn -out blast_output_large_ids.tsv -outfmt "6 qseqid sacc pident evalue qcovhsp staxids" -query large_ids.fasta -db /usr/local/ncbi_nt_2021-04-03/nt -evalue 1e-05 -qcov_hsp_perc 80 -num_threads 8 -dust yes
zsh: killed     blastn -out blast_output_large_ids.tsv -outfmt  -query large_ids.fasta -db   

The results for the first ca. 2200-2300 sequences are written to the output file. I tried on 2 different desktops; the process stopped around the same sequence on both, but not at exactly the same one (2205 on one, 2298 on the other).

CPU and memory usage went up to nearly 100%, so it might be a problem with available resources.
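One way to test the resource hypothesis would be to split the query file into smaller batches and run blastn on each batch separately. The sketch below assumes that smaller query batches lower peak memory use, which has not been confirmed in this thread. The helper names, the batch size of 500 and the `blast_batches` directory are hypothetical; the blastn options are copied from the command above.

```python
import subprocess
from pathlib import Path


def read_fasta(text):
    """Parse FASTA text into a list of (title, sequence) tuples."""
    records, title, parts = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if title is not None:
                records.append((title, "".join(parts)))
            title, parts = line[1:], []
        else:
            parts.append(line)
    if title is not None:
        records.append((title, "".join(parts)))
    return records


def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def blastn_command(query, out, db, threads=8):
    """Build the blastn argument list with the options used in this thread."""
    return [
        "blastn", "-out", str(out),
        "-outfmt", "6 qseqid sacc pident evalue qcovhsp staxids",
        "-query", str(query), "-db", str(db),
        "-evalue", "1e-05", "-qcov_hsp_perc", "80",
        "-num_threads", str(threads), "-dust", "yes",
    ]


def run_in_batches(fasta_path, db, batch_size=500, workdir="blast_batches"):
    """Write each batch to its own FASTA file and run blastn on it in turn."""
    records = read_fasta(Path(fasta_path).read_text())
    out_dir = Path(workdir)
    out_dir.mkdir(exist_ok=True)
    outputs = []
    for i, batch in enumerate(chunk(records, batch_size)):
        query = out_dir / f"batch_{i:04d}.fasta"
        out = out_dir / f"batch_{i:04d}.tsv"
        query.write_text("".join(f">{t}\n{s}\n" for t, s in batch))
        # check=True raises CalledProcessError if blastn fails or is killed
        subprocess.run(blastn_command(query, out, db), check=True)
        outputs.append(out)
    return outputs
```

For example, `run_in_batches("large_ids.fasta", "/usr/local/ncbi_nt_2021-04-03/nt")` would split the 12625 sequences into 26 batches; the per-batch TSV files can then be concatenated. If a single batch still gets killed, that would point away from the number of queries as the cause.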