Open vinicius-santos-bmc opened 1 month ago
Thanks, there was indeed a problem in blast.py; we have fixed it in the new version (v2.0.2).
args = ['diamond','blastx', '-q', origin_file, '-d', self.database_path, '-o',
output_tsv, '-e', '1E-4', '--query-gencode',str(self.translate_table), '-k', str(1), '-p', str(self.threads),'-f',str(6)]
In addition, we use fastp, which is able to process sequence data faster.
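For readers following along, here is a minimal sketch of how an argument list like the one quoted above can be assembled and run from Python. The function names and parameters are hypothetical and are not VirID's actual code; only the diamond flags are copied from the snippet above. Every value is converted to a string, since subprocess expects string arguments.

```python
import subprocess


def build_diamond_blastx_args(query, database, output_tsv, translate_table=1, threads=1):
    """Return a diamond blastx argument list (hypothetical helper, not VirID's API).

    The flags mirror the snippet quoted in this thread; every value is
    converted to str so the list can be handed to subprocess directly.
    """
    return [
        "diamond", "blastx",
        "-q", str(query),
        "-d", str(database),
        "-o", str(output_tsv),
        "-e", "1E-4",
        "--query-gencode", str(translate_table),
        "-k", "1",
        "-p", str(threads),
        "-f", "6",
    ]


def run_diamond(args):
    """Run the command; check=True raises CalledProcessError on a non-zero exit."""
    return subprocess.run(args, check=True)
```

Calling run_diamond(build_diamond_blastx_args("contigs.fa", "nr", "out.tsv", threads=24)) would launch diamond, assuming it is installed and on PATH; the file names here are placeholders.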
I'm still having a problem, but now after the contamination-removal process:
fastp -i SRR10078305_1.fastq -I SRR10078305_2.fastq -o virid_out_path/assembly_and_basic_annotation/step3_QC_1.fq -O virid_out_path/assembly_and_basic_annotation/step3_QC_2.fq -h virid_out_path/assembly_and_basic_annotation/step3_QC.html --detect_adapter_for_pe --dedup --dup_calc_accuracy 4 --dont_eval_duplication --low_complexity_filter --thread 24
fastp v0.23.4, time used: 60 seconds
[2024-10-09 22:01:16] INFO: [assembly_and_basic_annotation] Remove rRNA
[2024-10-09 22:06:40] INFO: [assembly_and_basic_annotation] Use megahit to splice reads into contigs
[2024-10-09 22:08:44] INFO: [assembly_and_basic_annotation] Running diamond blastx to compare /home/vinisantos/anaconda3/envs/mamba/envs/virid/lib/python3.12/site-packages/VirID/data/diamond_database/RdRP_230330_rmdup
[2024-10-09 22:08:46] INFO: [assembly_and_basic_annotation] Running diamond blastx to compare /data/databases/blastdb_08032023/nr/nr
[2024-10-10 00:45:57] INFO: [assembly_and_basic_annotation] Remove contigs that cannot be translated into longer amino acid contigs.
[2024-10-10 00:46:19] INFO: [assembly_and_basic_annotation] Contigs annotation
[2024-10-10 00:51:06] INFO: [assembly_and_basic_annotation] Cut the sequence contamination at both ends of contigs
[2024-10-10 01:10:30] TASK: END Primary Screen PART...
[2024-10-10 01:10:30] ERROR: Controlled exit resulting from early termination.
Do you have the NT database? This step helps us trim fragments that come from the host. If you don't need it, you must add --no_trim_contamination to skip this step.
Yes, I do, but I was using the nt_core database, which contains virus sequences. I'll try --no_trim_contamination. Maybe in future versions, to avoid having to download the heavy nt_euk and nt_prok databases for contamination removal, you could ask the user for the host's NCBI genome FASTA and then map the reads against it with bowtie2 to remove host contamination. There are also smaller databases for removing lab contamination. I think this would be the best option.
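To make the suggestion concrete, here is a rough sketch of the two bowtie2 commands such a host-removal step could use. This is not something VirID does today; the helper names and file paths are hypothetical. Only standard bowtie2 options are used: --un-conc writes the read pairs that did not align concordantly to the host, which are the ones to keep.

```python
def bowtie2_build_cmd(host_fasta, index_prefix):
    """Command that builds a bowtie2 index from a host genome FASTA."""
    return ["bowtie2-build", str(host_fasta), str(index_prefix)]


def bowtie2_host_filter_cmd(index_prefix, reads_1, reads_2, unaligned_path, threads=1):
    """Command that maps paired reads to the host and keeps the non-host pairs.

    Hypothetical helper: pairs that fail to align concordantly are written
    to unaligned_path (bowtie2 adds .1/.2 mate suffixes, or replaces a '%'
    in the path with the mate number); the alignments themselves are discarded.
    """
    return [
        "bowtie2",
        "-x", str(index_prefix),
        "-1", str(reads_1),
        "-2", str(reads_2),
        "--un-conc", str(unaligned_path),
        "-S", "/dev/null",
        "-p", str(threads),
    ]
```

Both lists can be passed to subprocess.run(..., check=True). Whether concordant-pair filtering is strict enough for a given host is a judgment call that this sketch does not settle.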
If you use the nt_core database, which contains virus sequences, VirID will delete all viral contigs.
By the way, the problem may not be caused by the trim_contamination part. You can check whether step10_blastn_trimed.fasta (the result of the trim_contamination part) exists in assembly_and_basic_annotation. You can also check the contents of the RPM_abundance folder, which will help us find the problem.
In the meantime, thank you for your suggestions; we may improve this part in the next update.
The second stage does not include filtering out known viral sequences based on NT blastn results. In my analysis, it appears that the known viruses were present in the first stage but were removed in the second stage of building the evolutionary tree. This result contradicts the expected workflow described in the article.
The file step10_blastn_trimed.fasta exists, but it's empty, and there is no RPM_abundance folder. So the error probably occurred before that.
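The check described above can be scripted. The sketch below is a hypothetical helper, not part of VirID; both paths are passed in explicitly because the exact location of the RPM_abundance folder inside the output directory is not stated in this thread.

```python
from pathlib import Path


def summarize_outputs(fasta_path, rpm_dir):
    """Report whether the trimmed FASTA exists, how many records it holds,
    and whether the RPM_abundance folder is present."""
    fasta_path, rpm_dir = Path(fasta_path), Path(rpm_dir)
    n_records = 0
    if fasta_path.is_file():
        with fasta_path.open() as handle:
            # Each FASTA record starts with a '>' header line.
            n_records = sum(1 for line in handle if line.startswith(">"))
    return {
        "fasta_exists": fasta_path.is_file(),
        "fasta_records": n_records,
        "rpm_dir_exists": rpm_dir.is_dir(),
    }
```

An existing file with zero records and a missing folder would match the situation reported here.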
I also had a problem when using the --no_trim_contamination argument:
[2024-10-12 12:21:38] INFO: [assembly_and_basic_annotation] Remove rRNA
[2024-10-12 12:27:16] INFO: [assembly_and_basic_annotation] Use megahit to splice reads into contigs
[2024-10-12 12:29:02] INFO: [assembly_and_basic_annotation] Running diamond blastx to compare /home/vinisantos/anaconda3/envs/mamba/envs/virid/lib/python3.12/site-packages/VirID/data/diamond_database/RdRP_230330_rmdup
[2024-10-12 12:29:04] INFO: [assembly_and_basic_annotation] Running diamond blastx to compare /data/databases/blastdb_08032023/nr/nr
[2024-10-12 15:06:31] INFO: [assembly_and_basic_annotation] Remove contigs that cannot be translated into longer amino acid contigs.
[2024-10-12 15:06:57] INFO: [assembly_and_basic_annotation] Contigs annotation
[2024-10-12 15:11:50] INFO: Summary results
[2024-10-12 15:11:51] ERROR: Uncontrolled exit resulting from an unexpected error.
================================================================================
EXCEPTION: KeyError
MESSAGE: "['longest_aa_length'] not in index"
________________________________________________________________________________
Traceback (most recent call last):
File "/home/vinisantos/anaconda3/envs/mamba/envs/virid/lib/python3.12/site-packages/VirID/__main__.py", line 55, in main
gt_parser.parse_options(args)
File "/home/vinisantos/anaconda3/envs/mamba/envs/virid/lib/python3.12/site-packages/VirID/main.py", line 141, in parse_options
self.assembly_and_basic_annotation(options)
File "/home/vinisantos/anaconda3/envs/mamba/envs/virid/lib/python3.12/site-packages/VirID/main.py", line 80, in assembly_and_basic_annotation
Summary(options,assembly_and_basic_annotation_path,"Primary_screen_res.tsv").run()
File "/home/vinisantos/anaconda3/envs/mamba/envs/virid/lib/python3.12/site-packages/VirID/rvm/summary.py", line 82, in run
final_output = nvi[['qseqid','NR_qlen','longest_aa_length','NR_sseqid','protein','NR_Virus','super_group','kindom','phylum','class','order','family','genus','species','virus_type']]
~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/vinisantos/anaconda3/envs/mamba/envs/virid/lib/python3.12/site-packages/pandas/core/frame.py", line 4108, in __getitem__
indexer = self.columns._get_indexer_strict(key, "columns")[1]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/vinisantos/anaconda3/envs/mamba/envs/virid/lib/python3.12/site-packages/pandas/core/indexes/base.py", line 6200, in _get_indexer_strict
self._raise_if_missing(keyarr, indexer, axis_name)
File "/home/vinisantos/anaconda3/envs/mamba/envs/virid/lib/python3.12/site-packages/pandas/core/indexes/base.py", line 6252, in _raise_if_missing
raise KeyError(f"{not_found} not in index")
KeyError: "['longest_aa_length'] not in index"
================================================================================
Will this be fixed soon or is there no way to use the tool?
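For anyone debugging this locally: the KeyError means the table being summarized has no longest_aa_length column at the point where summary.py selects it. A small helper like the one below (hypothetical, not VirID code) shows which expected columns are absent. The column list is copied from the traceback above, spelling included; which step is supposed to create the missing column is not visible in the traceback.

```python
# Column names copied verbatim from the traceback in this thread.
EXPECTED_COLUMNS = [
    "qseqid", "NR_qlen", "longest_aa_length", "NR_sseqid", "protein",
    "NR_Virus", "super_group", "kindom", "phylum", "class", "order",
    "family", "genus", "species", "virus_type",
]


def missing_columns(available, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from `available`.

    `available` can be any iterable of names, for example a DataFrame's
    .columns or the header row of a TSV file.
    """
    present = set(available)
    return [name for name in expected if name not in present]
```

In pandas, DataFrame.reindex(columns=...) would fill absent columns with NaN instead of raising, but whether that is the right fix for VirID is not something the traceback can tell us.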
We've fixed the bugs involved in --no_trim_contamination; you can check the logs on GitHub.
Hi! I had this issue:
What could I be doing wrong?