When invoking the `end_to_end` workflow with a single set of Illumina paired-end reads, I have been getting an early error within the class `fasta_process:Rmdup`.
Example call:
Looking at the VirID code indicated in the traceback, I am concerned about two potentially serious issues.
There appears to be an unintentional misuse of the CD-HIT suite within VirID, where `cd-hit` is being called on DNA sequences. `cd-hit` is intended for amino acid sequences, while `cd-hit-est` is the version intended for DNA.
Although the CLI options do exist in `cd-hit-2d`/`cd-hit-est-2d`, the commands `cd-hit`/`cd-hit-est` possess neither the `-i2` nor the `-o2` option. Consequently, the VirID call to `cd-hit` when two input files exist will always fail (see below).
https://github.com/ZiyueYang01/VirID/blob/cd88b2977c279837e11ed76a2ceba0e4c4e29d22/VirID/external/fasta_process.py#L41-L42
That said, I do not think opting for `cd-hit-est-2d` is the correct choice. Rather, paired-end reads can be handled by `cd-hit-est` as follows:
Lastly, if the goal of this processing stage is to remove duplicate read pairs, other more computationally efficient means exist, such as the tool `fastp`, which can also perform quality/adapter trimming.
E.g., the following would perform standard clean-up on raw reads and remove duplicates:
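Since VirID drives these tools from Python, here is a minimal sketch of how the two alternatives above might be assembled as argument lists. The function names, file names, and the identity threshold of 1.0 are my own placeholders, not anything taken from VirID. It assumes a `fastp` release that supports `--dedup` and a `cd-hit-est` build that supports the paired-end options `-j`, `-op` and `-P`; please check both against the installed versions before relying on it.

```python
import shlex


def fastp_dedup_cmd(r1, r2, out1, out2):
    """Build a fastp call for paired-end clean-up plus duplicate removal.

    Hypothetical helper: quality filtering is on by default in fastp, so
    only adapter detection and deduplication are requested explicitly.
    """
    return [
        "fastp",
        "-i", r1, "-I", r2,          # paired-end inputs (R1, R2)
        "-o", out1, "-O", out2,      # paired-end outputs (R1, R2)
        "--detect_adapter_for_pe",   # adapter auto-detection for paired-end data
        "--dedup",                   # drop duplicated reads/pairs
    ]


def cdhit_est_pe_cmd(r1, r2, out1, out2, identity=1.0):
    """Build a cd-hit-est call that treats R1/R2 as paired-end input.

    Hypothetical helper: uses -j/-op/-P 1 rather than the -i2/-o2 options,
    which only exist in cd-hit-2d/cd-hit-est-2d. The identity value is an
    assumption chosen to approximate duplicate removal.
    """
    return [
        "cd-hit-est",
        "-i", r1, "-j", r2,          # R1 and R2 inputs
        "-o", out1, "-op", out2,     # R1 and R2 outputs
        "-P", "1",                   # declare the input as paired-end
        "-c", str(identity),         # sequence identity threshold
    ]


if __name__ == "__main__":
    # Placeholder file names; print the commands instead of running them.
    args = ("sample_R1.fastq.gz", "sample_R2.fastq.gz",
            "clean_R1.fastq.gz", "clean_R2.fastq.gz")
    print(shlex.join(fastp_dedup_cmd(*args)))
    print(shlex.join(cdhit_est_pe_cmd(*args)))
```

Either list can be passed to `subprocess.run(cmd, check=True)` once the tools are on `PATH`; building the command as a list avoids shell quoting problems with unusual file names.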