Class Rmdup calling cd-hit rather than cd-hit-est - Githubissues

ZiyueYang01 / VirID

VirID: An integrated platform for the discovery and characterization of RNA Viruses

MIT License

10 stars 4 forks source link

Class Rmdup calling cd-hit rather than cd-hit-est #7

Open cerebis opened 1 month ago

cerebis commented 1 month ago

When invoking the end_to_end workflow with a single set of Illumina paired-end reads, I have been getting an early error within the class fasta_process:Rmdup.

Example call:

VirID end_to_end --threads 32 -i xxx_R1.fq.gz -i2 xxx_R2.fq.gz -out_dir test_out

Looking at VirID code indicated in the traceback, I am concerned about two potential serious issues.

There appears to be an unintentional misuse of the CD-HIT suite within VirID, where cd-hit is being called on DNA sequences. cd-hit is intended for aminoacid sequences, while cd-hit-est is the version intended for DNA.
Although the CLI options do exist in cd-hit-2d/cd-hit-est-2d, the commands cd-hit/cd-hit-est do not possess the -i2 nor -o2 options. Consequently, the VirID call to cd-hit when two input files exist will always fail (see below)

https://github.com/ZiyueYang01/VirID/blob/cd88b2977c279837e11ed76a2ceba0e4c4e29d22/VirID/external/fasta_process.py#L41-L42

That said, I do not think opting for cd-hit-est-2d is the correct choice. Rather, paired-end reads can be handled by cd-hit-est as follows:

cd-hit-est -i input_R1.fq -j input_R2.fq -o output_R1.fq -op output_R2.fq

Lastly, if the goal of this processing stage is to remove duplicate read pairs, other more computationally efficient means exist, such as the tool fastp -- which can also perform quality/adapter trimming.

Eg. the following would perform standard clean-up on raw reads and remove duplicates

fastp -i input_R1.fq -I input_R2.fq -o output_R1.fq -O output_R2.fq --dedup

ZiyueYang01 commented 1 month ago

Thanks, it is indeed cd-hit-est, we've made a correction .

fastp may indeed be a better option and we will seriously consider it at the end of our testing.