genomic-medicine-sweden / jasen

Bacterial typing pipeline for clinical NGS data. Written in NextFlow, Python & Bash.
https://jasen.readthedocs.io/en/latest/
GNU General Public License v3.0

add subsampling of reads before de novo assembly #160

Open LordRust opened 1 year ago

LordRust commented 1 year ago

Since too many reads just introduce more error edges in the assembly graph, we should add a subsampling step before assembly. seqtk would be the obvious, speedy candidate for this, but there are other tools as well. Aside from producing better assemblies, subsampling would also shorten the running time.

For regular genomic data I think 200x would be a good starting point.
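
To turn a target depth such as 200x into the read count a subsampler expects, something along these lines might work. It is only a rough sketch: the function name `reads_for_depth`, the 5 Mbp genome size and the 2 x 150 bp read length are assumptions for illustration, not values from the pipeline.

```python
def reads_for_depth(genome_size_bp, read_length_bp, target_depth=200, paired=True):
    """Return the number of reads (or read pairs, if paired) per FASTQ file
    needed to reach roughly `target_depth` coverage.

    Assumes all reads have about the same length; trimmed or variable-length
    reads will give a somewhat lower real depth.
    """
    if genome_size_bp <= 0 or read_length_bp <= 0 or target_depth <= 0:
        raise ValueError("genome size, read length and depth must be positive")
    bases_needed = target_depth * genome_size_bp
    # A read pair contributes two reads' worth of bases.
    bases_per_unit = read_length_bp * (2 if paired else 1)
    # Integer ceiling division, so we never round below the target depth.
    return (bases_needed + bases_per_unit - 1) // bases_per_unit


if __name__ == "__main__":
    # Hypothetical example: 5 Mbp genome, 2 x 150 bp reads, 200x target.
    print(reads_for_depth(5_000_000, 150))
```

The returned number is what would be requested from each of the two paired FASTQ files. If the sample already has fewer reads than that, subsampling can simply be skipped.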

ryanjameskennedy commented 1 year ago

Just out of interest, I ran into this problem recently and SKESA gave an error saying:

Invalid file <expected_read_filename>

ryanjameskennedy commented 2 months ago

Example of how to run:

seqtk sample -s100 read1.fq 10000 > sub1.fq
seqtk sample -s100 read2.fq 10000 > sub2.fq
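
Note that both commands use the same seed (`-s100`), which is what keeps the read pairs in sync between the two files. If this were wrapped in Python, a minimal sketch could look like the following. The helper names (`seqtk_sample_cmd`, `paired_commands`, `run_to_file`) are made up for illustration and are not part of the pipeline; `run_to_file` assumes `seqtk` is on `PATH`.

```python
import shlex
import subprocess


def seqtk_sample_cmd(fastq_in, n_reads, seed=100):
    """Build the argument list for `seqtk sample -s<seed> <in.fq> <n>`."""
    return ["seqtk", "sample", f"-s{seed}", str(fastq_in), str(n_reads)]


def paired_commands(read1, read2, n_reads, seed=100):
    """Build one command per mate file, sharing the seed to preserve pairing."""
    return [
        seqtk_sample_cmd(read1, n_reads, seed),
        seqtk_sample_cmd(read2, n_reads, seed),
    ]


def run_to_file(cmd, fastq_out):
    """Run a command and write its stdout to `fastq_out` (seqtk prints to stdout)."""
    with open(fastq_out, "w") as handle:
        subprocess.run(cmd, stdout=handle, check=True)


if __name__ == "__main__":
    # Only prints the commands; call run_to_file() to actually execute them.
    for cmd in paired_commands("read1.fq", "read2.fq", 10000):
        print(shlex.join(cmd))
```

Building the argument list separately from running it makes the command easy to inspect or log before anything touches the read files.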