resample - Githubissues

Hi, I found that the subsampling step is pretty slow. I used reformat.sh to subsample 155M PE reads down to just 100 PE reads, and it took more than 15 minutes on my local machine. I guess it scans the entire file to keep a representative set of sequences. But is that really necessary? Couldn't we just take the first N reads of the file? I have replaced this step with the very simple script below, which just takes the first N reads and runs in seconds.

lineNb = Math.round(params.subsampled_reads * 4)

gunzip -c ${read1} | head -n ${lineNb} > ${id}_R1.subsampled.fq
gzip ${id}_R1.subsampled.fq
gunzip -c ${read2} | head -n ${lineNb} > ${id}_R2.subsampled.fq
gzip ${id}_R2.subsampled.fq
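
For illustration, the same idea can be wrapped in a small shell function that streams straight into gzip, so no uncompressed intermediate file is written. This is only a sketch: the function name `take_first_reads` and the file names in the comments are made up, and it assumes standard 4-line FASTQ records with both mates stored in the same order in R1 and R2.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of the workflow): keep the first N reads of a
# gzipped FASTQ file, assuming exactly 4 lines per record.
take_first_reads() {
    local input="$1" output="$2" n_reads="$3"
    local n_lines=$(( n_reads * 4 ))
    # head exits after n_lines, so only the start of the file is decompressed.
    gunzip -c "$input" | head -n "$n_lines" | gzip > "$output"
}

# Example calls with made-up file names. Using the same N for both mates
# keeps the pairs in sync when R1 and R2 list reads in the same order:
#   take_first_reads sample_R1.fastq.gz sample_R1.subsampled.fq.gz 100
#   take_first_reads sample_R2.fastq.gz sample_R2.subsampled.fq.gz 100
```

One caveat: if the calling script enables `pipefail`, the early exit of `head` can make `gunzip` end with a broken-pipe status on large inputs, so that status may need to be tolerated.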

Best, Jerome

c-guzman / cipher-workflow-platform

resample #12