c-guzman / cipher-workflow-platform

A data processing platform for ChIP-seq, RNA-seq, MNase-seq, DNase-seq, ATAC-seq and GRO-seq datasets. Please ignore information on cipher.readthedocs.io, it is currently out of date. Follow information in README.

resample #12

Open jsalignon opened 6 years ago

jsalignon commented 6 years ago

Hi, I found that the subsample process is quite slow. I tried using reformat.sh to subsample 155M PE reads down to just 100 PE reads, and it took more than 15 minutes on my local machine. I guess the function scans the entire file in order to keep representative sequences. But is that really necessary? Couldn't we just take the first N reads of the file? I have replaced this function with the very simple script below, which just takes the first N reads and runs in seconds.

lineNb = Math.round(params.subsampled_reads * 4)

gunzip -c ${read1} | head -n ${lineNb} > ${id}_R1.subsampled.fq
gzip ${id}_R1.subsampled.fq
gunzip -c ${read2} | head -n ${lineNb} > ${id}_R2.subsampled.fq
gzip ${id}_R2.subsampled.fq
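
For reference, the same idea can be sketched as a standalone shell function. This is only a rough sketch: the name take_first_reads and its arguments are hypothetical and not part of cipher, and it assumes strict 4-line FASTQ records and mates listed in the same order in R1 and R2.

```shell
# Hypothetical helper (not part of cipher): keep only the first N reads
# of a gzipped FASTQ file. Assumes every record is exactly 4 lines.
# The body runs in a subshell so its variables do not leak to the caller.
take_first_reads() (
    input="$1"     # gzipped FASTQ to read from
    n_reads="$2"   # number of reads to keep
    output="$3"    # gzipped FASTQ to write

    # One FASTQ record = 4 lines (header, sequence, '+', qualities).
    n_lines=$(( n_reads * 4 ))

    # head exits after n_lines lines, so the rest of the input is never
    # decompressed; the result is recompressed directly, with no temporary file.
    gunzip -c "$input" | head -n "$n_lines" | gzip > "$output"
)

# Example call for a pair (file names are placeholders):
# take_first_reads sample_R1.fq.gz 100 sample_R1.subsampled.fq.gz
# take_first_reads sample_R2.fq.gz 100 sample_R2.subsampled.fq.gz
```

Using the same read count for both files should keep the pairs in sync as long as R1 and R2 are in the same order. One caveat: if the shell runs with pipefail enabled, gunzip being cut off once head exits may be reported as a failure on large inputs.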

Best, Jerome