A data processing platform for ChIP-seq, RNA-seq, MNase-seq, DNase-seq, ATAC-seq and GRO-seq datasets. Please ignore the information on cipher.readthedocs.io; it is currently out of date. Follow the information in the README.
Hi,
I found that the subsample process is quite slow. I tried using the reformat.sh function to subsample 155M PE reads down to just 100 PE reads, and it took more than 15 minutes on my local machine. I assume the function scans the entire file in order to keep a representative set of sequences. But is that really necessary? Couldn't we just take the first N reads of the file?
I have replaced this function with the very simple script below, which just takes the first N reads and runs in seconds.
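// A FASTQ record spans 4 lines, so N reads correspond to 4 * N lines
lineNb = Math.round(params.subsampled_reads * 4)
# Keep the first N reads of each mate; head exits early, so the rest of the file is never read
gunzip -c ${read1} | head -n ${lineNb} > ${id}_R1.subsampled.fq
gzip ${id}_R1.subsampled.fq
gunzip -c ${read2} | head -n ${lineNb} > ${id}_R2.subsampled.fq
gzip ${id}_R2.subsampled.fq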
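For anyone who wants to try the same idea outside the pipeline, here is a minimal standalone sketch. The function name `take_first_reads` and the file names in the usage line are hypothetical, and it assumes standard 4-line FASTQ records. Keep in mind that the first N reads of a file are not a random sample, so this trades representativeness for speed.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the pipeline): keep the first N reads of a
# gzipped FASTQ file. Assumes each record spans exactly 4 lines.
take_first_reads() {
    local input="$1"    # gzipped FASTQ input
    local n_reads="$2"  # number of reads to keep
    local output="$3"   # gzipped FASTQ output
    local n_lines=$(( n_reads * 4 ))
    # head exits after n_lines, so gunzip stops early instead of
    # decompressing the whole file; the result is compressed on the fly.
    gunzip -c "$input" | head -n "$n_lines" | gzip > "$output"
}
```

Example call: `take_first_reads sample_R1.fastq.gz 100 sample_R1.subsampled.fq.gz`, and the same again for R2 with the same read count so that both mates stay in sync.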
Best, Jerome