ENCODE-DCC / atac-seq-pipeline

ENCODE ATAC-seq pipeline
MIT License
392 stars 174 forks source link

How to set the seed when subsampling reads #441

Open Al-Murphy opened 6 months ago

Al-Murphy commented 6 months ago

I have been using atac.subsample_reads but I would like to a seed for this analysis so I can ensure different reads are subsampled across different runs, is there a way to do this? I can't see it listed in the parameters?

Thanks!

Al-Murphy commented 6 months ago

Just as an update, I tried hard coding a different seed in subsample_ta_se in encode_lib_genomic.py (my data is single-ended). However, the resulting pooled bigwig file data was the exact same as when only setting atac.subsample_reads. So it appears that setting the seed is having no affect, not sure where else this is having an effect, I can't see it anywhere in the atac.wdl code?

Updated code:

# bash-only
    cmd = 'zcat -f {} | '
    if non_mito:
        # cmd += 'awk \'{{if ($1!="'+mito_chr_name+'") print $0}}\' | '
        cmd += 'grep -v \'^'+mito_chr_name+'\\b\' | '
    if subsample > 0:
        cmd += 'shuf -n {} --random-source=<(openssl enc -aes-256-ctr '
        #CHANGE HERE
        #cmd += '-pass pass:$(zcat -f {} | wc -c) -nosalt '
        cmd += '-pass pass:101 -nosalt '
        cmd += '</dev/zero 2>/dev/null) | '
        cmd += 'gzip -nc > {}'
        cmd = cmd.format(
            ta,
            subsample,
            ta_subsampled)
    else:
        cmd += 'gzip -nc > {}'
        cmd = cmd.format(
            ta,
            ta_subsampled)