databio / pepatac

A modular, containerized pipeline for ATAC-seq data processing
http://pepatac.databio.org
BSD 2-Clause "Simplified" License
54 stars 15 forks source link

Extract peak cleaning from `pepatac.py` into a `tools/` script #295

Open syntonym opened 2 weeks ago

syntonym commented 2 weeks ago

PEPATAC produces the PEPATAC_commands.sh and PEPATAC_log.md files, which makes it very easy to follow which commands were executed with which parameters and how each output file was exactly produced. Given the PEPATAC_command.sh file, it seems possible to execute that file and get the exact same results back. However, one step is directly implement in the pepatac.py file and is thus missing from PEPATAC_commands.sh and PEPATAC_log.md. In commit 461ae32c8ddf06bad5362aea1430b5dd714a3f3f this is the lines 2362-2384, containing the following code:

                df = pd.read_csv(peak_output_file, sep='\t', header=None,
                                 names=("V1","V2","V3","V4","V5","V6",
                                        "V7","V8","V9","V10"))
                nineNine = df['V5'].quantile(q=0.99)
                df.loc[df['V5'] > nineNine, 'V5'] = nineNine

                # rescale score to be between 0 and 1000
                df['V5'] = rescale(np.log(df['V5']), [0, 1000])

                cs = pd.read_csv(res.chrom_sizes, sep='\t', header=None,
                                 names=("V1","V2"))
                df = df.merge(cs, on="V1")
                df.columns = ["V1","V2","V3","V4","V5","V6",
                              "V7","V8","V9","V10","V11"]
                # make sure 'chromEnd' positions are not greater than the 
                # max chrom_size
                n = np.array(df['V3'].values.tolist())
                df['V3'] = np.where(n > df['V11'], df['V11'], n).tolist()

                df = df.drop(columns=["V11"])
                # ensure score is a whole integer value
                df['V5'] = pd.to_numeric(df['V5'].round(), downcast='integer')
                df.to_csv(temp.name, sep='\t', header=False, index=False)

Refactoring these lines out into a tools/ script and calling it through the normal mechanism would include the call into the PEPATAC_commands.sh and PEPATAC_log.md files, making it possible to track how each result file was produced exactly.

syntonym commented 2 weeks ago

There is also a postprocess step for the fseq and fseq2 peak caller in lines 2107-2121, which does the same. That step overrides the peak file instead of creating a temporary file.

donaldcampbelljr commented 1 week ago

Yes. This sounds good. We will attempt to add this with our next release. Also, please feel free to submit a PR if you would like to take a shot at it.