ChIP-seq pipeline and double file filtering

dievsky commented 5 years ago

The current implementation of the ChIP-seq pipeline launches sicer.sh and filters its output through the pattern ['*sicer.log', '*.bed', '*rip.csv']. However, sicer.sh used to have its own file filtering (recently removed in the aftermath of #27). Due to this convoluted system, we now export the *-1-removed.bed control files, but don't export the summary and scoreisland files (which are actually SICER's main output).

We should rework this approach so that it is logical and predictable. As far as I can see, other peak callers write output directly to the work folder, so they aren't affected by this problem.

For now, I'll change the export pattern in the pipeline to ['*sicer.log', '*summary*', '*scoreisland*', '*rip.csv'] and update the tests accordingly.

dievsky commented 5 years ago

Update: *summary* -> *-islands-summary-FDR*. Otherwise it catches some intermediate files.

dievsky commented 5 years ago

Also *scoreisland* -> *-W*-G*-E*.scoreisland for the same reason.
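
To illustrate why the narrower patterns matter, here is a minimal sketch using Python's fnmatch. The helper name select_exports is hypothetical and not taken from pipeline_chipseq.py. Two of the file names come from the listing later in this thread; the "-FDR0.01" summary name and the "-E100" scoreisland name are assumed examples of final outputs, chosen only so that the narrow patterns have something to match.

```python
from fnmatch import fnmatchcase


def select_exports(filenames, patterns):
    """Return the file names matching at least one glob pattern, in input order."""
    return [f for f in filenames if any(fnmatchcase(f, p) for p in patterns)]


files = [
    "OD1_k4me3_hg19-W200-G600-islands-summary",          # from the thread; presumably intermediate
    "OD1_k4me3_hg19-W200-G600-islands-summary-FDR0.01",  # assumed name of the final summary
    "OD1_k4me3_hg19-W200-G600.scoreisland",              # from the thread; presumably intermediate
    "OD1_k4me3_hg19-W200-G600-E100.scoreisland",         # assumed name of the final scoreisland
    "OD1_k4me3_hg19-1-removed.bed",                      # control file, should not be exported
]

# Patterns from the first comment: broad enough to catch intermediate files too.
broad = ["*sicer.log", "*summary*", "*scoreisland*", "*rip.csv"]
# Patterns after the two updates above.
narrow = ["*sicer.log", "*-islands-summary-FDR*", "*-W*-G*-E*.scoreisland", "*rip.csv"]

for name in select_exports(files, narrow):
    print(name)
```

With the broad patterns all four summary and scoreisland names are selected; with the narrow ones only the two names carrying the FDR and E parts remain.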

olegs commented 5 years ago

Is this still an issue?

dievsky commented 5 years ago

To the best of my knowledge, we haven't agreed on a uniform approach to file export filtering yet. To reiterate, there are currently two points of export filtering: first in the peak caller wrapper script (e.g. in macs2.sh and sicer.sh), then in the pipeline (e.g. in run_macs2 and move_forward). The most obvious consistent approach would be to keep filtering at only one of these points. If that's not possible, we should agree on the expected output at both points.

olegs commented 5 years ago

Indeed, having two different filters looks superfluous. IMHO we'd better filter in the corresponding tool's bash script and move all the results directly to a dedicated folder. Filtering within the pipeline_chipseq.py script was introduced to avoid moving source files. At the moment I can think of the following: we can have code that saves the initial list of files in the folder, launches the job, and then copies all NEW files to the dedicated folder.

olegs commented 5 years ago

At the moment, after pipeline_test.sh is executed, we see a lot of extra files left in the out/fastq_bams folder (besides the .bam files and bowtie logs):

OD1_k4me3_hg19-1-removed.bed
OD1_k4me3_hg19-W200-G600-FDR0.01-island.bed
OD1_k4me3_hg19-W200-G600-FDR0.01-islandfiltered-normalized.wig
OD1_k4me3_hg19-W200-G600-FDR0.01-islandfiltered.bed
OD1_k4me3_hg19-W200-G600-islands-summary
OD1_k4me3_hg19-W200-G600.scoreisland
OD1_k4me3_hg19-W200-normalized.wig
OD1_k4me3_hg19-W200.graph
OD3_k4me3_hg19-1-removed.bed
OD3_k4me3_hg19-W200-G600-FDR0.01-island.bed
OD3_k4me3_hg19-W200-G600-FDR0.01-islandfiltered-normalized.wig
OD3_k4me3_hg19-W200-G600-FDR0.01-islandfiltered.bed
OD3_k4me3_hg19-W200-G600-islands-summary
OD3_k4me3_hg19-W200-G600.scoreisland
OD3_k4me3_hg19-W200-normalized.wig
OD3_k4me3_hg19-W200.graph
YD1_k4me3_hg19-1-removed.bed
YD1_k4me3_hg19-W200-G600-FDR0.01-island.bed
YD1_k4me3_hg19-W200-G600-FDR0.01-islandfiltered-normalized.wig
YD1_k4me3_hg19-W200-G600-FDR0.01-islandfiltered.bed
YD1_k4me3_hg19-W200-G600-islands-summary
YD1_k4me3_hg19-W200-G600.scoreisland
YD1_k4me3_hg19-W200-normalized.wig
YD1_k4me3_hg19-W200.graph
YD3_k4me3_hg19-1-removed.bed
YD3_k4me3_hg19-W200-G600-FDR0.01-island.bed
YD3_k4me3_hg19-W200-G600-FDR0.01-islandfiltered-normalized.wig
YD3_k4me3_hg19-W200-G600-FDR0.01-islandfiltered.bed
YD3_k4me3_hg19-W200-G600-islands-summary
YD3_k4me3_hg19-W200-G600.scoreisland
YD3_k4me3_hg19-W200-normalized.wig
YD3_k4me3_hg19-W200.graph
od_input_hg19-1-removed.bed
od_input_hg19.bed
pileup
yd_input_hg19-1-removed.bed
yd_input_hg19.bed

dievsky commented 5 years ago

Proposed solution following a Slack discussion:

each peak caller wrapper moves forward all newly generated files (i.e. files not present during the wrapper launch);
no other filtering is done whatsoever.

Pending any objections, I'll start implementing this approach.
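
A minimal sketch of the proposal above, assuming the wrapper logic lives in Python. The function name run_and_move_new_files and its parameters are hypothetical, and job stands for whatever launches the peak caller; none of this is existing pipeline code.

```python
import shutil
from pathlib import Path


def run_and_move_new_files(work_dir, target_dir, job):
    """Run job(), then move every file that appeared in work_dir to target_dir.

    Returns the sorted list of moved file names.
    """
    work_dir, target_dir = Path(work_dir), Path(target_dir)
    # Create the target first so it is part of the snapshot even if it
    # happens to live inside work_dir.
    target_dir.mkdir(parents=True, exist_ok=True)

    # Snapshot of names present before the wrapper launches.
    before = {p.name for p in work_dir.iterdir()}

    job()

    moved = []
    # sorted() materializes the listing before any file is moved.
    for path in sorted(work_dir.iterdir()):
        if path.name not in before:
            shutil.move(str(path), str(target_dir / path.name))
            moved.append(path.name)
    return moved
```

Note that comparing names only detects files that did not exist before the launch. A file the job overwrites in place would stay in the work folder, so this sketch relies on the peak caller never reusing an input file name.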

olegs commented 5 years ago

No objections here, having a double check is error-prone.

dievsky commented 5 years ago

As a reminder: we are currently in the process of evaluating Snakemake and its usefulness to us. For now it seems to be a more concise and stable way to achieve the same result that we aim for with pipeline_chipseq.py. If we decide to switch to Snakemake, this issue will technically become obsolete, so it's paused for now.

olegs commented 5 years ago

Completely agree with the previous comment.

JetBrains-Research / washu

ChIP-seq pipeline and double file filtering #75