Open dievsky opened 5 years ago
Update: *summary* -> *-islands-summary-FDR*. Otherwise it catches some intermediate files.
Also *scoreisland* -> *-W*-G*-E*.scoreisland for the same reason.
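To illustrate why the narrower patterns help, here is a small sketch using Python's `fnmatch` module. The first and third file names are taken from the listing in this thread; the FDR summary and the E-value scoreisland names are assumptions made up to show a positive match, and the pipeline may well apply its patterns through a different mechanism (e.g. `glob`).

```python
from fnmatch import fnmatchcase

# Names 0 and 2 appear in this thread; names 1 and 3 are assumed for illustration.
names = [
    "OD1_k4me3_hg19-W200-G600-islands-summary",
    "OD1_k4me3_hg19-W200-G600-islands-summary-FDR0.01",
    "OD1_k4me3_hg19-W200-G600.scoreisland",
    "OD1_k4me3_hg19-W200-G600-E100.scoreisland",
]


def matching(pattern, names):
    """Return the names that match a shell-style pattern (case-sensitive)."""
    return [n for n in names if fnmatchcase(n, pattern)]


# The broad patterns match both files of each kind ...
broad_summary = matching("*summary*", names)
broad_score = matching("*scoreisland*", names)

# ... while the narrowed patterns match only one file of each kind.
narrow_summary = matching("*-islands-summary-FDR*", names)
narrow_score = matching("*-W*-G*-E*.scoreisland", names)
```

With these inputs the narrowed patterns keep only the `-FDR0.01` summary and the `-E100` scoreisland file, while the broad ones also pick up the shorter names.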
Is this still an issue?
To the extent of my knowledge, we haven't agreed on a uniform approach to file export filtering yet.
To reiterate, there are currently two points of export filtering: first in the peak caller wrapper script (e.g. in macs2.sh and sicer.sh), then in the pipeline (e.g. in run_macs2 and move_forward).
The most obvious consistent approach would be to filter at only one of these points. If that's not possible, we should agree on the expected output at both points.
Indeed, having two separate filtering steps looks superfluous.
IMHO we'd better do the filtering in the corresponding tool's bash script and move all the results directly to a dedicated folder. Filtering within the pipeline_chipseq.py script was introduced to avoid moving source files.
At the moment I can think of the following: we can have code that saves the initial list of files in the folder, launches the job, and then copies all NEW files to the dedicated folder.
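A minimal sketch of that idea, assuming the job can be wrapped in a Python callable. The names `snapshot` and `copy_new_files` are hypothetical and do not exist in pipeline_chipseq.py; only regular files directly inside the folder are considered.

```python
import shutil
from pathlib import Path


def snapshot(folder):
    """Return the set of regular file names currently in the folder."""
    return {p.name for p in Path(folder).iterdir() if p.is_file()}


def copy_new_files(work_dir, target_dir, job):
    """Run job() and copy every file that appeared in work_dir to target_dir.

    Returns the sorted list of newly created file names.
    """
    work_dir, target_dir = Path(work_dir), Path(target_dir)
    before = snapshot(work_dir)   # files present before the job starts
    job()                         # e.g. a function that launches the peak caller
    new_names = sorted(snapshot(work_dir) - before)
    target_dir.mkdir(parents=True, exist_ok=True)
    for name in new_names:
        shutil.copy2(work_dir / name, target_dir / name)
    return new_names
```

Note the limits of this sketch: it does not notice files that existed before and were modified by the job, it skips subdirectories, and it would misattribute files if two jobs wrote to the same folder at the same time.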
At the moment, after pipeline_test.sh is executed, we see a lot of extra files left in the out/fastq_bams folder (besides .bam and bowtie logs):
OD1_k4me3_hg19-1-removed.bed
OD1_k4me3_hg19-W200-G600-FDR0.01-island.bed
OD1_k4me3_hg19-W200-G600-FDR0.01-islandfiltered-normalized.wig
OD1_k4me3_hg19-W200-G600-FDR0.01-islandfiltered.bed
OD1_k4me3_hg19-W200-G600-islands-summary
OD1_k4me3_hg19-W200-G600.scoreisland
OD1_k4me3_hg19-W200-normalized.wig
OD1_k4me3_hg19-W200.graph
OD3_k4me3_hg19-1-removed.bed
OD3_k4me3_hg19-W200-G600-FDR0.01-island.bed
OD3_k4me3_hg19-W200-G600-FDR0.01-islandfiltered-normalized.wig
OD3_k4me3_hg19-W200-G600-FDR0.01-islandfiltered.bed
OD3_k4me3_hg19-W200-G600-islands-summary
OD3_k4me3_hg19-W200-G600.scoreisland
OD3_k4me3_hg19-W200-normalized.wig
OD3_k4me3_hg19-W200.graph
YD1_k4me3_hg19-1-removed.bed
YD1_k4me3_hg19-W200-G600-FDR0.01-island.bed
YD1_k4me3_hg19-W200-G600-FDR0.01-islandfiltered-normalized.wig
YD1_k4me3_hg19-W200-G600-FDR0.01-islandfiltered.bed
YD1_k4me3_hg19-W200-G600-islands-summary
YD1_k4me3_hg19-W200-G600.scoreisland
YD1_k4me3_hg19-W200-normalized.wig
YD1_k4me3_hg19-W200.graph
YD3_k4me3_hg19-1-removed.bed
YD3_k4me3_hg19-W200-G600-FDR0.01-island.bed
YD3_k4me3_hg19-W200-G600-FDR0.01-islandfiltered-normalized.wig
YD3_k4me3_hg19-W200-G600-FDR0.01-islandfiltered.bed
YD3_k4me3_hg19-W200-G600-islands-summary
YD3_k4me3_hg19-W200-G600.scoreisland
YD3_k4me3_hg19-W200-normalized.wig
YD3_k4me3_hg19-W200.graph
od_input_hg19-1-removed.bed
od_input_hg19.bed
pileup
yd_input_hg19-1-removed.bed
yd_input_hg19.bed
Proposed solution following a Slack discussion:
Pending any objections, I'll start implementing this approach.
No objections here; having a double check is error-prone.
As a reminder: we are currently in the process of evaluating Snakemake and its usefulness to us. For now it seems to be a more concise and stable way to achieve the same result that we aim for with pipeline_chipseq.py. If we decide to switch to Snakemake, this issue will technically become obsolete, so it's paused for now.
Completely agree with the previous comment.
The current implementation of the ChIP-seq pipeline launches sicer.sh and filters its output through the pattern ['*sicer.log', '*.bed', '*rip.csv']. However, sicer.sh used to have its own file filtering (recently removed in the aftermath of #27). Due to this convoluted system, we now export the *-1-removed.bed control files, but don't export the summary and scoreisland files (which are actually SICER's main output).
We should rework this approach so that it is logical and predictable. As far as I can see, other peak callers write output directly to the work folder, so they aren't affected by this problem.
For now, I'll change the export pattern in the pipeline to ['*sicer.log', '*summary*', '*scoreisland*', '*rip.csv'] and update tests accordingly.
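The effect of this change can be sketched by running both pattern lists over a few file names from the listing in this thread. The helper `select_exports` is hypothetical and only approximates the pipeline's behaviour with `fnmatch`; the actual filtering code may differ.

```python
from fnmatch import fnmatchcase

# Pattern lists quoted from this issue.
OLD_PATTERNS = ['*sicer.log', '*.bed', '*rip.csv']
NEW_PATTERNS = ['*sicer.log', '*summary*', '*scoreisland*', '*rip.csv']


def select_exports(names, patterns):
    """Keep the names that match at least one shell-style pattern."""
    return [n for n in names if any(fnmatchcase(n, p) for p in patterns)]


# A few file names taken from the out/fastq_bams listing in this thread.
produced = [
    "OD1_k4me3_hg19-1-removed.bed",
    "OD1_k4me3_hg19-W200-G600-islands-summary",
    "OD1_k4me3_hg19-W200-G600.scoreisland",
    "OD1_k4me3_hg19-W200.graph",
]

old_exports = select_exports(produced, OLD_PATTERNS)
new_exports = select_exports(produced, NEW_PATTERNS)
```

On this sample, the old list exports only the `-1-removed.bed` control file, while the new list exports the summary and scoreisland files and leaves the control file behind.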