BIMSBbioinfo / pigx_bsseq

bisulfite sequencing pipeline from fastq to methylation reports
https://bioinformatics.mdc-berlin.de/pigx/
GNU General Public License v3.0

re-execution being forced. #46

Closed Blosberg closed 6 years ago

Blosberg commented 6 years ago

This is from testing the current master branch. The pipeline runs smoothly the first time (although there are some warnings, such as "pandoc-citeproc: reference param not found"). Output is all green (or black) until it reads "50 of 50 steps (100%) done". So far so good.

Then, however, if I immediately run the following again:

$ ./pigx-bsseq test/samplesheet.csv -s test/settings.yaml --snakeparams " --reason "

it should report "Nothing to be done". It doesn't. Instead, it shows the pretty pig logo again, followed by:

Job 36: ----------  Converting hg19 Genome into Bisulfite analogue  ----------
Reason: Updated input files: path_links/refGenome/

My best guess is that the creation of path_links is being forced, overwriting the pre-existing version. That would make it newer than the files that depend on it and force their re-execution.

In pigx-bsseq:

try:
    os.symlink(config['locations']['genome-dir'],
               path.join(config['locations']['output-dir'], 'path_links/refGenome'))
except FileExistsError:
    pass

I don't think the try is doing what it was meant to do, though I'm not really an expert on this.

What's really weird is that if I interrupt this run and then resubmit the pipeline, it restarts from a different point:

Job 23: Processing of bam file:
   input     : 06_sorted/SEsample_v2copy_se_bt2.deduped.sorted.bam

If I then resubmit the pipeline arbitrarily many times, I believe it restarts from this point every time.
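The try block quoted above can be exercised in isolation. The following is a minimal sketch, not the pipeline's code: the helper name `ensure_symlink` and the temporary paths are made up for illustration. It checks whether a second call to `os.symlink` on an existing link replaces the link or changes the link's own timestamp.

```python
import os
import tempfile


def ensure_symlink(target, link_path):
    """Create link_path -> target unless something already exists at link_path.

    Returns True if a new link was created, False if the path was already there.
    (Hypothetical helper, for illustration only.)
    """
    os.makedirs(os.path.dirname(link_path), exist_ok=True)
    try:
        os.symlink(target, link_path)
    except FileExistsError:
        # os.symlink never replaces an existing entry; the old link is left as-is.
        return False
    return True


def demo():
    """Create the link twice and compare the link's own mtime before and after."""
    with tempfile.TemporaryDirectory() as tmp:
        genome_dir = os.path.join(tmp, "genome")
        os.mkdir(genome_dir)
        link = os.path.join(tmp, "out", "path_links", "refGenome")

        first = ensure_symlink(genome_dir, link)
        mtime_before = os.lstat(link).st_mtime_ns  # lstat: the link, not its target
        second = ensure_symlink(genome_dir, link)
        mtime_after = os.lstat(link).st_mtime_ns
        return first, second, mtime_before == mtime_after


if __name__ == "__main__":
    print(demo())
```

In this sketch the second call hits `FileExistsError` and leaves the existing link and its timestamp alone. If the same holds inside the pipeline, the try block alone would not make the link newer, and the "Updated input files" reason may come from somewhere else, for example from the directory the link points to.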

Blosberg commented 6 years ago

almost solved...

With the current update: if I have run the pipeline recently (so that it has finished at least once), keep the installation directory unchanged (i.e. the directory in which I configured and installed the pipeline), and delete only the test/out/ folder, then the first run goes through and the second run reports "Nothing to be done". Which is good.

However, if I delete the entire directory (nuke it from orbit) and install everything from scratch (i.e. set PIGX_UNINSTALLED again, then ./bootstrap && ./configure && make), the first run goes through as normal. The second run starts from the point of converting the reference genome and runs to completion. Only the third run says "Nothing to be done".

Presumably, then, we need to do the same thing with the ref_genome links that we just did with the R markdown scripts. But this is definitely progress. I might not have time to look at this over the weekend, but I will try to check into it on Monday.

rekado commented 6 years ago

then I run it a second time and it starts from the point of converting the reference genome

Hasn't that happened the first time around already? If there really is a problem, why wouldn't it happen on the third run as well?

Blosberg commented 6 years ago

If there really is a problem, why wouldn't it happen on the third run as well?

No idea... It's weird, and I don't pretend to understand it. On Monday morning I'll dig into this more thoroughly and try to be more specific.

Blosberg commented 6 years ago

why wouldn't it happen on the third run as well?

OK, it's because the genome saved in the test folder hasn't been methyl-converted. The link is made and the genome is methyl-converted during the first run, but somehow in an order that makes snakemake think it's not as new as it should be, so it has to do the methyl-conversion over again. When the conversion is done after something else (still not sure what), snakemake is happy and says "Nothing to be done". This isn't obvious if one only deletes the test/out/ folder, so I recommend we test by obliterating the test directory entirely each time.
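One mechanism that could produce this pattern, offered as an assumption rather than a confirmed diagnosis: on POSIX filesystems, creating a new entry inside a directory updates that directory's modification time. If the conversion step writes its output inside the directory that path_links/refGenome points to, the input directory would look "updated" after the first run. The sketch below only demonstrates the filesystem behaviour; the directory names are stand-ins and nothing here is taken from the pipeline.

```python
import os
import tempfile
import time


def dir_mtime_after_adding_entry():
    """Return (mtime before, mtime after) for a linked directory that gains a subdirectory."""
    with tempfile.TemporaryDirectory() as tmp:
        genome_dir = os.path.join(tmp, "genome")
        os.mkdir(genome_dir)
        link = os.path.join(tmp, "refGenome")
        os.symlink(genome_dir, link)

        # os.stat follows the link, so this is the mtime of the genome directory itself.
        before = os.stat(link).st_mtime_ns
        time.sleep(1.1)  # wait long enough for coarse timestamp granularity
        # Stand-in for a conversion step that writes its output inside the genome directory.
        os.mkdir(os.path.join(genome_dir, "converted"))
        after = os.stat(link).st_mtime_ns
        return before, after


if __name__ == "__main__":
    before, after = dir_mtime_after_adding_entry()
    print("directory mtime advanced:", after > before)
```

Comparing `os.stat` and `os.lstat` on path_links/refGenome before and after the first run would show whether the directory or the link itself is what changes.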

In principle this little hiccup shouldn't matter, since this is only the test data set and any real run will have its own reference genome. We can (1) leave it as is (which still seems a bit unclean), (2) add the methyl-conversion directories to the repository (which will bring the size of the directory up to 453K), or (3) try to toggle the creation of the link to the refGenome so that snakemake recognizes its timestamp. I'm looking into option 3 right now; if I see something easy I'll implement and test it.