BIMSBbioinfo / pigx_bsseq

bisulfite sequencing pipeline from fastq to methylation reports
https://bioinformatics.mdc-berlin.de/pigx/
GNU General Public License v3.0
11 stars 4 forks source link

Need to be able to print shell commands for final_report rule #139

Closed Blosberg closed 4 years ago

Blosberg commented 5 years ago

The final report rule in BS seq uses run instead of shell, which means we cannot see what's happening inside snakemake with the option printshellcmds.

This enormously increases the difficulty of debugging when issues are encountered (e.g. Issue #119).

We need to either change "run" to "shell" in this rule, or find some other way to output the command.

rekado commented 5 years ago

Note that the rule calls generateReports, which does use shell in the end.

Blosberg commented 5 years ago

yes, but at the level of job submission, --printshellcmds will not propagate into the generateReports function. If we're using a shell command anyway, then I don't see any reason why we can't just have this directly in the rule. What is the point of the dumps variable?

Blosberg commented 5 years ago

More generally though, the problem is just that it makes things non-standard. If all of our other output is using shell to produce output, then we should avoid changing that for just one specific rule, unless we really need to. Doing things the same way every time just keeps everything cleaner. irregularities are messy and invariably lead to problems like this.

Blosberg commented 5 years ago

resolved by commit https://github.com/BIMSBbioinfo/pigx_bsseq/commit/f7508cb0b5350b79652d310f5b15cb03f6cdd09c

Final report generation shell command is now sent to the snakemake log file.

rekado commented 5 years ago

More generally though, the problem is just that it makes things non-standard. If all of our other output is using shell to produce output, then we should avoid changing that for just one specific rule, unless we really need to. Doing things the same way every time just keeps everything cleaner. irregularities are messy and invariably lead to problems like this.

I disagree. There's nothing wrong with using whatever mechanism there is if it allows us to solve problems. We ended up using run for a reason. Calling it an irregularity makes it sound like there is no reason to do it this way in the first place. git blame may help you answer this question.

(Besides, if anything is messy it's using shell at all ;))

Blosberg commented 5 years ago

makes it sound like there is no reason to do it this way in the first place.

That was the main point of this question: why exactly it was being done this way. Git blame on the file didn't provide any illuminating answer on this --perhaps I missed something. Can you explain why this is run stuff is necessary? ( or post a more specific git blame command that would provide that answer)

alexg9010 commented 5 years ago

I think this solution of printing into stdout is very nice, only nice thing would be to make it conditional ( only when printShellCmd).

About the why are we using it?:

I think run is used to execute python code, which was needed in this case, because we built the command using python. Especially this dump:q was to place the json formatted params dict into the command.

The benefit of this functionality is that we do not have to explicitly have to hardcode the params of the two reports (final, diffmeth), but can define them in the rule and pass them as dict.

Blosberg commented 5 years ago

The benefit of this functionality is that we do not have to explicitly have to hardcode the params of the two reports (final, diffmeth), but can define them in the rule and pass them as dict.

ok, I'm a little foggy on some of the details in there, but that sounds like a sufficiently good reason for setting it up in this way.

only nice thing would be to make it conditional ( only when printShellCmd).

Agreed, that would be ideal, but passing the flag into the function is an extra step of work. Moreover, this output isn't going to screen; I just tested it and saw that this it is actually sent to the snakemake output log file for the job in question (which is, itself, going to be relegated to a sub-folder of pigx_work/), so it's not really cluttering up any output. The only time a user will see the output of this print statement is if they are specifically looking into the log files (in which case, it would usually make sense to see the command), so I'm ok with just leaving it as is.

Blosberg commented 5 years ago

Unfortunately, a simple print statement doesn't actually convey all the necessary information. This is what the output looks like currently:

==== The present shell command being submitted by the generateReport function is as follows: 
nice -19 /home/bosberg/projects/pigx/pigx_bsseq_production/guix/.guix-profile/bin/Rscript --vanilla {DIR_scripts}/generate_report.R --scriptsDir=/home/bosberg/projects/pigx/pigx_bsseq_production/scripts/ --reportFile={input.template} --outFile={output.report} --finalReportDir=/scratch/AG_Akalin/bosberg/cfDNA_cardiac_pigx_output/Final_Reports/ --report.params={dumps:q} --logFile={log}
==== End of shell command submitted by the generateReport function.

And then when the job crashes, the only additional information comes from the SM run:

Error in rule final_report:
    jobid: 121
    output: /scratch/AG_Akalin/bosberg/cfDNA_cardiac_pigx_output/Final_Reports/N2_RDML00564_HT2LKCCXY_L1_1_val_1_bt2.sorted.deduped_hg19_final.html
    log: /scratch/AG_Akalin/bosberg/cfDNA_cardiac_pigx_output/Final_Reports/N2_RDML00564_HT2LKCCXY_L1_1_val_1_bt2.sorted.deduped_hg19_final.log
    cluster_jobid: Your job 3166421 ("snakejob.final_report.121.sh") has been submitted

So at least we know what the output is, but it's still a massive pain to figure out what the inputs and all the parameters were in order to run the job manually in an R session.

Especially this dump:q was to place the json formatted params dict into the command.

Why do we need to dump this all in json format? We used to just include the parameters directly in the shell command.

alexg9010 commented 5 years ago

The dumping was because:

The benefit of this functionality is that we do not have to explicitly have to hardcode the params of the two reports (final, diffmeth), but can define them in the rule and pass them as dict.

If we have so many issues with the generate_report() function we might aswell trash it and write explicit shell commands inside the rule.

@rekado @katwre @Blosberg what do you think?

katwre commented 5 years ago

@alexg9010 I don't mind any, using explicit shell commands or run, if there are more advantages of using shell lets go for it. I remember that for final and diffmeth it was a bit tricky with the code and it was the only way to go (passing parameters) and so we had to use run.

Blosberg commented 5 years ago

Print ArgsL rds file to pigx_work, and then make a note of that fact in the SM message.

alexg9010 commented 4 years ago

should be fixed with https://github.com/BIMSBbioinfo/pigx_bsseq/pull/159