tutorial feedback - Githubissues

stolarczyk commented 4 years ago

I was able to successfully run the tutorial 🎉

FYI, I wanted to use the new PEP and pipeline interface formats, so I cloned the dev/cfg2 branches of our pipelines and looper. Additionally I used GenomicDistributions@dev to test the plots with recent updates.

With this software configuration I ended up with 10 bedfiles in Elasticsearch, so 5 samples failed. However, I think only the GenomicDistributions discrepancy is actually relevant here, since all the submission scripts were produced successfully.

Link to my $BBTUTORIAL/outputs/bedstat_output/bedstat_pipeline_logs/looper_logs.txt

Here's some feedback:

[x] no need to cd $HOME at the beginning. I'd like to run this somewhere else
[x] no need to unzip the open signal matrices, data.table::fread supports reading gzipped files
[x] software names display style is still not consistent. Sometimes they are preformatted and sometimes not
[x] also pay attention to software names capitalization, e.g. Elasticsearch instead of elasticsearch
[x] use mkdir -p to create nested directories instead of creating dir by dir
[x] in "Run bedstat on the demo PEP" section: the largest chunk of text in the entire tutorial is devoted to explanation of bedstat run splitting (--no-db-commit and --just-db-commit), which
1. is not required for this tutorial to run
2. as discussed before, will be split into two Python scripts at some point
[x] again, find the first time you refer to a piece of software and add the link there. An interested reader would have already looked that up
[x] proposition: in the beginning briefly define what bedfile and especially bedset mean in our system
[x] update to PEP 2.0
[x] fix genomic distributions errors on qthist for fixed width files (?)
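
As a rough illustration of the `mkdir -p` item above, here is a minimal sketch. It assumes `$BBTUTORIAL` is the tutorial root and falls back to a temporary directory if it is unset; the nested path is the log directory mentioned earlier in this thread, and `LOG_DIR` is just a name chosen for the example.

```shell
# Assumption: BBTUTORIAL is the tutorial root; use a throwaway directory if it is unset.
BBTUTORIAL="${BBTUTORIAL:-$(mktemp -d)}"

# Hypothetical variable name; the path is the log directory referenced in this thread.
LOG_DIR="$BBTUTORIAL/outputs/bedstat_output/bedstat_pipeline_logs"

# One call creates every missing parent directory and does not fail if they already exist.
mkdir -p "$LOG_DIR"
```

This replaces a sequence of separate `mkdir` calls, one per directory level, and is safe to re-run.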

joseverdezoto commented 4 years ago

Glad the tutorial is running well! I just glanced over the looper log file. It looks like bedstat can't completely process those bed files because of an issue with GenomicDistributions, more specifically the plotQTHist function.

```
Error in cut.default(dists, divisions, labels) : 'breaks' are not unique
Calls: doItAall ... grid.draw -> plotQTHist -> cutDists -> cut -> cut.default
In addition: Warning message:
Vectorized input to `element_text()` is not officially supported.
Results may be unexpected or may change in future versions of ggplot2.
Execution halted
```

nsheff commented 4 years ago

probably there's not a wide enough distribution so it's duplicating breaks. that's a bug in GD, should create an issue.
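
To make the "duplicating breaks" idea concrete, here is a small sketch. It is not how GenomicDistributions computes its divisions; it only assumes, for illustration, that breaks are taken at quartile positions of the region widths. The input values are made up to mimic a fixed-width file.

```shell
# Hypothetical input: ten regions that all share the same width (a fixed-width file).
widths="500 500 500 500 500 500 500 500 500 500"

# Illustrative assumption: use the values at the 0%, 25%, 50%, 75% and 100%
# positions of the sorted widths as break points.
breaks=$(printf '%s\n' $widths | sort -n | awk '
  { v[NR] = $1 }
  END {
    for (q = 0; q <= 4; q++) {
      i = int(q * (NR - 1) / 4) + 1
      print v[i]
    }
  }')

# Count all breaks and the distinct ones; a mismatch means some breaks repeat.
n_breaks=$(printf '%s\n' $breaks | wc -l | tr -d ' ')
n_unique=$(printf '%s\n' $breaks | sort -nu | wc -l | tr -d ' ')
echo "breaks: $n_breaks, unique: $n_unique"
```

When every width is identical, all break points collapse to the same value, which is the kind of input R's `cut` rejects with "'breaks' are not unique".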

stolarczyk commented 4 years ago

@joseverdezoto in the looper run command you added the -R option. looper run has no -R option defined, so it does nothing. Perhaps you wanted to pass the argument to the pipeline. The argument passing strategy has changed in looper v1.2.0.

I'll make the change in the code. Just wanted to point that out for future reference. See http://looper.databio.org/en/latest/parameterizing-pipelines/

joseverdezoto commented 4 years ago

I added that flag because I came across a warning that the pipeline wasn't properly shut down. That message suggested to run looper in -R mode. I'll keep that in mind.

stolarczyk commented 4 years ago

do you still have that log somewhere?

joseverdezoto commented 4 years ago

I don't think I do. I removed the entire tutorial produced folder when I ran it again. I'll let you know if I come across that warning again.

stolarczyk commented 4 years ago

I presume that the message you're referring to comes from pypiper:

https://github.com/databio/pypiper/blob/67908f2ee5f51fa5fdddb67eb6d7891aefeeda6a/pypiper/manager.py#L1099-L1103

It suggests running the pipeline, not looper, in recover mode. So using looper run --command-extra="-R" is the way to go
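
A sketch of what that invocation might look like. The config file name is a placeholder, not the tutorial's actual file, and the command is only assembled and printed here so the snippet does not require looper to be installed.

```shell
# Hypothetical project config path; substitute the tutorial's actual PEP config file.
PEP_CONFIG="project_config.yaml"

# -R is not a looper option; --command-extra hands the string on to the
# pipeline command, so pypiper sees -R and runs in recover mode.
LOOPER_CMD="looper run $PEP_CONFIG --command-extra=\"-R\""

# Print the command for inspection; paste it into a shell to actually run it.
echo "$LOOPER_CMD"
```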

databio / bedhost

tutorial feedback #27