AfshinLab / BLR

MIT License

Enable running merging/rerunning previous runs #35

Closed pontushojer closed 4 years ago

pontushojer commented 4 years ago

Fix https://github.com/FrickTobias/BLR/issues/228.

Add an option to initialize the workdir based on files generated in previous run(s). This allows one to reuse intermediate files when running with new parameters or to merge different datasets.

Example 1: Merge BAMs from different runs to increase coverage

We have analysed dataset1 and dataset2 separately using the standard setup but want to merge them to increase coverage. We can then initialize a new run based on these analyses.

$ blr init -w /path/to/dataset1-analysis -w /path/to/dataset2-analysis -l blr dataset1-dataset2-merged
SETTINGS FOR: init (version: 0.1.2.dev58+gbce7b21.d20200723)
 reads1: None
 library_type: blr
 from_workdir: [PosixPath('/path/to/dataset1-analysis'), PosixPath('/path/to/dataset2-analysis')]
 directory: dataset1-dataset2-merged
config - INFO: Changing value of 'library_type': None --> blr.
init - INFO: Directory dataset1-dataset2-merged initialized.
init - INFO: Edit dataset1-dataset2-merged/blr.yaml.
init - INFO: Run 'cd dataset1-dataset2-merged && blr run anew' to start the analysis.

This creates a directory dataset1-dataset2-merged with the config blr.yaml as usual, but it also contains a folder called inputs where files from the previous analyses are softlinked. These can then be used to set up the analysis folder (after the configs have been updated) by running:

$ blr run anew 

This will generate all files required to rerun the steps that follow basic processing (e.g. variant calling and phasing). Specifically, it will merge the key files: barcode.clstr, final.molecule_stats.filtered.tsv and final.bam. Once this is finished, the merged data is ready for further analysis.
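To illustrate the inputs folder described above, here is a minimal Python sketch of how files from previous workdirs could be softlinked into a new run directory. The function name `link_inputs` and the `<old workdir name>.<file name>` link naming are assumptions made for illustration, not the layout this PR actually uses; only the three file names come from the description.

```python
import tempfile
from pathlib import Path

# File names are taken from the description; everything else is illustrative.
KEY_FILES = ["barcode.clstr", "final.molecule_stats.filtered.tsv", "final.bam"]


def link_inputs(workdirs, new_dir):
    """Softlink key files from each previous workdir into <new_dir>/inputs.

    Hypothetical naming scheme: inputs/<old workdir name>.<file name>.
    Workdirs sharing the same directory name would collide under this scheme.
    """
    inputs = Path(new_dir) / "inputs"
    inputs.mkdir(parents=True, exist_ok=True)
    links = []
    for workdir in map(Path, workdirs):
        for name in KEY_FILES:
            source = workdir / name
            if not source.exists():
                continue  # skip files the old run did not produce
            link = inputs / f"{workdir.name}.{name}"
            link.symlink_to(source.resolve())
            links.append(link)
    return links


# Small demo using placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    runs = [tmp / "dataset1-analysis", tmp / "dataset2-analysis"]
    for run in runs:
        run.mkdir()
        for name in KEY_FILES:
            (run / name).write_text("placeholder")
    demo_links = link_inputs(runs, tmp / "dataset1-dataset2-merged")
    demo_names = sorted(link.name for link in demo_links)
    print(demo_names)
```

Softlinking rather than copying keeps the new directory small and leaves the original analyses untouched.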

NB! For the merging to work properly, the different runs should have different sample_nr values defined in their configs. This tags each barcode with an integer, which stops barcodes that overlap between datasets from being treated as the same barcode in downstream analysis.
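The effect of sample_nr can be shown with a small Python sketch. The `-<sample_nr>` suffix format and the helper `tag_barcode` are assumptions for illustration only; the barcode sequences are made up.

```python
def tag_barcode(barcode: str, sample_nr: int) -> str:
    """Append the sample number to a barcode (hypothetical '-<n>' suffix format)."""
    return f"{barcode}-{sample_nr}"


# Made-up barcodes; the first sequence happens to occur in both runs.
dataset1 = ["ACGTACGT", "TTGACCAA"]
dataset2 = ["ACGTACGT", "GGCATTAC"]

# Without tagging, the shared sequence collapses into a single barcode on merge.
untagged = set(dataset1) | set(dataset2)

# With a distinct sample_nr per run, each barcode stays tied to its own dataset.
tagged = {tag_barcode(bc, 1) for bc in dataset1} | {
    tag_barcode(bc, 2) for bc in dataset2
}

print(len(untagged), len(tagged))  # prints "3 4"
```

If both runs used the same sample_nr, the tagged set would shrink back to three entries and molecules from unrelated droplets would be attributed to one barcode.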

Example 2: Test different variant caller

We have called variants using freebayes and want to try gatk on datasetA. We initialize a new analysis folder based on the old run.

$ blr init -w /path/to/datasetA-freebayes -l blr datasetA-gatk

As for the merged runs, we set up the required files by running blr run anew. In this case the files from the old run will be used directly to generate the files for the new analysis.

Edit: I updated the implementation with https://github.com/NBISweden/BLR/pull/35/commits/fd4538f5e1b951ca1cced9a52db663dd4b594de8 so that it runs faster (multi-threading enabled) and you don't need to call blr run twice to set up and run. Instead, including from_partial runs a separate workflow that sets up the run folder before the regular command starts.

Edit2: Renamed the trigger for running the setup workflow from "from_partial" to "anew", which is shorter and more to the point.
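The "setup first, then the regular command" behaviour can be sketched in Python as below. This is only an illustration of the idea under stated assumptions: `plan_workflows` and the workflow labels "setup" and "main" are hypothetical names, not identifiers from the BLR code base.

```python
def plan_workflows(targets):
    """Return the ordered (workflow, targets) pairs to run.

    Hypothetical labels: 'setup' stands in for the workflow that builds the
    run folder from previous runs, 'main' for the regular analysis workflow.
    """
    targets = list(targets)
    plan = []
    if "anew" in targets:
        # 'anew' only triggers the setup step; it is not a target itself.
        plan.append(("setup", []))
        targets = [t for t in targets if t != "anew"]
    plan.append(("main", targets))
    return plan


print(plan_workflows(["anew"]))
print(plan_workflows(["some_target"]))  # 'some_target' is a placeholder name
```

Treating the trigger as a keyword that is stripped before the regular workflow runs is what lets a single invocation cover both setup and analysis.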