PlantandFoodResearch / MCHap

Polyploid micro-haplotype assembly using Markov chain Monte Carlo simulation.
MIT License

Investigate HTPC friendly method for parallelisation #90

Closed timothymillar closed 3 years ago

timothymillar commented 3 years ago

The current multi-core method requires a single master Python process to start sub-processes, so the entire job has to be submitted as a single chunk of HTPC resources. It would be good to have an option to run in a way that is friendlier to splitting the work into many independent jobs with a tool like asub.

The best way to achieve this is with an approach similar to freebayes, where each target region is processed independently. Things to work out:

Output per process would be a VCF with a single record; these can be merged with vcflib.
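
For illustration only, the merge step could also be sketched with plain shell tools. This is a rough sketch, not the vcflib workflow itself: `merge_single_record_vcfs` is a hypothetical helper name, and it assumes every per-target VCF is uncompressed and carries an identical header (bgzipped outputs would need decompressing first).

```shell
# Hypothetical helper (not part of MCHap or vcflib): merge single-record VCFs.
# Assumes all inputs are uncompressed and share the same header.
# Usage: merge_single_record_vcfs out.vcf part1.vcf part2.vcf ...
merge_single_record_vcfs() {
    local out="$1"
    shift
    # Take the header (lines starting with '#') from the first input only.
    grep '^#' "$1" > "$out"
    # Append the records from every input, sorted by contig name and then
    # numerically by position. Contigs are sorted lexically here, which may
    # differ from the contig order of the reference.
    cat "$@" | grep -v '^#' | sort -t "$(printf '\t')" -k1,1 -k2,2n >> "$out"
}
```

Something like `merge_single_record_vcfs merged.vcf "$TMPDIR"/*.vcf` would then collect the per-target files, again assuming they have been decompressed.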

timothymillar commented 3 years ago

Example usage:

# array job: print one mchap command per target and submit them with asub
# (targets are read line by line so that tab-separated BED fields stay together)
while IFS= read -r TARGET
do
    # unique name for the VCF containing each record (tabs replaced by underscores)
    VCF="${TARGET//$'\t'/_}.vcf.gz"
    cat << EOF
mchap assemble \
    --target "$TARGET" \
    --variants "variants.vcf.gz" \
    --reference "reference.fasta" \
    --sample-bams "sample_bams.txt" \
    --sample-ploidy "sample_ploidy.txt" \
    --best-genotype | bgzip > "$TMPDIR/$VCF"
EOF
done < targets.bed | asub -c 60 -j mchap-asub

timothymillar commented 3 years ago

Using a sample-BAM map file also creates the need for a sub-tool that can build such a file from a list of BAM files and an optional list of sample names.
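
As a rough sketch of what such a sub-tool might do, the shell function below pairs BAM paths with sample names. Everything here is an assumption made for illustration: `make_sample_bam_map` is a hypothetical name, the output layout (sample name, a tab, then the BAM path) is a guess at the map format, and falling back to the BAM file name is a stand-in for however the real tool would identify samples.

```shell
# Hypothetical helper (not part of MCHap): build a sample-BAM map file.
# Assumed output format: one "sample<TAB>bam_path" line per BAM file.
# Usage: make_sample_bam_map bams.txt [samples.txt] > sample_bams.txt
make_sample_bam_map() {
    local bams="$1" samples="${2:-}"
    if [ -n "$samples" ]; then
        # Refuse to pair the two lists if they differ in length.
        if [ "$(wc -l < "$bams" | tr -d ' ')" -ne "$(wc -l < "$samples" | tr -d ' ')" ]; then
            echo "error: sample list and BAM list differ in length" >&2
            return 1
        fi
        # Pair the n-th sample name with the n-th BAM path.
        paste "$samples" "$bams"
    else
        # No sample list: fall back to the BAM file name without its
        # directory or .bam suffix.
        while IFS= read -r bam; do
            printf '%s\t%s\n' "$(basename "$bam" .bam)" "$bam"
        done < "$bams"
    fi
}
```

With a `bams.txt` holding one BAM path per line, `make_sample_bam_map bams.txt samples.txt > sample_bams.txt` would write the map using the supplied names, and omitting `samples.txt` would derive the names from the file names instead.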

timothymillar commented 3 years ago

Done in #100