bailey-lab / MIPTools

A suite of computational tools used for molecular inversion probe design, data processing, and analysis.
https://miptools.readthedocs.io
MIT License
6 stars 9 forks source link

Freebayes fails with 1 thread #33

Closed arisp99 closed 2 years ago

arisp99 commented 2 years ago

Bug Description

When the freebayes_caller.py script is called from within the container, with the default settings, the script crashes. I traced this to the number of threads option. When the number of threads is set to 1 (the default), the script fails. As the script fails this will likely mean that freebayes_call() is unable to run with 1 thread and some internal function is causing this bug.

Suggested Solution

We should set the default number of threads to a different number. We could use 20, which is the default in the Jupyter notebooks.

Reprex

Follow the analysis pipeline guide and run through the Jupyter app until you reach the Freebayes step. At this point, you can run the freebayes_caller.py script.

# This command fails
singularity exec \
    -B test-data/DR1_project_resources:/opt/project_resources \
    -B test-data/pf_species_resources:/opt/species_resources \
    -B test-data/wrangler:/opt/data \
    -B test-data/variant:/opt/analysis \
    miptools_v0.4.0.sif \
    python /opt/src/freebayes_caller.py --threads=1

# This command works
singularity exec \
    -B test-data/DR1_project_resources:/opt/project_resources \
    -B test-data/pf_species_resources:/opt/species_resources \
    -B test-data/wrangler:/opt/data \
    -B test-data/variant:/opt/analysis \
    miptools_v0.4.0.sif \
    python /opt/src/freebayes_caller.py --threads=2
arisp99 commented 2 years ago

After some more investigation, I was able to trace this issue to the bwa_multi() function in these lines of code: https://github.com/bailey-lab/MIPTools/blob/515276ded84a5b3ab21e6d5a7df56aea31f45cc3/src/mip_functions.py#L2412-L2419

Every time, the loop is run, a new option, indicating the number of threads is appending to the current options list. Moreover, the option is formatted incorrectly and the output looks like this:

print(options)
#> ['mem', '-t', '20']

options.extend("-t" + str(processor_number))
print(options)
#> ['mem', '-t', '20', '-', 't', '1']

The correct list should like

print(options)
#> ['mem', '-t', '20', '-t', '1']

# OR
print(options)
#> ['mem', '-t', '20', '-t 1']
aydemiro commented 2 years ago

We need to move the extend line outside of the loop and enclose in square brackets.

if parallel_processes == 1:
    options.extend(["-t" + str(processor_number)])
    for f in fastq_files:
        # get base file name
        base_name = f.split(".")[0]
        bam_name = base_name + extension
        bwa(f, bam_name, output_type, fastq_dir, bam_dir, options, species,
             base_name)
arisp99 commented 2 years ago

I was just about to commit a very similar solution!

Instead of using extend, my proposed solution was:

if parallel_processes == 1:
        # Set number of processors
        options = options + ["-t " + str(processor_number)]

        # Run bwa on each fastq file
        for f in fastq_files:
            # get base file name
            base_name = f.split(".")[0]
            bam_name = base_name + extension
            bwa(f, bam_name, output_type, fastq_dir, bam_dir, options, species,
                base_name)

I actually have the commit ready to push if this is fine with you @aydemiro. I tested it out already and it works!

aydemiro commented 2 years ago

Sure. This should work just fine. Feel free to push.