gbouras13 / hybracter

Automated long-read first bacterial genome assembly tool implemented in Snakemake using Snaketool.
MIT License

add latest medaka models #84

Closed vdejager closed 2 months ago

vdejager commented 4 months ago

Is your feature request related to a problem? Please describe.

I get an error when I request the latest medaka model (the one matching my dorado basecalling model) while using the Docker/Singularity container.

dorado model: dna_r10.4.1_e8.2_400bps_sup@v5.0.0
medaka model: r1041_e82_400bps_sup_v5.0.0_model.tar.gz

https://github.com/nanoporetech/medaka/blob/master/medaka/data/
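The dorado and medaka model names in this report appear to follow a simple pattern: drop the `dna_` prefix, remove the dots from the chemistry part, and join the version with an underscore instead of `@`. The helper below is a minimal sketch of that mapping. The function name `dorado_to_medaka_model` is hypothetical, and the pattern is inferred only from the names in this thread, so check the result against the medaka data directory linked above before relying on it.

```shell
# Hypothetical helper: guess the medaka model name for a dorado model name.
# ASSUMPTION: the naming pattern is inferred from the names in this issue
# and is not guaranteed to hold for every model.
dorado_to_medaka_model() {
    local dorado_model="$1"
    # Require the "<chemistry>@<version>" form used by dorado model names.
    case "$dorado_model" in
        *@*) ;;
        *) echo "expected <model>@<version>, got: $dorado_model" >&2; return 1 ;;
    esac
    local chemistry="${dorado_model%%@*}"   # text before the '@'
    local version="${dorado_model##*@}"     # text after the '@'
    chemistry="${chemistry#dna_}"           # drop the leading "dna_"
    # Remove the dots from the chemistry part only; the version keeps its dots.
    chemistry="$(printf '%s' "$chemistry" | tr -d '.')"
    printf '%s_%s\n' "$chemistry" "$version"
}

dorado_to_medaka_model "dna_r10.4.1_e8.2_400bps_sup@v5.0.0"
```

For the model in this report the function prints `r1041_e82_400bps_sup_v5.0.0`, which is the name of the medaka model archive minus the `_model.tar.gz` suffix.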

Describe the solution you'd like

Scan the medaka models for new versions automatically. It looks like there is already a rule for downloading them: hybracter/workflow/rules/download/download_medaka_models.smk

Describe alternatives you've considered

Rerun the basecalling with a supported model.

Additional context

gbouras13 commented 4 months ago

Hi @vdejager ,

We have decided not to update Medaka beyond v1.8.0 (and therefore we won't be adding newer models) due to the fragility of its installation and also its lack of efficacy on newer data. You can read a bit more here: https://github.com/gbouras13/hybracter?tab=readme-ov-file#v020-updates-26-october-2023---medaka-polishing-and---no_medaka

If you have Dorado v5 SUP data, I'd highly recommend using --no_medaka to turn off polishing. In the manuscript (and also on Ryan's blog https://rrwick.github.io on a number of occasions), we found that polishing Dorado v4.3 SUP assemblies made them worse, and I would imagine it is the same story for v5.

George

gwl2 commented 2 months ago

Hi, I'd also need the new models. I think whether you consider a polished assembly better or worse depends on the readout, and probably also on the species and isolate (we observed isolate-specific sequencing errors). The bugs I work with usually benefit from Medaka polishing and it hardly ever makes them worse - but my readout is based on allele calling.

So could you implement an option that lets the user provide hybracter with a medaka conda environment to be used for polishing? That way you can keep medaka fixed but the user has a workaround.

Here is my workaround to get a working medaka environment when it's not available through conda:

    mamba create --name medaka_112_proper python=3.10 minimap2 samtools bcftools numpy=1.26.4 pyabpoa
    mamba activate medaka_112_proper
    pip install medaka

Thanks a lot, best Gabriel

gbouras13 commented 2 months ago

Hi @vdejager @gwl2 ,

I am trying the latest medaka via a pip install in the environment file. The issue is that Nanopore doesn't support bioconda builds anymore.

@gwl2 , that solution won't do for broad distribution with hybracter as it is too complicated, but any keen users can certainly try it out (you can modify the medaka environment hybracter builds). Thanks for it!

George

gbouras13 commented 2 months ago

Hi @gwl2 @vdejager ,

I have updated to medaka v1.12.1 in the latest version of hybracter (v0.8.0), for Linux at least. Please give it a go.

George

gwl2 commented 2 months ago

@gbouras13 thanks a lot! I have a general question regarding the suggestion to modify hybracter's medaka environment. Hybracter handles conda environments in an uncommon way - at least the medaka environment is not directly among my conda environments, but seems to be a sub-environment/nested environment(?) of hybracter. How do you generate and activate such environments?

gbouras13 commented 2 months ago

@gwl2 ,

These are generated by snakemake when you run Hybracter. These need to be generated to run Hybracter, but once generated you can activate and modify them as you wish.

You can see the location in the snakemake output e.g. (https://github.com/gbouras13/hybracter/issues/62)

Error in rule medaka_incomplete:
    jobid: 41
    input: hybracter_out/processing/incomp_pre_polish/Sample2.fasta, hybracter_out/processing/qc/Sample2_filt_trim.fastq.gz
    output: hybracter_out/processing/incomplete/medaka_incomplete/Sample2/consensus.fasta, hybracter_out/versions/Sample2/medaka_incomplete.version, hybracter_out/supplementary_results/intermediate_incomplete_assemblies/Sample2/Sample2_medaka.fasta
    log: hybracter_out/stderr/medaka_incomplete/Sample2.log (check log file(s) for error details)
    conda-env: /Users/eidedtul/miniforge3/envs/base_osx-64/envs/hybracterENV/lib/python3.12/site-packages/hybracter/workflow/conda/3a238a896824eb2007e785c8a56e5932_
    shell:

        medaka_consensus -i hybracter_out/processing/qc/Sample2_filt_trim.fastq.gz -d hybracter_out/processing/incomp_pre_polish/Sample2.fasta -o hybracter_out/processing/incomplete/medaka_incomplete/Sample2 -m r1041_e82_400bps_sup_v4.2.0  -t 16 2> hybracter_out/stderr/medaka_incomplete/Sample2.log
        medaka --version > hybracter_out/versions/Sample2/medaka_incomplete.version
        cp hybracter_out/processing/incomplete/medaka_incomplete/Sample2/consensus.fasta hybracter_out/supplementary_results/intermediate_incomplete_assemblies/Sample2/Sample2_medaka.fasta
        touch hybracter_out/processing/incomplete/medaka_incomplete/Sample2/calls_to_draft.bam
        rm hybracter_out/processing/incomplete/medaka_incomplete/Sample2/calls_to_draft.bam
        touch hybracter_out/processing/incomplete/medaka_incomplete/Sample2/consensus_probs.hdf
        rm hybracter_out/processing/incomplete/medaka_incomplete/Sample2/consensus_probs.hdf

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Logfile hybracter_out/stderr/medaka_incomplete/Sample2.log:

and if you want to activate it, you can do so with the full path, e.g.:

conda activate /Users/eidedtul/miniforge3/envs/base_osx-64/envs/hybracterENV/lib/python3.12/site-packages/hybracter/workflow/conda/3a238a896824eb2007e785c8a56e5932_ 
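If the Snakemake output has been saved to a file, the environment path can also be pulled out of the `conda-env:` lines rather than copied by hand. This is only a sketch: the function name `extract_conda_envs` and the log file name in the usage comment are hypothetical, and it assumes the log uses the same `conda-env:` line format as the output shown above.

```shell
# Hypothetical helper: print the unique conda environment paths that
# Snakemake reports on "conda-env:" lines of a saved log file.
# ASSUMPTION: the log uses the same line format as the output above.
extract_conda_envs() {
    # $1: path to a file containing the Snakemake output
    # Keep only the path, dropping leading and trailing whitespace.
    sed -n 's/^[[:space:]]*conda-env:[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*$/\1/p' "$1" | sort -u
}

# Example use (the file name is a placeholder):
#   conda activate "$(extract_conda_envs snakemake_output.log | head -n 1)"
```

Each rule that uses a conda environment reports its own path, so a log from a full run may list several environments. The medaka one is the path reported under a medaka rule such as `medaka_incomplete`.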

George