[Feature request] Optional automatic genome size estimation

richardstoeckl commented 2 months ago

Hi George,

I wanted to discuss the option of adding an 'auto' mode to the -c parameter.

In my opinion, the assembly of novel prokaryotes or at least the assembly of genomes that will be taxonomically classified AFTER they have been assembled, is a common use case of assemblers in general. Therefore, it doesn't surprise me, that both shovill and dragonflye contain a genome size estimation step.

Both shovill and dragonflye use KMC for genome size estimation based on k-mer counting (see Line 136 in shovill and Line 214 in dragonflye). Both use the basic but well known formula to estimate the genome size: For a given sequence of length L, and a k-mer size of k, the total k-mer’s possible will be given by ( L – k ) + 1 (see this paper or for example this tutorial).

A way to implement this in snakemake would be something like this:

# Since kmc is unable to deal with tabs in the fastq headers (see https://github.com/refresh-bio/KMC/issues/42#issuecomment-340614580),
# we need to sanitize the input reads before running kmc. So here we replace tabs with underscores. Maybe hybracter does this already
rule sanitizeInputReads:
    input:
        longReads = lookup(
            query="sampleID == '{sampleID}'",
            within=sampleDF,
            cols="longReadFastq",
        ),
    output:
        longReads = temp(os.path.join(INTERIMPATH, "{sampleID}", "sanitized", "{sampleID}_sanitized.fastq.gz")),
    conda:
        os.path.join(workflow.basedir, "envs", "assembly.yaml")
    shell:
        """
        seqkit replace -p "\t" -r '_' {input.longReads} -o {output.longReads}
        """

# this is the actual rule that runs kmc. It needs to be defined as a checkpoint, so that the helper function can access the output easily
checkpoint kmc:
    input:
        longReads = rules.sanitizeInputReads.output.longReads,
    output:
        kmcLOG = os.path.join(RESULTPATH,"{sampleID}", "genomeSize", "{sampleID}_kmcLOG.txt"),
        dir = temp(directory(os.path.join(RESULTPATH,"{sampleID}", "genomeSize"))),
    shadow: "shallow"
    conda:
        os.path.join(workflow.basedir, "envs", "assembly.yaml")
    shell:
        "kmc -k21 -ci10 -v -fq {input.longReads} {wildcards.sampleID} {output.dir} > {output.kmcLOG} 2>&1"

# This is the helper function that opens the kmc log file and reads the number of unique k-mers
# For a given sequence of length L,  and a k-mer size of k, the total k-mer’s possible will be given by ( L – k ) + 1.
# This can be used to estimate the genome size by counting the number of unique k-mers in the sample.
# Since the genome length is much bigger than the k-mer size (L >> 21), the genome size can be estimated as the number of unique k-mers.
# Since the the estimated genome size is used to check if the chromosome assembly is complete or not, we use 80% of the estimated genome size to be on the safe side. This could obviously be adjusted.
def get_kmers(sampleID):
    with checkpoints.kmc.get(sampleID=sampleID).output.kmcLOG.open() as file:
        for line in file:
            if "No. of unique counted k-mers" in line:
                return int(float(re.search(r"No. of unique counted k-mers\s*:\s*([\d\.eE+-]+)", line).group(1)) * 0.8)

# Yes, I am running a snakemake/snaketool workflow within a snakemake workflow. Here I just want to use it as an example on how to use the helper function (see params: chromosomeSize)
rule hybracter_long:
    input:
        longReads=lookup(
                query="sampleID == '{sampleID}' & group == 'ONTonly'",
                within=sampleDF,
                cols="longReadFastq",
        ),  
# this is added, so that kmc is run before this rule
        kmc= lambda wildcards: checkpoints.kmc.get(sampleID=wildcards.sampleID).output.kmcLOG,
    output:
        dir=directory(os.path.join(RESULTPATH, "{sampleID}", "hybracter_long")),
        summary=os.path.join(RESULTPATH, "{sampleID}", "hybracter_long", "FINAL_OUTPUT", "hybracter_summary.tsv"),
    threads: workflow.cores * 1
    log:
        os.path.join(LOGPATH, "{sampleID}", "logs", "{sampleID}_hybracter_long.log"),
    conda:
        os.path.join(workflow.basedir, "envs", "assembly.yaml")
    params:
# this uses the helper function to get the genome size (or 80% of that number, see above)
        chromosomeSize = lambda wildcards: str(get_kmers(wildcards.sampleID)),
        plassemblerDatabase=config["tools"]["plassembler"]["dbpath"],
    shell:
        """
        hybracter long-single \
        -l {input.longReads} \
        -s {wildcards.sampleID} \
        -c {params.chromosomeSize} \
        -o {output.dir} \
        -t {threads} \
        --databases {params.plassemblerDatabase} \
        --no_medaka > {log} 2>&1
        """

I know and appreciate that you want to keep hybracter as simple and stable as possible, and adding an additional dependency like KMC would add complexity. However, I think that this could add a lot of value in terms of being able to automate the assembly process. Also, KMC is actively developed, so any bugs that could cause problems will also be fixed.

Best wishes, Richard

gbouras13 commented 2 months ago

Hi @richardstoeckl ,

This is a very reasonable request.

Give you have done 90% of the implementation here , I certainly can't say no! - hopefully this week I'll find some time to push an update with this and a couple of other things, including the archaea reorientation :)

George

gbouras13 commented 2 months ago

@richardstoeckl ,

I had a look at this this arvo. Are you getting reasonable 21mer counts for your data? They are a massive overcount for me (even with pretty new Nanopore data). With R9 SUP I get 9.5M for s aureus (should be 2.8M), and with R10.4.1 I get 4.5M for e faecalis (should be 3M). Therefore, I am reticent to include this (especially for old nanopore data!)

Disregard ^^ I wasn't using -ci10 - works fine after I do that :)

George

gbouras13 commented 2 months ago

@richardstoeckl - your code is amazing. Seems to be working out of the box. I'll run it through the CI test suite and merge it - feel free to make a cosmetic PR if you want to be recognised on the repo as a contributor, I'll gladly merge it - thanks again!

George

richardstoeckl commented 2 months ago

@richardstoeckl - your code is amazing. Seems to be working out of the box. I'll run it through the CI test suite and merge it - feel free to make a cosmetic PR if you want to be recognised on the repo as a contributor, I'll gladly merge it - thanks again!

George

That would be an honor! Glad to be able to help improve this amazing pipeline :)

gbouras13 / hybracter

[Feature request] Optional automatic genome size estimation #90