Adding support to do basic pacbio hifi-read assembly with Flye and Canu

danejo3 commented 1 year ago

The purpose of this PR is to begin tackling #35.

This PR introduces basic assembly of PacBio HiFi reads. To assemble these reads, two new algorithms were added: 1) Canu and 2) Flye.

To run these algorithms, YEAT requires an input flag stating what kind of reads are to be assembled.

For example,

yeat --pacbio [hifi_data]

or when assembling paired-end short read data,

yeat --paired [short_read1] [short_read2]
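
As a rough illustration of the read-type requirement, the sketch below uses Python's argparse with a required, mutually exclusive group. It is not YEAT's actual parser: `build_parser` and `read_type` are hypothetical names, and only the `--pacbio` and `--paired` flags are taken from the examples above.

```python
import argparse


def build_parser():
    # Hypothetical parser: only the two flag names come from the PR description.
    parser = argparse.ArgumentParser(prog="yeat")
    # Exactly one read-type flag must be given.
    reads = parser.add_mutually_exclusive_group(required=True)
    reads.add_argument("--paired", nargs=2, help="paired-end short reads (two files)")
    reads.add_argument("--pacbio", nargs=1, help="PacBio HiFi reads (one file)")
    return parser


def read_type(args):
    # Report which kind of reads was supplied so the matching workflow can be chosen.
    return "paired" if args.paired else "pacbio"
```

With this layout, omitting both flags or passing both at once makes argparse exit with a usage error.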

When running Canu, users will need to add genomeSize=N to the extra_args of the "canu" section of the config. This parameter is required in order to run the assembler.

In this PR, YEAT still relies on the config file to fine-tune the assembler's algorithm.

For example, in the config, users can adjust the correctedErrorRate and other additional flags for Canu.

[
    {
        "algorithm": "canu",
        "extra_args": "correctedErrorRate=0.075 -trimmed -corrected"
    }
]
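
Because Canu cannot run without genomeSize, a config could be checked before any assembly starts. The following is a minimal sketch, assuming the config is a JSON list like the one above; the function name is hypothetical and this check is not part of the PR.

```python
import json


def canu_entries_missing_genome_size(config_text):
    # Parse the JSON config and collect canu entries whose extra_args lack genomeSize=...
    entries = json.loads(config_text)
    missing = []
    for entry in entries:
        if entry.get("algorithm") != "canu":
            continue
        tokens = entry.get("extra_args", "").split()
        if not any(token.startswith("genomeSize=") for token in tokens):
            missing.append(entry)
    return missing
```

Applied to the example config above, this would flag the canu entry, since its extra_args does not include genomeSize.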

The backbone of this PR is the Snakemake file Pacbio.smk. We created it separately from the Snakemake file for paired-end read assembly (Paired.smk) because the rules differ in the number of input and output files they expect. In the future, a class that returns the inputs and outputs of each rule would help consolidate these Snakemake files.
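
One possible shape for such a class is sketched below. Everything here is an assumption for illustration: the class name `AssemblyIO`, the `analysis` directory, and the `contigs.fasta` file name are placeholders, not YEAT's actual layout.

```python
from dataclasses import dataclass


@dataclass
class AssemblyIO:
    """Hypothetical helper describing the files one assembly rule reads and writes."""

    read_type: str  # "paired" or "pacbio"
    reads: list
    algorithm: str
    outdir: str = "analysis"  # placeholder directory name

    def inputs(self):
        # Paired-end assembly expects two read files; PacBio HiFi expects one.
        expected = 2 if self.read_type == "paired" else 1
        if len(self.reads) != expected:
            raise ValueError(
                f"{self.read_type} assembly expects {expected} read file(s), got {len(self.reads)}"
            )
        return list(self.reads)

    def outputs(self):
        # Placeholder output path; real assemblers name their contig files differently.
        return [f"{self.outdir}/{self.algorithm}/contigs.fasta"]
```

A single Snakemake file could then fill in each rule's input and output from such an object instead of hard-coding them per read type.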

Overall, the goal of this PR is to lay out the basic workflow for PacBio HiFi-read assembly.

danejo3 commented 1 year ago

UPDATE: Outdated and decided not to use subcommands.

~~Alrighty! I've got a pretty good PR ready for a prelim-review. @standage~~

~~Here are some example usages:~~

~~Example for short read assembly~~ ~~yeat -o sandbox1 short --paired short1.gz short2.gz config.cfg~~

~~Example for long read assembly (using Flye; I only have Flye available for long-read assembly as of now.) yeat -o sandbox2 long --pacbio long.gz config.cfg~~

~~Notice the specific ordering that the flags and inputs must be in.~~

~~yeat [options] {subcommand} {type of data} {read(s)} {config file}~~

~~To view the help info for each subcommand:~~

~~yeat {short | long} -h~~

~~As of right now, YEAT is only supporting long-read fastq files. I've learned that PacBio instruments produce 3 different kinds of files: 1) .bam, 2) .fasta, and 3) .fastq.~~

~~Feel free to explore the new CLI.~~

~~I have enforced people to specify the type of reads that they are assembling and what kind it is as well.~~

In the future, I would like to explore the following avenues:

Short-read assembly: Paired-end, Single-end, and Interleaved-end pairs

Long-read assembly: PacBio, Nanopore, and Sanger

Hybrid assembly

danejo3 commented 1 year ago

The long-read E. coli test data was downloaded from here:

curl -L -o ecoli.fastq https://sra-pub-src-1.s3.amazonaws.com/SRR10971019/m54316_180808_005743.fastq.1

This dataset was used in the Canu tutorial.

The file is ~2-3 GB. When gzipped, it drops down to 1.32 GB.

That is too large to add to the package as test data.

standage commented 1 year ago

Took about 90 minutes to run make testhifi on my machine, but the tests passed. I'll wait to merge until you've responded re: genome size estimation.

bioforensics / yeat

Adding support to do basic pacbio hifi-read assembly with Flye and Canu #37