bioforensics / yeat

YEAT: Your Everyday Assembly Tool
Other
1 stars 0 forks source link

Calculating PacBio Hifi data sample's genome size with MASH #42

Open danejo3 opened 1 year ago

danejo3 commented 1 year ago

In #37, calculating the genome size of PacBio Hifi-reads were explored. One of the curious cases that came out of it was that the estimated size did not match the expected size. Sometimes the estimated size calculated by MASH was x2, x3, or more.

The data that was used for testing was Ecoli K12. Ecoli K12 has a genome size of ~4.6 mb (see paper here). It is a very well studied bacteria.

Below is a few MASH calculations.

command parameters estimated genome size coverage
mash sketch [fastq] -r 33.6 mb 40x
mash sketch [fastq] -k 32 -r 51.3 mb 24x
mash sketch [fastq] -k 32 -m 3 -r 11 mb 117x
mash sketch [fastq] -k 21 -m 3 -r 8.7 mb 144x

-r = read set -k = k-mer size -m = Minimum copies of each k-mer required to pass noise filter for reads.

While the estimation of the genome was quite off, the largest assembled contig was 4.46 mb with Canu (242 contigs) and 4.64 mb with Flye (1 contig)--close to the expected genome size.

Interestingly, Flye was able to assemble the entire ecoli genome!

In addition to ecoli data, tests on SARS-CoV-2 Omicron variant data was tested. This data was collected from PacBio's public data here.

The expected genome size of SARS-CoV-2 is about 29.8 kb to 29.9 kb. Like the ecoli data, MASH was estimating the genome size x2, x3, or more.

Ran the SARS-CoV-2 data with Canu only because the reads were too short for Flye. However, in order to run Canu with the dataset, the following flags were added to the assembly.

canu -pacbio-hifi seq/input/sample.fq.gz maxThreads=16 -p sample -d analysis/canu genomeSize=29.9k minReadLength=550 minOverlapLength=50 minInputCoverage=0.4 stopOnLowCoverage=0.4

After running the assembler, only 3 contigs were assembled with the longest having the length of 2900 bp--way too short.

Talked to the analysts about the dataset and it was determined that the dataset were extremely poor.

As of now, I am unsure why MASH is getting highly overestimated genome sizes.

A suggestion was made to downsample the reads because of the extreme depth of coverage that Hifi reads provides; however, even then, the estimated size was still off depending on the downsampled coverage.

downsample reads = (total reads * coverage) / (average read length)

Used seqtk to downsample.