DecodeGenetics / Ratatosk

Hybrid error correction of long reads using colored de Bruijn graphs
BSD 2-Clause "Simplified" License
94 stars 7 forks source link

adjustable kmer size #12

Closed HenrivdGeest closed 4 years ago

HenrivdGeest commented 4 years ago

I am using quite long illumina-quality like reads for a plant genome as the short reads input. However the correction on whole genome scale does not work accurate enough to really make a difference. My long reads are corrected only partially. One thing I am thinking is that since my short reads are quite large, upto a kilobase, and therefore the debruijn graph could be build using larger kmer sizes that de default 31/63 (?) combo. I think the 63 if is fine for mapping the long reads back, but the illumina kmer size for building the graph could be bigger. Any chance this can be increased? (either by me changing the source code or via a parameter). Thanks so far for this great software package! A question related, can I read from the logs also something on the unitig size that is constructed? That might give me a hint already. I can image that for human genome this is quite a bit bigger than for a repetitive plant.

GuillaumeHolley commented 4 years ago

Hi @HenrivdGeest,

Providing advanced options to modify k1 and k2 individually would not be an issue. Something to consider is that the size of the k-mer used to build the graph is also the size of the matches between the graph and the long reads. So if you would pick k1>31, this would definitely increase the contiguity of the graph on the first correction pass but it would also decrease the probability of finding shared k1-mers between the graph and the long reads. It is not necessarily a problem though: You would get less matches for the first correction pass but the better graph contiguity would probably make up for it. It's a trade off you would have to play with.

You could also consider 2-3 runs of Ratatosk where the output corrected reads of run x are the input long reads of run x+1. You could have k1/k2=31/63 on the first run, then k1/k2=95/121 on the second run, etc.

Finally, something to consider is the insert size. By default, Ratatosk uses 500bp as short read insert size but given the length of your reads, you might want to increase that to 1kbp. I can also add this as an advanced parameter.

Let me now if any of this is of interest to you.

HenrivdGeest commented 4 years ago

Hi, thanks, I think both options would be nice. I do like your suggestion on doing it with a 2 pass, to circumvent that reads might be missed in the first pass. I would disable trimming for the first pass(es) and enable it for the last pass. I need to enable it because I see many chimeric long reads in my set, which really hamper the assembler (hifiasm) Another thing, my short read data is single end. I already cut the reads in 300bp reads, but that gave worse performance. So I am using the reads as single reads of around ~1kb. I am also wondering what happens with a part of a read which is highly repetitive. Will it be polished by the consensus of the repeat or will it not be polished?

jelber2 commented 4 years ago

You might be interested in this: https://github.com/chhylp123/hifiasm/issues/33

Also, BBMap/BBTools (https://sourceforge.net/projects/bbmap/) has a chimeric read detector called icecreamfinder.sh that might help (not sure).

GuillaumeHolley commented 4 years ago

Alright, I'm going to work on that today and hopefully, it will be available by the end of the day. Regarding the length of your "short" reads, Ratatosk considers paired-end reads as well as single end reads so I would definitely keep your short reads the way they are for better contiguity and color information in the graph. In the README file, I'm writing that Ratatosk works best with paired-end reads only for the general case where it is expected the length of the Illumina reads are 150bp +/- 100bp. Regarding your repeat question, it all depends on the number of matches found between the graph and the long read in the repeat region. A repeat is going to create cycles and/or very branching subgraphs so Ratatosk needs enough k-mer matches (exact or inexact) in that region to get the length of the repeat right. That's something that we want to improve for the next release by giving the repeats special treatment.

GuillaumeHolley commented 4 years ago

Hi @HenrivdGeest and @jelber2 ,

I have pushed a new version of Ratatosk that comes with a few new advanced options. As we discussed, there are 3 options of interest to you:

   -i, --insert_sz                 Insert size of the input short reads (default: 500)
   -k, --k1                        Length of short k-mers for 1st pass correction (default: 31)
   -K, --k2                        Length of long k-mers for 2nd pass correction (default: 63)

In your case, I would use -i 1000 and set -k/-K with whatever lengths you want to play with. If you want to use -k/-K larger than 63, you need to recompile Ratatosk with an extra compile option. It's very easy and all described here. Note that if you decide to do so, I would advise to make a new fresh clone of Bifrost (delete your local version and pull the new one) because my cmake install could use a bit of work.

Finally, I haven't tested changing any of those advanced parameters so let me know if it works and if you see any improvements.

jelber2 commented 4 years ago

@GuillaumeHolley great! I look forward to comparing --k1 31 and --k1 41 and its impact on assemblies!

GuillaumeHolley commented 4 years ago

Let me know how it goes :)

jelber2 commented 4 years ago

Comparison of simulated PacBio CLR reads corrected with Illumina for a Drosophila melanogaster genome assembly (haploid assembly "converted" to diploid with ~2% heterozygosity). Using Ratatosk commit 9e247317

# Use Drosophila melongaster PacBio assembly
cd /genetics/elbers/test/fly2
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/401/745/GCA_003401745.1_ASM340174v1/GCA_003401745.1_ASM340174v1_genomic.fna.gz

# Use BCFtools 1.10.2 and SAMtools 1.10
conda activate bcftools1.10.2

# Use Seqtk to convert soft-masked bases to upper-case bases, also compress with bgzip
# https://github.com/lh3/seqtk
/genetics/elbers/bin/seqtk/seqtk seq -U \
GCA_003401745.1_ASM340174v1_genomic.fna.gz | \
bgzip -@75 > GCA_003401745.1_ASM340174v1_genomic.fna_upper.fasta.gz

# Convert to diploid with approximately 2% heterozygosity rate, max indels=20bp
# mutate.sh part of BBTools/BBMap https://sourceforge.net/projects/bbmap/
/genetics/elbers/bbmap-38.86/mutate.sh \
in=GCA_003401745.1_ASM340174v1_genomic.fna_upper.fasta.gz \
ow=t \
vcf=GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.vcf.gz \
out=GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.fasta.gz \
ziplevel=6 \
ploidy=2 \
subrate=0.0192 \
indelrate=0.001 \
maxindel=20 \
nohomopolymers=t \
hetrate=1 2> GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.fasta.log.txt

# genome size
bgzip -@75 -cd GCA_003401745.1_ASM340174v1_genomic.fna_upper.fasta.gz | \
grep -v ">"|wc -m
# 140687135

# 2% heterozygosity is how many bases
calc 0.02\*140687135
# 0.02*140687135 = 2813742.700000

# number of mutations added
bgzip -@75 -cd GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.vcf.gz| \
grep -v "^#"|wc -l
# 2830139

# ~2% het rate

# get only 1 haplotype from the "diploid" reference
bgzip -@75 -dc GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.fasta.gz|\
/genetics/elbers/bin/seqtk/seqtk seq -L0|paste - - |grep -P "haplo_0\t"| \
tr '\t' '\n' |\
/genetics/elbers/bin/seqtk/seqtk seq -L60 |\
bgzip -@75 \
> GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.haplotype0.fasta.gz

# make reference for randomreads.sh
# randomreads.sh part of BBTools/BBMap https://sourceforge.net/projects/bbmap/
/genetics/elbers/bbmap-38.86/randomreads.sh build=1 \
ref=GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.fasta.gz \
illuminanames=t addslash=t \
pacbio=t pbmin=0.13 pbmax=0.17 \
reads=100 paired=f \
gaussianlength=t \
minlength=1000 midlength=20000 maxlength=100000 \
out=/dev/null

# make 60x haploid coverage for Illumina reads
/genetics/elbers/bbmap-38.86/randomreads.sh build=1 \
ref=GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.fasta.gz \
illuminanames=t addslash=t \
coverage=30 paired=t maxinsert=550 mininsert=450 \
out1=illumina1.fastq.gz out2=illumina2.fastq.gz > random_reads_illumina.log 2>&1

# make 30x haploid coverage for PacBio CLR reads
# error rate from 13 - 15 % minimum 1000bp midlength 20000bp maximum 30000bp
cd /genetics/elbers/test/fly2

/genetics/elbers/bbmap-38.86/randomreads.sh build=1 \
ow=t seed=1 \
ref=GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.fasta.gz \
illuminanames=t addslash=t \
pacbio=t pbmin=0.13 pbmax=0.15 \
coverage=15 paired=f \
gaussianlength=t \
minlength=1000 midlength=20000 maxlength=30000 \
out=pacbio.fastq.gz > random_reads_pacbio.log 2>&1

# Ratatosk correction of raw PacBio CLR reads with initial k=41
/genetics/elbers/bin/Ratatosk/build/src/Ratatosk -v --k1 41 -c 75 \
-s illumina1.fastq.gz -s illumina2.fastq.gz -l pacbio.fastq.gz \
-o ratatosk.k41.pacbio > ratatosk.k41.log 2>&1

# De novo assemble Ratatosk corrected k=41 reads with Flye 2.8.1 
# using --pacbio-corr and --keep-haplotypes options
# https://github.com/fenderglass/Flye/archive/2.8.1.tar.gz
rm -fr /genetics/elbers/test/fly2/flye
mkdir -p /genetics/elbers/test/fly2/flye
cd /genetics/elbers/test/fly2/flye
/genetics/elbers/Flye-2.8-1/bin/flye --pacbio-corr \
../ratatosk.k41.pacbio.fasta --keep-haplotypes --out-dir ./ \
--threads 75 > flye.log 2>&1

# this failed to assemble because median overlap rate was > 3%

# De novo assemble Ratatosk corrected k=41 reads with Flye --pacbio-raw
# and --keep-haplotypes options
cd /genetics/elbers/test/fly2/flye
/genetics/elbers/Flye-2.8-1/bin/flye --pacbio-raw \
../ratatosk.k41.pacbio.fasta --keep-haplotypes --out-dir ./ \
--threads 75 > flye-raw.log 2>&1

# Ratatosk correction of raw PacBio CLR reads with initial k=31
/genetics/elbers/bin/Ratatosk/build/src/Ratatosk -v --k1 31 -c 75 \
-s illumina1.fastq.gz -s illumina2.fastq.gz -l pacbio.fastq.gz \
-o ratatosk.k31.pacbio > ratatosk.k31.log 2>&1

# De novo assemble Ratatosk corrected k=31 reads with Flye --pacbio-corr 
# and --keep-haplotypes options
rm -fr /genetics/elbers/test/fly2/flye2
mkdir -p /genetics/elbers/test/fly2/flye2
cd /genetics/elbers/test/fly2/flye2
/genetics/elbers/Flye-2.8-1/bin/flye --pacbio-corr \
../ratatosk.k31.pacbio.fasta --keep-haplotypes --out-dir ./ \
--threads 75 > flye.log 2>&1

# Ratatosk correction of Ratatosk corrected k=31 reads with initial k=41
/genetics/elbers/bin/Ratatosk/build/src/Ratatosk -v --k1 41 -c 75 \
-s illumina1.fastq.gz -s illumina2.fastq.gz -l ratatosk.k31.pacbio.fasta \
-o ratatosk.k31.k41.pacbio > ratatosk.k31.k41.log 2>&1

# De novo assemble Ratatosk corrected k=31 reads recorrected with k=41
# with Flye --pacbio-corr and --keep-haplotypes options
rm -fr /genetics/elbers/test/fly2/flye3
mkdir -p /genetics/elbers/test/fly2/flye3
cd /genetics/elbers/test/fly2/flye3
/genetics/elbers/Flye-2.8-1/bin/flye --pacbio-corr \
../ratatosk.k31.k41.pacbio.fasta --keep-haplotypes --out-dir ./ \
--threads 75 > flye.log 2>&1

# De novo assemble raw PacBio reads
# with Flye --pacbio-raw and --keep-haplotypes options
rm -fr /genetics/elbers/test/fly2/flye4
mkdir -p /genetics/elbers/test/fly2/flye4
cd /genetics/elbers/test/fly2/flye4
/genetics/elbers/Flye-2.8-1/bin/flye --pacbio-raw \
../pacbio.fastq.gz --keep-haplotypes --out-dir ./ \
--threads 75 > flye-run.log 2>&1

# Make copies of Flye assemblies
cd /genetics/elbers/test/fly2
cp /genetics/elbers/test/fly2/flye/assembly.fasta flye-pacbio-raw-keep-haplo-ratatosk.k41.fasta
cp /genetics/elbers/test/fly2/flye2/assembly.fasta flye-pacbio-corr-keep-haplo-ratatosk.k31.fasta
cp /genetics/elbers/test/fly2/flye3/assembly.fasta flye-pacbio-corr-keep-haplo-ratatosk.k31.k41.fasta
cp /genetics/elbers/test/fly2/flye4/assembly.fasta flye-pacbio-raw-keep-haplo.fasta

# Run Quast 5.1.0rc1 on haplotype0 of reference and flye results
# https://github.com/ablab/quast
# Remember that I converted GCA_003401745.1 to "diploid" with ~2% het rate
mkdir -p /genetics/elbers/test/fly2/quast5
cd /genetics/elbers/test/fly2/quast5
conda activate quast5.0.2
# still using QUAST 5.1.0rc1, just dependencies from quast 5.0.2 conda install
/genetics/elbers/quast-5.1.0rc1/quast.py --fragmented --fast --threads 75 \
-r ../GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.haplotype0.fasta.gz \
--eukaryote -o ./ ../flye-pacbio-corr-keep-haplo-ratatosk.k31.fasta \
../flye-pacbio-raw-keep-haplo-ratatosk.k41.fasta \
../flye-pacbio-corr-keep-haplo-ratatosk.k31.k41.fasta \
../flye-pacbio-raw-keep-haplo.fasta \
> quast.log 2>&1 &
conda deactivate

# QUAST results but rename assemblies
# so flye-pacbio-raw-keep-haplo.fasta=flye_raw
# so flye-pacbio-corr-keep-haplo-ratatosk.k31.fasta=k31
# so flye-pacbio-raw-keep-haplo-ratatosk.k41.fasta=k41
# so flye-pacbio-corr-keep-haplo-ratatosk.k31.k41.fasta=k31.k41
perl -pe "s/flye_pacbio_corr_keep_haplo_ratatosk.//g" \
/genetics/elbers/test/fly2/quast5/report.tsv | \
perl -pe "s/flye_pacbio_raw_keep_haplo_ratatosk.//g" | \
perl -pe "s/flye_pacbio_raw_keep_haplo/flye_raw/g" | \
column -ts $'\t'

Assembly                     k31         k41          k31.k41      flye_raw
# contigs (>= 0 bp)          132         352          158          301
# contigs (>= 1000 bp)       131         333          156          277
# contigs (>= 5000 bp)       126         238          148          198
# contigs (>= 10000 bp)      117         192          126          155
# contigs (>= 25000 bp)      105         157          109          123
# contigs (>= 50000 bp)      88          127          89           91
Total length (>= 0 bp)       120546892   133934135    131609107    134023395
Total length (>= 1000 bp)    120545982   133920513    131607527    134006576
Total length (>= 5000 bp)    120526866   133606513    131586885    133769536
Total length (>= 10000 bp)   120459333   133290286    131426311    133480122
Total length (>= 25000 bp)   120256840   132670162    131129782    132966899
Total length (>= 50000 bp)   119630296   131555713    130371773    131725453
# contigs                    132         352          158          301
Largest contig               9942397     12326193     17064634     17041207
Total length                 120546892   133934135    131609107    134023395
Reference length             140694204   140694204    140694204    140694204
Reference GC (%)             42.05       42.05        42.05        42.05
N50                          4056694     3316413      7365893      8571016
NG50                         3701606     3127792      6147490      6120163
N90                          723206      363243       592870       679437
NG90                         -           179968       203468       218073
L50                          10          11           6            5
LG50                         13          12           7            6
L90                          31          54           26           24
LG90                         -           80           50           41
# misassemblies              26          56           44           38
# misassembled contigs       18          42           32           30
Misassembled contigs length  18435008    34001255     42829080     63273702
# local misassemblies        19          35           22           10
# scaffold gap ext. mis.     0           0            0            0
# scaffold gap loc. mis.     0           0            0            0
# unaligned mis. contigs     0           0            0            2
# unaligned contigs          0 + 9 part  2 + 12 part  0 + 12 part  0 + 12 part
Unaligned length             21597       77357        29046        88282
Genome fraction (%)          85.487      94.620       93.299       95.007
Duplication ratio            1.002       1.005        1.002        1.001
# N's per 100 kbp            0.00        0.00         0.00         0.00
# mismatches per 100 kbp     942.95      943.53       942.39       913.91
# indels per 100 kbp         63.97       93.40        63.93        172.23
Largest alignment            9942397     9013642      14839624     14841058
Total aligned length         120481525   133783449    131515984    133819905
NA50                         3701606     2886908      6147488      6120163
NGA50                        3343034     2768541      5509015      6120163
NA90                         668779      285834       453663       487312
NGA90                        -           174809       129793       170560
LA50                         11          13           6            7
LGA50                        13          14           7            7
LA90                         36          60           30           31
LGA90                        -           87           62           53

# Analyze quality scores of reads and assemblies using
# YAK version r55 - yet another k-mer analyzer
# https://github.com/lh3/yak
cd /genetics/elbers/test/fly2

/genetics/elbers/yak/yak count -b37 -t75 -o raw.yak \
<(bgzip -@75 -cd illumina1.fastq.gz) \
<(bgzip -@75 -cd illumina2.fastq.gz) > sr.raw.yak.log 2>&1

cd /genetics/elbers/test/fly2

rm -f test2
for ref in `ls -1 flye-pacbio-corr-keep-haplo-ratatosk.k31.fasta \
flye-pacbio-raw-keep-haplo-ratatosk.k41.fasta \
flye-pacbio-corr-keep-haplo-ratatosk.k31.k41.fasta \
ratatosk.k31.k41.pacbio.fasta \
ratatosk.k31.pacbio.fasta \
flye-pacbio-raw-keep-haplo.fasta \
ratatosk.k41.pacbio.fasta`;do
/genetics/elbers/yak/yak qv -t75 /genetics/elbers/test/fly2/raw.yak \
${ref} 2>/dev/null |tail -n +1034 >> test2
done

rm -f test3
for ref in `ls -1 flye-pacbio-corr-keep-haplo-ratatosk.k31.fasta \
flye-pacbio-raw-keep-haplo-ratatosk.k41.fasta \
flye-pacbio-corr-keep-haplo-ratatosk.k31.k41.fasta \
ratatosk.k31.k41.pacbio.fasta \
ratatosk.k31.pacbio.fasta \
flye-pacbio-raw-keep-haplo.fasta \
ratatosk.k41.pacbio.fasta`;do
echo $ref >> test3
done

paste test3 <(cut -f 2-3 test2) |sort -k2,2 | \
sed '1i Sequences_or_assembly\traw_quality_value\tadjusted_quality_value' |column -tx

Sequences_or_assembly                               raw_quality_value  adjusted_quality_value
ratatosk.k41.pacbio.fasta                           17.614             17.616
ratatosk.k31.pacbio.fasta                           26.866             26.883
flye-pacbio-raw-keep-haplo.fasta                    26.961             26.967
flye-pacbio-raw-keep-haplo-ratatosk.k41.fasta       30.019             30.033
ratatosk.k31.k41.pacbio.fasta                       30.413             30.483
flye-pacbio-corr-keep-haplo-ratatosk.k31.fasta      32.008             32.029
flye-pacbio-corr-keep-haplo-ratatosk.k31.k41.fasta  32.606             32.629
GuillaumeHolley commented 4 years ago

Thanks for your experiments @jelber2!

jelber2 commented 4 years ago

You are welcome!

jelber2 commented 4 years ago

Here is another iteration of polishing with first --k1 31, then --k1 41, and finally --k1 45

# Ratatosk correction of Ratatosk corrected k=31 then Ratatosk corrected with
# k=41 and now initial k=45
/genetics/elbers/bin/Ratatosk/build/src/Ratatosk -v --k1 45 -c 75 \
-s illumina1.fastq.gz -s illumina2.fastq.gz -l ratatosk.k31.k41.pacbio.fasta \
-o ratatosk.k31.k41.k45.pacbio > ratatosk.k31.k41.k45.log 2>&1

# De novo assemble Ratatosk corrected k=31 reads recorrected with k=41
# with Flye --pacbio-corr and --keep-haplotypes options
rm -fr /genetics/elbers/test/fly2/flye10
mkdir -p /genetics/elbers/test/fly2/flye10
cd /genetics/elbers/test/fly2/flye10
/genetics/elbers/Flye-2.8-1/bin/flye --pacbio-corr \
../ratatosk.k31.k41.k45.pacbio.fasta --keep-haplotypes --out-dir ./ \
--threads 75 > flye.log 2>&1

cd /genetics/elbers/test/fly2

cp flye10/assembly.fasta flye-pacbio-corr-keep-haplo-ratatosk.k31.k41.k45.fasta

# Run Quast 5.1.0rc1 on haplotype0 of reference and flye results
# https://github.com/ablab/quast
# Remember that I converted GCA_003401745.1 to "diploid" with ~2% het rate
mkdir -p /genetics/elbers/test/fly2/quast5
cd /genetics/elbers/test/fly2/quast5
conda activate quast5.0.2
# still using QUAST 5.1.0rc1, just dependencies from quast 5.0.2 conda install
/genetics/elbers/quast-5.1.0rc1/quast.py --fragmented --fast --threads 75 \
-r ../GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.haplotype0.fasta.gz \
--eukaryote -o ./ ../flye-pacbio-corr-keep-haplo-ratatosk.k31.fasta \
../flye-pacbio-raw-keep-haplo-ratatosk.k41.fasta \
../flye-pacbio-corr-keep-haplo-ratatosk.k31.k41.fasta \
../flye-pacbio-corr-keep-haplo-ratatosk.k31.k41.k45.fasta \
../flye-pacbio-raw-keep-haplo.fasta \
> quast.log 2>&1 &
conda deactivate

# QUAST results but rename assemblies
# so flye-pacbio-raw-keep-haplo.fasta=flye_raw
# so flye-pacbio-corr-keep-haplo-ratatosk.k31.fasta=k31
# so flye-pacbio-raw-keep-haplo-ratatosk.k41.fasta=k41
# so flye-pacbio-corr-keep-haplo-ratatosk.k31.k41.fasta=k31.k41
# so flye-pacbio-corr-keep-haplo-ratatosk.k31.k41.k45.fasta=k31.k41.k45
perl -pe "s/flye_pacbio_corr_keep_haplo_ratatosk.//g" \
/genetics/elbers/test/fly2/quast5/report.tsv | \
perl -pe "s/flye_pacbio_raw_keep_haplo_ratatosk.//g" | \
perl -pe "s/flye_pacbio_raw_keep_haplo/flye_raw/g" | \
column -ts $'\t'

Assembly                     k31         k41          k31.k41      k31.k41.k45  flye_raw
# contigs (>= 0 bp)          132         352          158          158          301
# contigs (>= 1000 bp)       131         333          156          154          277
# contigs (>= 5000 bp)       126         238          148          140          198
# contigs (>= 10000 bp)      117         192          126          125          155
# contigs (>= 25000 bp)      105         157          109          104          123
# contigs (>= 50000 bp)      88          127          89           89           91
Total length (>= 0 bp)       120546892   133934135    131609107    132172368    134023395
Total length (>= 1000 bp)    120545982   133920513    131607527    132169717    134006576
Total length (>= 5000 bp)    120526866   133606513    131586885    132125079    133769536
Total length (>= 10000 bp)   120459333   133290286    131426311    132017287    133480122
Total length (>= 25000 bp)   120256840   132670162    131129782    131632757    132966899
Total length (>= 50000 bp)   119630296   131555713    130371773    131072498    131725453
# contigs                    132         352          158          158          301
Largest contig               9942397     12326193     17064634     19032407     17041207
Total length                 120546892   133934135    131609107    132172368    134023395
Reference length             140694204   140694204    140694204    140694204    140694204
Reference GC (%)             42.05       42.05        42.05        42.05        42.05
N50                          4056694     3316413      7365893      8573362      8571016
NG50                         3701606     3127792      6147490      7619130      6120163
N90                          723206      363243       592870       549863       679437
NG90                         -           179968       203468       203799       218073
L50                          10          11           6            5            5
LG50                         13          12           7            6            6
L90                          31          54           26           25           24
LG90                         -           80           50           48           41
# misassemblies              26          56           44           38           38
# misassembled contigs       18          42           32           28           30
Misassembled contigs length  18435008    34001255     42829080     61981261     63273702
# local misassemblies        19          35           22           26           10
# scaffold gap ext. mis.     0           0            0            0            0
# scaffold gap loc. mis.     0           0            0            0            0
# unaligned mis. contigs     0           0            0            0            2
# unaligned contigs          0 + 9 part  2 + 12 part  0 + 12 part  0 + 12 part  0 + 12 part
Unaligned length             21597       77357        29046        22786        88282
Genome fraction (%)          85.487      94.620       93.299       93.713       95.007
Duplication ratio            1.002       1.005        1.002        1.002        1.001
# N's per 100 kbp            0.00        0.00         0.00         0.00         0.00
# mismatches per 100 kbp     942.95      943.53       942.39       943.76       913.91
# indels per 100 kbp         63.97       93.40        63.93        64.43        172.23
Largest alignment            9942397     9013642      14839624     17337786     14841058
Total aligned length         120481525   133783449    131515984    132121947    133819905
NA50                         3701606     2886908      6147488      6147347      6120163
NGA50                        3343034     2768541      5509015      6147347      6120163
NA90                         668779      285834       453663       442583       487312
NGA90                        -           174809       129793       145149       170560
LA50                         11          13           6            6            7
LGA50                        13          14           7            6            7
LA90                         36          60           30           31           31
LGA90                        -           87           62           60           53

cd /genetics/elbers/test/fly2

rm -f test2
for ref in `ls -1 flye-pacbio-corr-keep-haplo-ratatosk.k31.fasta \
flye-pacbio-raw-keep-haplo-ratatosk.k41.fasta \
flye-pacbio-corr-keep-haplo-ratatosk.k31.k41.fasta \
flye-pacbio-corr-keep-haplo-ratatosk.k31.k41.k45.fasta \
ratatosk.k31.k41.k45.pacbio.fasta \
ratatosk.k31.k41.pacbio.fasta \
ratatosk.k31.pacbio.fasta \
flye-pacbio-raw-keep-haplo.fasta \
ratatosk.k41.pacbio.fasta`;do
/genetics/elbers/yak/yak qv -t75 /genetics/elbers/test/fly2/raw.yak \
${ref} 2>/dev/null |tail -n +1034 >> test2
done

rm -f test3
for ref in `ls -1 flye-pacbio-corr-keep-haplo-ratatosk.k31.fasta \
flye-pacbio-raw-keep-haplo-ratatosk.k41.fasta \
flye-pacbio-corr-keep-haplo-ratatosk.k31.k41.fasta \
flye-pacbio-corr-keep-haplo-ratatosk.k31.k41.k45.fasta \
ratatosk.k31.k41.k45.pacbio.fasta \
ratatosk.k31.k41.pacbio.fasta \
ratatosk.k31.pacbio.fasta \
flye-pacbio-raw-keep-haplo.fasta \
ratatosk.k41.pacbio.fasta`;do
echo $ref >> test3
done

paste test3 <(cut -f 2-3 test2) |sort -k2,2 | \
sed '1i Sequences_or_assembly\traw_quality_value\tadjusted_quality_value' |column -tx

Sequences_or_assembly                                   raw_quality_value  adjusted_quality_value
ratatosk.k41.pacbio.fasta                               17.614             17.616
ratatosk.k31.pacbio.fasta                               26.866             26.883
flye-pacbio-raw-keep-haplo.fasta                        26.961             26.967
flye-pacbio-raw-keep-haplo-ratatosk.k41.fasta           30.019             30.033
ratatosk.k31.k41.pacbio.fasta                           30.413             30.483
flye-pacbio-corr-keep-haplo-ratatosk.k31.fasta          32.008             32.029
ratatosk.k31.k41.k45.pacbio.fasta                       32.332             32.442
flye-pacbio-corr-keep-haplo-ratatosk.k31.k41.fasta      32.606             32.629
flye-pacbio-corr-keep-haplo-ratatosk.k31.k41.k45.fasta  32.655             32.679
GuillaumeHolley commented 4 years ago

Thanks @jelber2. Seems like the 3-rounds Ratatosk correction (k=31/41/45) is doing well.

jelber2 commented 4 years ago

Yeah, it seems like (and this likely assembly/individual-specific), that one could find some optimum choice of iterative --k1 values to approximate something like HiFi reads.

HenrivdGeest commented 4 years ago

I also played with different kmer sizes. I found that for bigger than 63 kmers I need to use a different binary, it cannot run both the small and big kmers. So for k31/k64 I use the default binary of Ratatosk, and for a k61/k95 I use te recompiled Ratatosk version. After 4 iterations of increasing kmer lengths, I can see that it does improve a bit on quality. Although in my case it seems a minor gain, maybe a few percent more accurate. I do see many patches of uncorrected, or hardly, regions when I map back the error corrected ratatosk reads. These might be repetitive regions, or problems for the debruijn graph to resolve. Nonetheless, I am impressed with the polished quality of the long reads, as well as the speed this all operates. I did not test this, but it might be faster than going for a full canu round, including error correction of the raw reads. Although it would require a bit more testinng on how the assembly quality is for a plant, compared to full canu. Something to think about :)

GuillaumeHolley commented 4 years ago

Hi @HenrivdGeest,

As mentioned earlier, you indeed need to recompile Ratatosk to use a k-mer size larger than 63. However, you shouldn't have to use two different binaries: if you recompile Ratatosk for k=95, all k-mer lengths smaller or equal to 95 should be available in the binary. Regarding the error rate, Ratatosk is rather conservative in its correction: if the subsequences to correct are too long, if there are too many similar paths possible, if the traversed subgraph is too branching or there is only poor coloring information (among other reasons), Ratatosk will prefer not to correct rather than making the wrong correction call. In the future, we are thinking about integrating phasing information in the correction to improve the quality (using WhatsHap for example) so that might be something of interest to you.

Do not hesitate to post here any benchmark you have about the read data set after/before correction or the assembly of the reads. I do not work with plant genomes but I would be happy to provide any help if I can.

GuillaumeHolley commented 4 years ago

Maybe one last thing to try @HenrivdGeest would be smaller k1/k2-mers for the initial Ratatosk run: maybe 25/45? That would increase the number of matches between the graph and the long reads but decrease the contiguity of the graph. It might actually be beneficial in your case.

HenrivdGeest commented 4 years ago

This is what I tried: phase=4; /home/geest/bin/Ratatosk_k64 -s short.fasta -l long_with_errors.fasta a -c 128 -o phase1${phase} --verbose --insert_sz 1000 --k1 21 --k2 51; /home/geest/bin/Ratatosk_k64 -s short -l phase1${phase}.fasta -c 64 -o phase2${phase} --verbose --insert_sz 1000 --k1 51 --k2 63 ; /home/geest/bin/Ratatosk_k96 -s short -l phase2${phase}.fasta -c 32 -o phase3${phase} --verbose --insert_sz 1000 --k1 61 --k2 95; /home/geest/bin/Ratatosk_k96 -s short -l phase3${phase}.fasta -c 32 -o phase4${phase} --verbose --insert_sz 1000 --k1 71 --k2 95 ; And these are the mapping numbers, measured with Alfred QC ( I took reads from a 4Mbase plant contig as a test)

uncorrected: MismatchRate 0.0338654 DeletionRate 0.0270461 InsertionRate 0.0353234 ErrorRate 0.096282 phase14.fasta.bam MismatchRate 0.00425731 DeletionRate 0.000926536 InsertionRate 0.0019838 ErrorRate 0.0071946 phase24.fasta.bam MismatchRate 0.0035878 DeletionRate 0.000633081 InsertionRate 0.00134147 ErrorRate 0.00558711 phase34.fasta.bam MismatchRate 0.00293067 DeletionRate 0.000530815 InsertionRate 0.000975063 ErrorRate 0.004459 phase44.fasta.bam MismatchRate 0.00290499 DeletionRate 0.000523178 InsertionRate 0.001924023 ErrorRate 0.00437431

to the error rate went from ~10% to 0,7% in the first round, and got to 0,4% after 4 iterations. Although especially the insertion rate got a nice boost from 0,2% to 0.1% from iteration 1 to iteration 4. Still I am impressed by the error rate just after 1 round.