bcgsc / LINKS

⛓ Long Interval Nucleotide K-mer Scaffolder
GNU General Public License v3.0
73 stars 15 forks source link

interpretation of output #36

Closed dcopetti closed 4 years ago

dcopetti commented 5 years ago

Hello, I am running LINKS to scaffold a repeat-rich plant genome (5 Gb, N50 197 kb, N70 71 kb) with ONT data (80 Gb, 16x, all above 5 kb reads). I trust the assembly, since it has been aligned to an optical map and the vast majority of the sequences align continuously to the restriction pattern.

I run it as LINKS.pl -f genome_v2.2.fa -s input_ONT_b123.fofn -k 23 -d 4000 -t 100 -a 0.2 -z 2000 -r genome.bloom -b base with a parameter sweep (-d from 1000 to 4000, -t 100 and 50, 8 combinations for now) and I am getting the first results - very little increase in N values (199 and 72 kb). I want to be strict in the scaffolding to avoid linking just by the presence of one repeat, so I decreased -a. I wonder if the output files contain some information I could use to tune the parameters and find a good combination.

The pairing_issues file has very large negative values:

Pairs unsatisfied in distance within a contig pair.  A-> <-B  WITH tig#825 -> -849140 <- tig#71, A=654381 nt (start:477924, end:477947) B=1445674 nt (start:674683, end:674660) CALCULATED DISTANCE APART: -849140 < -200
Pairs unsatisfied in distance within a contig pair.  rB-> <-A WITH tig#r.2180 -> -690599 <- tig#735, B=406356 nt (start:228961, end:228938) A=684072 nt (start:463638, end:463615) CALCULATED DISTANCE APART: -690599 < -200
Pairs unsatisfied in distance within a contig pair.  A-> <-B  WITH tig#8861 -> -157439 <- tig#12512, A=135117 nt (start:32672, end:32695) B=88749 nt (start:56994, end:56971) CALCULATED DISTANCE APART: -157439 < -200
Pairs unsatisfied in distance within a contig pair.  A-> <-B  WITH tig#8167 -> -150869 <- tig#17200, A=146588 nt (start:22003, end:22026) B=47450 nt (start:28284, end:28261) CALCULATED DISTANCE APART: -150869 < -200
Pairs unsatisfied in distance within a contig.  Pair (GAGACTAAATTCTGAATATGTTG - GGTAGTGTAAAGATAGGAGTAGC) on contig 1044 (592359 nt) Astart:365063 Aend:365086 Bstart:367293 Bend:367270 CALCULATED DISTANCE APART: 2230
Pairs unsatisfied in distance within a contig pair.  A-> <-B  WITH tig#1642 -> -844309 <- tig#942, A=474302 nt (start:196149, end:196172) B=617172 nt (start:568156, end:568133) CALCULATED DISTANCE APART: -844309 < -200

and this is what the log reports:

Total number of pairs extracted from -s input_ONT_b123.fofn: 14278810
At least one sequence/pair missing from contigs: 0
Assembled pairs: 617317 (1234634 sequences)
        Satisfied in distance/logic within contigs (i.e. -> <-, distance on target: 414425
        Unsatisfied in distance within contigs (i.e. distance out-of-bounds): 13350
        Unsatisfied pairing logic within contigs (i.e. illogical pairing ->->, <-<- or <-->): 903
        ---
        Satisfied in distance/logic within a given contig pair (pre-scaffold): 6557
        Unsatisfied in distance within a given contig pair (i.e. calculated distances out-of-bounds): 182082
        ---
Total satisfied: 420982 unsatisfied: 196335

Breakdown by distances (-d):
--------k-mers separated by 4000 bp (outer distance)--------
MIN:3600 MAX:4400 as defined by 4000 * 0.1
At least one sequence/pair missing:
Assembled pairs: 617317
        Satisfied in distance/logic within contigs (i.e. -> <-, distance on target: 414425
        Unsatisfied in distance within contigs (i.e. distance out-of-bounds): 13350
        Unsatisfied pairing logic within contigs (i.e. illogical pairing ->->, <-<- or <-->): 903
        ---
        Satisfied in distance/logic within a given contig pair (pre-scaffold): 6557
        Unsatisfied in distance within a given contig pair (i.e. calculated distances out-of-bounds): 182082

which values should I maximize or look after? I don't expect to increase the numbers by much, but if possible I would like to take advantage of the few >30 kb reads I have to put some pieces together. Would it help to use error corrected (with FMLRC) reads? Thanks,

Dario

warrenlr commented 5 years ago

Hi Dario, thank you for your message and interest in LINKS. What is the est. error rate in your ONT reads? If it's 15%, I suggest decreasing k, especially if your genome size <1Gbp.

-a 0.2 is very stringent. We usually don't go below 0.3.

If RAM isn't limiting, you may also decrease -t (since you only have 5X coverage -- but difficult to guess without knowing your genome size and machine specs).

The only way to take advantage of the longer reads, although few, is to increase -d to a larger number. Also, because your read length distribution appears to be centered at 5kbp, there will be fewer and fewer reads as you increase -d, so you also get to decrease -t to extract more kmer pairs.

various pairing stats reported in the log are informative, but since you're dealing with a repetitive genome, paired 23-mers may occur many times, with distances inconsistent with what the 5kbp ONT reads can capture, thus inflating the "unsatisfied in distance categories".

I suggest you try -k 21 and maybe even -k 19 with uncorrected reads. If you have corrected reads and it brings the accuracy to 99%, then k26 would work well (even higher, although I never tried).

just curious..what is -l set at...-l 5?

Distances -d 1000, 2500, 5000, 75000, 10000, 15000, 20000, 25000, 30000 adjust -t according to available RAM, but don't exceed 1TB RAM (again, depends on machine specs). It's good that you have an optical map to verify your assemblies, few have that; We often use BUSCO to monitor assembly completeness so you may wish to run to assess gene recovery (although there's still a possibility to induce misassemblies with low stringency param, as you pointed out).

Good luck, Rene

ps.: i also suggest you take a look at RAILS (https://github.com/bcgsc/RAILS), which now supports minimap2 for mapping and scaffolding with ONT reads.

dcopetti commented 5 years ago

Hi Rene,

Thank you for the useful feedback. Even though I left it default, I specified the -l parameter: -l minimum number of links (k-mer pairs) to compute scaffold (default -l 5, optional) I was thinking to tune it to increase stringency if I get some scaffolding.

For now, I am running with raw reads, -k 21 -a 0.3 -t 50 (and 30) -d 1000,2500,5000,7500,10000,15000,25000,30000 (also down to 1000,2500,5000,7500,10000).

The raw reads are from PromethION, 80 +-6% accurate, but I am also preparing FMLRC corrected reads (86 +-9% accurate). We inspected the error corrected reads and saw that they are corrected in patches: some regions have >99% accuracy (regions with unique k-mers in the assembly), other regions are almost untouched (with repeated k-mers in the assembly). Good thing, the error correction seems to be allele-specific. So hopefully the corrected regions will supply some good anchor points (I may bring back up the k value to 23 or so to have more unambiguous k-mers). I am now running with 480 GB of RAM (k=21), hopefully it will be enough. With k=23 the scaffolding step was taking up to 350 GB.

I saw RAILS, but I thought it would not work with ONT reads, I will give it a try, though the mapping step may take a while. I'll keep you posted,

Dario

warrenlr commented 5 years ago

rails will work with ONT reads, but you'll need to set the ont preset flag -x map-ont in minimap2 and set the grace parameter (-g) to 250 or so in cobbler/rails.

this shell scripts automates this for you https://github.com/bcgsc/RAILS/blob/master/versions/RAILS_v1.4.2/runRAILSminimap.sh

keep in mind that gaps are filled with bases from the ont read anchoring best to your draft, and not a consensus, so best to use corrected reads for it.