Closed dcopetti closed 4 years ago
Hi Dario, thank you for your message and interest in LINKS. What is the est. error rate in your ONT reads? If it's 15%, I suggest decreasing k, especially if your genome size <1Gbp.
-a 0.2 is very stringent. We usually don't go below 0.3.
If RAM isn't limiting, you may also decrease -t (since you only have 5X coverage -- but difficult to guess without knowing your genome size and machine specs).
The only way to take advantage of the longer reads, although few, is to increase -d to a larger number. Also, because your read length distribution appears to be centered at 5kbp, there will be fewer and fewer reads as you increase -d, so you also get to decrease -t to extract more kmer pairs.
various pairing stats reported in the log are informative, but since you're dealing with a repetitive genome, paired 23-mers may occur many times, with distances inconsistent with what the 5kbp ONT reads can capture, thus inflating the "unsatisfied in distance categories".
I suggest you try -k 21 and maybe even -k 19 with uncorrected reads. If you have corrected reads and it brings the accuracy to 99%, then k26 would work well (even higher, although I never tried).
just curious..what is -l set at...-l 5?
Distances -d 1000, 2500, 5000, 75000, 10000, 15000, 20000, 25000, 30000 adjust -t according to available RAM, but don't exceed 1TB RAM (again, depends on machine specs). It's good that you have an optical map to verify your assemblies, few have that; We often use BUSCO to monitor assembly completeness so you may wish to run to assess gene recovery (although there's still a possibility to induce misassemblies with low stringency param, as you pointed out).
Good luck, Rene
ps.: i also suggest you take a look at RAILS (https://github.com/bcgsc/RAILS), which now supports minimap2 for mapping and scaffolding with ONT reads.
Hi Rene,
Thank you for the useful feedback.
Even though I left it default, I specified the -l parameter:
-l minimum number of links (k-mer pairs) to compute scaffold (default -l 5, optional)
I was thinking to tune it to increase stringency if I get some scaffolding.
For now, I am running with raw reads, -k 21 -a 0.3 -t 50
(and 30) -d 1000,2500,5000,7500,10000,15000,25000,30000
(also down to 1000,2500,5000,7500,10000).
The raw reads are from PromethION, 80 +-6% accurate, but I am also preparing FMLRC corrected reads (86 +-9% accurate). We inspected the error corrected reads and saw that they are corrected in patches: some regions have >99% accuracy (regions with unique k-mers in the assembly), other regions are almost untouched (with repeated k-mers in the assembly). Good thing, the error correction seems to be allele-specific. So hopefully the corrected regions will supply some good anchor points (I may bring back up the k value to 23 or so to have more unambiguous k-mers). I am now running with 480 GB of RAM (k=21), hopefully it will be enough. With k=23 the scaffolding step was taking up to 350 GB.
I saw RAILS, but I thought it would not work with ONT reads, I will give it a try, though the mapping step may take a while. I'll keep you posted,
Dario
rails will work with ONT reads, but you'll need to set the ont preset flag -x map-ont in minimap2 and set the grace parameter (-g) to 250 or so in cobbler/rails.
this shell scripts automates this for you
https://github.com/bcgsc/RAILS/blob/master/versions/RAILS_v1.4.2/runRAILSminimap.sh
keep in mind that gaps are filled with bases from the ont read anchoring best to your draft, and not a consensus, so best to use corrected reads for it.
Hello, I am running LINKS to scaffold a repeat-rich plant genome (5 Gb, N50 197 kb, N70 71 kb) with ONT data (80 Gb, 16x, all above 5 kb reads). I trust the assembly, since it has been aligned to an optical map and the vast majority of the sequences align continuously to the restriction pattern.
I run it as
LINKS.pl -f genome_v2.2.fa -s input_ONT_b123.fofn -k 23 -d 4000 -t 100 -a 0.2 -z 2000 -r genome.bloom -b base
with a parameter sweep (-d from 1000 to 4000, -t 100 and 50, 8 combinations for now) and I am getting the first results - very little increase in N values (199 and 72 kb). I want to be strict in the scaffolding to avoid linking just by the presence of one repeat, so I decreased -a. I wonder if the output files contain some information I could use to tune the parameters and find a good combination.The pairing_issues file has very large negative values:
and this is what the log reports:
which values should I maximize or look after? I don't expect to increase the numbers by much, but if possible I would like to take advantage of the few >30 kb reads I have to put some pieces together. Would it help to use error corrected (with FMLRC) reads? Thanks,
Dario