bcgsc / LINKS

⛓ Long Interval Nucleotide K-mer Scaffolder
GNU General Public License v3.0

No scaffolding in the result. #38

Closed bbista closed 4 years ago

bbista commented 5 years ago

Hello, I have been trying to scaffold a 2.5 Gbp genome using nanopore data. The commands I am using are as follows. None of the iterations seems to scaffold the genome; the corresponding assembly file shows no scaffolding. Can you think of a reason why this is happening? I have around 5X ONT data. I have also included the log file for the first iteration.

`./LINKS -f $INFASTA -s $FOFPATH -b CPI_OG1000 -d 1000 -t 10 -k 15 -l 5 -a 0.3
./LINKS -f CPI_OG1000.scaffolds.fa -s $FOFPATH -b CPI_OG2500 -d 2500 -t 5 -k 15 -l 5 -a 0.3 -r CPI_OG1000.bloom
./LINKS -f CPI_OG2500.scaffolds.fa -s $FOFPATH -b CPI_OG5000 -d 5000 -t 5 -k 15 -l 5 -a 0.3 -r CPI_OG1000.bloom
./LINKS -f CPI_OG5000.scaffolds.fa -s $FOFPATH -b CPI_OG7500 -d 7500 -t 4 -k 15 -l 5 -a 0.3 -r CPI_OG1000.bloom
./LINKS -f CPI_OG7500.scaffolds.fa -s $FOFPATH -b CPI_OG10000 -d 10000 -t 4 -k 15 -l 5 -a 0.3 -r CPI_OG1000.bloom
./LINKS -f CPI_OG10000.scaffolds.fa -s $FOFPATH -b CPI_OG12500 -d 12500 -t 3 -k 15 -l 5 -a 0.3 -r CPI_OG1000.bloom
./LINKS -f CPI_OG12500.scaffolds.fa -s $FOFPATH -b CPI_OG15000 -d 15000 -t 3 -k 15 -l 5 -a 0.3 -r CPI_OG1000.bloom
./LINKS -f CPI_OG15000.scaffolds.fa -s $FOFPATH -b CPI_OG30000 -d 30000 -t 2 -k 15 -l 5 -a 0.3 -r CPI_OG1000.bloom

`
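
For readers who want to reproduce this kind of chained run, the commands above follow a regular pattern: each iteration takes the previous `.scaffolds.fa` as input and reuses the Bloom filter written by the first iteration. A minimal dry-run sketch of that pattern is shown here. It only prints the commands and does not run LINKS; the function name `links_chain` and the file names `draft.fa` and `reads.fof` are hypothetical placeholders, and every flag and value is copied from the commands above rather than from the LINKS documentation.

```shell
# Dry run: print the chained LINKS commands instead of executing them.
# links_chain, draft.fa and reads.fof are hypothetical names for illustration.
links_chain() {
    input=$1      # starting assembly (the original $INFASTA)
    fof=$2        # file of filenames listing the read files (passed to -s)
    prefix=$3     # output prefix, e.g. CPI_OG
    shift 3
    first=""      # base name of the first iteration, whose .bloom file is reused
    for pair in "$@"; do              # each argument is distance:t, e.g. 1000:10
        d=${pair%%:*}                 # text before the colon -> value for -d
        t=${pair##*:}                 # text after the colon  -> value for -t
        base="${prefix}${d}"
        cmd="./LINKS -f $input -s $fof -b $base -d $d -t $t -k 15 -l 5 -a 0.3"
        if [ -n "$first" ]; then
            cmd="$cmd -r ${first}.bloom"   # later iterations reuse the first Bloom filter
        else
            first=$base
        fi
        echo "$cmd"
        input="${base}.scaffolds.fa"       # output of this step feeds the next one
    done
}

links_chain draft.fa reads.fof CPI_OG 1000:10 2500:5 5000:5
```

Printing the commands first makes it easy to check the input/output chaining before committing to a long run; piping the output to `sh` would execute them in order.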

CPI_OG1000.log

warrenlr commented 5 years ago

One of the most obvious things I see from the commands is the low k; I believe it is much too low for a genome of that size. I recommend you explore a range (k < 30). The log indicates a good number of k-mer pairs and normal k-mer pairing stats (although satisfied pairs are on the low side given how much was extracted). Let me know how it goes. Rene

bbista commented 5 years ago

Hello, I increased k to 29 and I do see some scaffolding. It is not as extensive as I'd hoped, but it's there. Thank you for your help. What do you think could be causing the low number of satisfied pairs?

Best, BBIsta

warrenlr commented 5 years ago

Hard to say without glancing at the data. What is the contiguity of your 2.5 Gbp genome draft? What is the N50 length of your 5X ONT data? The coverage is low, and depending on the quality of your draft and of the nanopore reads, LINKS may not find sufficient support. You may wish to relax -a (by increasing it slightly, perhaps to 0.5). @lcoombe may be able to provide further insights, as she recently used low-coverage (<5X) ONT data successfully for scaffolding large conifer genomes (~20 Gbp).

bbista commented 5 years ago

The genome is rather fragmentary with 70,000 scaffolds with N50 of 7072151. The N50 of ONT data is 24778 with mean read length of 10066. From what I gather, the coverage is very uneven.

Thanks for your help. bbista

lcoombe commented 5 years ago

Hi @bbista - Did you try any k values between 15 and 29? There might be a sweet spot in there. For reference, for a recent LINKS run with a 20 Gbp genome and low-coverage ONT data, I used k=23. How is your memory usage? If it isn't too high, you could also try lowering -t.
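
One way to look for that sweet spot is to generate one first-iteration command per k and compare the resulting scaffold counts. The sketch below is a dry run that only prints the commands. The function name `k_sweep`, the file names `draft.fa` and `reads.fof`, and the `_k<value>` output suffix are hypothetical; the remaining flags are copied from the first command earlier in this thread. `-r` is left out on the assumption that a Bloom filter built at one k should not be reused at another.

```shell
# Dry run: print one LINKS command per k so the runs can be compared side by side.
# k_sweep, draft.fa, reads.fof and the _k<value> suffix are hypothetical names.
k_sweep() {
    input=$1      # draft assembly
    fof=$2        # file of filenames listing the read files (passed to -s)
    prefix=$3     # output prefix
    lo=$4         # smallest k to try
    hi=$5         # largest k to try
    step=$6       # increment between k values
    k=$lo
    while [ "$k" -le "$hi" ]; do
        # No -r here: assumed that a Bloom filter is specific to the k it was built with.
        echo "./LINKS -f $input -s $fof -b ${prefix}_k${k} -d 1000 -t 10 -k $k -l 5 -a 0.3"
        k=$((k + step))
    done
}

k_sweep draft.fa reads.fof CPI_OG 15 29 2
```

Giving each run its own `-b` prefix keeps the outputs separate, so the log and scaffold files for each k can be inspected independently.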