Optimisation of the Kmer value (#) query

cement-head commented 1 year ago

Not an issue, but rather seeking clarification for Kmer choices.

Other than testing empirically, is there a method to determine the best Kmer value for polishing? I have a large genome (6.2 Gbp) with a lot of repeats, so based on your paper, I would assume that a Kmer value of 67 (-k 67) would be better than the default of 37. Does this reasoning seem sound? If so, why did you (based on Figure 4 and Table 2) use a Kmer of 63 for polishing the thale cress data, rather than 67? Was it time/memory considerations?

alguoo314 commented 1 year ago

Table 2 referred to fixing simulated errors in an A. thaliana assembly using simulated reads, and Figure 4 referred to polishing real A. thaliana data. k=37 fixed the most errors in the real data, whereas k=63 worked best on the simulated data. You can also see this information in Table 1. k=37 was also used in the human genome polishing trials. So for your real genome, I would recommend a number closer to 37 than 63.

Maybe @alekseyzimin can provide more insights on selecting a good kmer size.

alekseyzimin commented 1 year ago

In my experience, the longer the k-mer the better for large genomes, but longer k will use more RAM for the jellyfish database.
For A. thaliana data, k=63 worked best for simulated errors, and k=37 worked best for the real data. But if you look at Figure 4, you will see that k=65 or 67 worked almost as well as k=37. A. thaliana is not very repetitive, and k=37 is "unique enough". The variation between different values of k was relatively minor. In our human experiments we tried k up to 47; the limitation was memory use, but results improved going from 37 to 47.
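To illustrate the "unique enough" point, here is a minimal Python sketch that measures what fraction of k-mer positions in a toy sequence carry a k-mer seen exactly once. Everything in it is an assumption made for illustration: the function names, the toy genome (random spacers with one exact 50 bp repeat inserted 20 times), and the choice to ignore reverse complements. It is not how JASPER or jellyfish count k-mers, and it says nothing about memory use.

```python
import random
from collections import Counter


def unique_kmer_fraction(seq: str, k: int) -> float:
    """Fraction of k-mer positions in seq whose k-mer occurs exactly once."""
    n = len(seq) - k + 1
    if n <= 0:
        raise ValueError("k is longer than the sequence")
    counts = Counter(seq[i:i + k] for i in range(n))
    # A k-mer with count 1 occupies exactly one position.
    unique_positions = sum(1 for c in counts.values() if c == 1)
    return unique_positions / n


def toy_genome(repeat_len: int = 50, copies: int = 20,
               spacer_len: int = 200, seed: int = 0) -> str:
    """Hypothetical test sequence: random spacers, each followed by the same repeat."""
    rng = random.Random(seed)

    def rand_seq(length: int) -> str:
        return "".join(rng.choice("ACGT") for _ in range(length))

    repeat = rand_seq(repeat_len)
    return "".join(rand_seq(spacer_len) + repeat for _ in range(copies))


if __name__ == "__main__":
    genome = toy_genome()
    for k in (21, 37, 67):
        print(k, round(unique_kmer_fraction(genome, k), 3))
```

In this toy case, any k longer than the 50 bp repeat has to reach into the random flanking sequence, so nearly every k-mer becomes unique. Real repeats are longer and inexact, so the useful k for a given genome still has to be checked against the data.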

cement-head commented 1 year ago

Okay, thanks! This is a frog genome and has a lot of repeats, likely more than the human. So, I'm going to try 67 and see what happens. I have 2 TB ECC RAM so I'm hoping that's enough memory.

alekseyzimin commented 1 year ago

Great! With 2TB RAM, you can probably afford going up to 77. My only advice here is to make sure that your system is configured with no swap file, which is the recommended setup for high-memory servers. This is the setup we are using here at Hopkins for our 1TB+ RAM computers. Otherwise the system may start swapping, even when you are not yet using all available RAM, and the polishing will run very slowly. Another suggestion, if you have root access or can ask your system admins to change settings, is to change the overcommit settings in /proc/sys/vm by running

echo 2 > /proc/sys/vm/overcommit_memory
echo 97 > /proc/sys/vm/overcommit_ratio

on system boot.
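Before starting a long polishing run, it may help to confirm what the machine is currently set to. The sketch below is a hypothetical read-only check written in Python (the function names are made up for illustration); it assumes a Linux system where /proc/meminfo and the two /proc/sys/vm files mentioned above are readable, and it changes nothing.

```python
from pathlib import Path


def swap_total_kb(meminfo_text: str) -> int:
    """Return the SwapTotal value (in kB) from the text of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            return int(line.split()[1])
    raise ValueError("SwapTotal not found")


def read_int(path: str) -> int:
    """Read a single integer from a /proc file."""
    return int(Path(path).read_text().strip())


def report() -> None:
    """Print the current swap size and overcommit settings."""
    swap = swap_total_kb(Path("/proc/meminfo").read_text())
    mode = read_int("/proc/sys/vm/overcommit_memory")
    ratio = read_int("/proc/sys/vm/overcommit_ratio")
    note = "no swap configured" if swap == 0 else "swap is enabled"
    print(f"SwapTotal: {swap} kB ({note})")
    print(f"overcommit_memory={mode}, overcommit_ratio={ratio}")


if __name__ == "__main__":
    report()
```

With overcommit_memory set to 2, the kernel refuses allocations beyond swap plus overcommit_ratio percent of RAM, so a job that asks for too much fails at allocation time rather than pushing the machine into swap.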

cement-head commented 1 year ago

Okay, done! I also ran the latest BUSCO (5.4.1) with the tetrapoda_odb10 dataset.

82% Complete BUSCOs!!!! This is an astonishingly amazing result. With all other polishers, including ARROW, I could only get to 58% complete BUSCOs. I used 67 for the kmer and did 3 rounds of polishing with JASPER, after two rounds with POLCA.
