How to adjust the -k and -w for multiple genome synteny analysis?

xxllgg commented 6 days ago

Hi there,

Thank you for developing this amazing tool! I am using ntSynt to detect synteny blocks among multiple plant genomes (>10 species) that belong to one genus. Some assemblies are less contiguous, with a contig N50 of ~100 kb; the others are good, and all of them are chromosome-level. The maximum sequence divergence is ~7% and the minimum is ~1%.

I first used the -d 7 parameter, and the results showed no synteny path for some chromosomes. I then tried two other parameter sets:

- -d 7 -k 25 -w 200 --block_size 500 --indel 50000 --merge 1000000 --w_rounds 100 50
- -d 7 -k 25 -w 10000 --block_size 1000 --indel 50000 --merge 1000000 --w_rounds 5000 1000

These results were even worse than with -d 7. How should I set -w and the other parameters to get synteny paths? Could you give me some advice on how to get a better result in my case (all of the chromosomes should have some synteny blocks)?

Sincerely,
Xiaolong

warrenlr commented 5 days ago

Thank you for your message and interest in ntSynt, Xiaolong.

Initially, I would recommend that you run ntSynt on a pair of conserved chromosome-level assemblies and scale up slowly from there, adjusting the parameters and eventually performing a systematic, broad parameter sweep (while staying within the range prescribed in our preprint).

In our online preprint supplementary data, we posted initial guidelines for comparing genomes with a broad range of sequence divergence (Table S14: https://www.biorxiv.org/content/biorxiv/early/2024/02/13/2024.02.07.579356/DC1/embed/media-1.pdf?download=true).

Range 1% - 10%: --block_size 1000 --indel 50000 --merge 100000 --w_rounds 250 100

That could be a starting point, but of course the characteristics and particulars of the genomes being compared will inform how you set the parameters going forward, and a sweep is recommended.

FYI -- The developer of ntSynt is currently on vacation, returning next week.

lcoombe commented 7 hours ago

Hi Xiaolong,

Indeed, when you start to get more and more input assemblies with higher divergence, it can start to be difficult to detect synteny blocks. I am continuing to look into the best parameterizations of ntSynt for these cases.

A few notes/suggestions:

- Keep in mind that using lower-contiguity assemblies will limit the length of the synteny blocks, since the block lengths are constrained by the least contiguous assembly.
  - It is possible that some of the chromosomes that are missing synteny blocks are very broken up in your less contiguous assemblies.
- k and w are the k-mer size and the window size used for computing minimizers (a selected subset of k-mers), which are used for the multi-genome mapping. These can be good parameters to vary in your experiments.
  - Try lower/higher values of k (for example, in the range of 18-32).
  - Try different values of w, but keep it under 1000 (e.g. 200-1000). Note that the very high values of w that you used in your second test are not recommended, as they will essentially produce a very, very sparse sketch. I have not done any runs with -w higher than 1500.
- You could also think about adjusting the --block_size, --merge and --indel parameters. Lowering the first will output shorter synteny blocks, and increasing the latter two will lead to any synteny blocks that are found being extended.
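
To give a feel for why a very large w produces such a sparse sketch, here is a small, simplified illustration of window minimizers. It is only a sketch under stated assumptions: it hashes k-mers with CRC32 on a random sequence and ignores reverse complements, so it is not how ntSynt computes its minimizers. The function name minimizer_positions and the test sequence are made up for this example.

```python
import random
import zlib


def minimizer_positions(seq, k, w):
    """Return sorted start positions of the window minimizers of seq.

    Simplified illustration: each k-mer is hashed with CRC32, and in every
    window of w consecutive k-mers the leftmost k-mer with the smallest hash
    is selected. Sequences with fewer than w k-mers yield no minimizers.
    """
    hashes = [zlib.crc32(seq[i:i + k].encode()) for i in range(len(seq) - k + 1)]
    chosen = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        # min() returns the first index among ties, i.e. the leftmost minimum
        offset = min(range(w), key=window.__getitem__)
        chosen.add(start + offset)
    return sorted(chosen)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the toy sequence is reproducible
    toy_seq = "".join(rng.choice("ACGT") for _ in range(4000))
    for w in (10, 100, 1000):
        n = len(minimizer_positions(toy_seq, k=25, w=w))
        print(f"w={w}: {n} minimizers selected from {len(toy_seq) - 25 + 1} k-mers")
```

On a random sequence the number of selected k-mers drops roughly in proportion to 1/w, so with w in the thousands only a handful of anchors remain per few thousand bases, which leaves little for the mapping between divergent genomes to work with.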

I also second Rene's suggestion of starting with fewer assemblies to work out which parameters look good and to get a sense of the synteny between the more and less contiguous assemblies, then scaling up from there.
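
As a rough sketch of what such a small sweep could look like, the script below only prints candidate command lines for a pair of assemblies, so they can be reviewed before anything is run. It uses only the options already mentioned in this thread; the file names speciesA.fa and speciesB.fa are hypothetical placeholders, the k and w grids are just example values within the ranges suggested above, and the fixed options are the Table S14 starting point for 1% - 10% divergence. Placing -k and -w after --w_rounds, with the input files last, is an assumption made so that the multi-value --w_rounds option does not run into the file names.

```python
import itertools
import shlex

# Hypothetical input files; replace with a pair of your own assemblies.
ASSEMBLIES = ["speciesA.fa", "speciesB.fa"]

# Fixed options: -d 7 from the original question, the rest from the
# Table S14 starting point quoted earlier in this thread.
BASE_ARGS = [
    "-d", "7",
    "--block_size", "1000",
    "--indel", "50000",
    "--merge", "100000",
    "--w_rounds", "250", "100",
]

# Example grids within the suggested ranges (k: 18-32, w: 200-1000).
K_VALUES = [18, 24, 32]
W_VALUES = [200, 500, 1000]


def sweep_commands(assemblies, k_values, w_values):
    """Build one command line string per (k, w) combination."""
    commands = []
    for k, w in itertools.product(k_values, w_values):
        # -k/-w follow --w_rounds so its value list ends before the input files
        args = ["ntSynt", *BASE_ARGS, "-k", str(k), "-w", str(w), *assemblies]
        commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    for command in sweep_commands(ASSEMBLIES, K_VALUES, W_VALUES):
        print(command)
```

Once a combination gives reasonable blocks for the pair, the same script can be pointed at a longer list of assemblies to scale up.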

Thank you for your interest in ntSynt!
Lauren

bcgsc / ntSynt

How to adjust the -k and -w for multiple genome synteny analysis? #47