bioXiaoheng / BallerMixPlus

This repository hosts the software package for BalLeRMix+, an extension of BalLeRMix that can jointly detect recent positive selection and long-term balancing selection.
MIT License

Run time #1

Closed: Thatguy027 closed this issue 2 years ago

Thatguy027 commented 2 years ago

Hello

I got your software up and running on my data set, which has hundreds of thousands of variants. The run time is fairly long, so I am wondering what you think is the best way to reduce it. Some options that immediately come to mind are pruning by LD and parallelization, but I wanted to hear your input before implementing anything.

Thanks so much!

bioXiaoheng commented 2 years ago

Hi!

Great to hear it's up & running! The run time for BalLeRMix+ can indeed be long (and the run potentially memory-expensive), primarily because it's computing over a much wider parameter space than BalLeRMix. Some platforms/environments (e.g. WSL with conda) will also run substantially slower than others.

To speed it up, I think parallelization is the way to go. Because the method accounts for (and gets information from) linkage, I don't recommend LD-pruning, but you can potentially break a chromosome into non-combining contigs and run the software on them separately.

You can potentially also shrink the parameter space for each scan. The arguments --fixX, --fixAlpha, --rangeA, and --listA can be used to customize the search grids for these parameters. Or you can search for balancing selection or positive selection separately using --findBal or --findPos, which is equivalent to restricting the dispersion parameter alpha of the beta-binomial distribution to >1 or <1.

Hope that helps!

Thatguy027 commented 2 years ago

Thanks for the quick response!

What do you mean by "non-combining contigs"? My interpretation is to parallelize across non-overlapping windows of a chromosome, but since the software uses information from linked sites, it seems like it might be better to split the genome into something like 1Mb bins with 50Kb overlap between bins. But maybe I am interpreting what you said incorrectly.

Thanks!

bioXiaoheng commented 2 years ago

Hi! Ah, sorry for the typo; it was meant to be "non-recombining contigs". But yes, your interpretation is correct. You can split the chromosome into multiple stretches of sequence, with ~1 Morgan of overlap between adjacent stretches (which can vary in bp depending on your recombination map).
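
For anyone putting this together, here is a minimal sketch of the windowing idea discussed above. It is not part of BalLeRMix+; the function name `overlapping_windows` and the defaults (1Mb windows, 50Kb overlap, taken from the question above) are illustrative assumptions. It works in bp, so if you have a recombination map you would choose the overlap from genetic distance instead.

```python
def overlapping_windows(chrom_length, window=1_000_000, overlap=50_000):
    """Return half-open (start, end) intervals in bp covering [0, chrom_length).

    Hypothetical helper, not part of BalLeRMix+. Adjacent windows share
    `overlap` bp; the last window is truncated at the chromosome end.
    """
    if chrom_length <= 0:
        raise ValueError("chrom_length must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")

    step = window - overlap  # distance between consecutive window starts
    windows = []
    start = 0
    while True:
        end = min(start + window, chrom_length)
        windows.append((start, end))
        if end >= chrom_length:  # the chromosome end has been reached
            break
        start += step
    return windows
```

Each (start, end) pair can then be handed to a separate job, so the scans run independently of one another.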

Thatguy027 commented 2 years ago

Makes sense - will put together something to parallelize next week. Thanks!

gabrielluishernandez commented 2 years ago

Hello @bioXiaoheng and @Thatguy027, I think we have encountered the same problem as you. We also see a high memory demand, which shows up as a “bus error” exit in our trials. Based on your conversation above, we plan to cut our chromosomes into overlapping windows and run each window individually. We plan to make these windows not from the vcf but from the parsed file (i.e. the output of parse_ballermix_input.py). The helper file would be created from the entire chromosome. Then we would compute the B statistics for each window together with the chromosome-wide helper file. We wanted to check whether your approach of running many windows in parallel worked. Do you think our plan is logical, or are we missing something?

Thank you very much! (p.s. we do not have a recombination map)
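
One way to cut windows from an already parsed, position-sorted table, as proposed above, is to look up the row range for each window by binary search. This is only a sketch: `rows_in_window` is a hypothetical helper, and it assumes the physical positions have already been read into a list sorted in ascending order. The actual column layout of the parsed file is not stated in this thread, so reading the file is left out.

```python
from bisect import bisect_left


def rows_in_window(positions, start, end):
    """Return (lo, hi) such that positions[lo:hi] lie in [start, end).

    Hypothetical helper; `positions` must be sorted in ascending order.
    """
    lo = bisect_left(positions, start)  # first row with position >= start
    hi = bisect_left(positions, end)    # first row with position >= end
    return lo, hi


# Example with made-up positions (bp):
positions = [100, 500, 950, 1000, 1500]
lo, hi = rows_in_window(positions, 0, 1000)
window_rows = positions[lo:hi]  # rows that would be written to this window's file
```

Rows that fall in the overlap between two windows are written to both window files, which is the intended behavior for overlapping windows.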