RacimoLab / simGL

Simulate genotype likelihoods from tree sequence data
ISC License
5 stars 1 forks source link

generalization and vectorization of GL computation #18

Closed MoiColl closed 2 years ago

MoiColl commented 2 years ago

In this branch I've generalised sim_allelereadcounts() for user specified "ploidy" instead of assuming diploid individuals. Furthermore, I've also generalised the genotype likelihood computation for ploidy and vectorised it's computation (creating a new function allelereadcounts_to_vGL()). I left the original allelereadcounts_to_GL (which assumes diploid individuals and uses for loops instead of vectorised functions) in case we want to compare both performance. I've also created a couple of functions that can be useful to test the code: read_angsd_gl() which reads the GL output from angsd and formats it into numpy array; msqrd() mean squared root difference between two arrays that can be used to compute differences between two arrays (e.g., for GL calculation comparison).

codecov[bot] commented 2 years ago

Codecov Report

Merging #18 (98e1857) into main (6fda57b) will decrease coverage by 44.79%. The diff coverage is 43.68%.

:exclamation: Current head 98e1857 differs from pull request most recent head d09f1c5. Consider uploading reports for the commit d09f1c5 to get more accurate results

Impacted file tree graph

@@             Coverage Diff             @@
##             main      #18       +/-   ##
===========================================
- Coverage   89.13%   44.33%   -44.80%     
===========================================
  Files           2        2               
  Lines          46      106       +60     
===========================================
+ Hits           41       47        +6     
- Misses          5       59       +54     
Impacted Files Coverage Δ
simGL/simGL.py 43.80% <43.13%> (-45.08%) :arrow_down:
simGL/__init__.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 6fda57b...d09f1c5. Read the comment docs.

MoiColl commented 2 years ago

Thanks! Here are some answers to some of your comments:

1) I didn't realize there was this simGL.egg-info directory until I committed. I believe it stems from the setuptools packaging process. I'll add it to .gitignore

2) I'll change from setup.py to setup.cfg and pyproject.toml. When I started with that file, I saw that you suggested using those two, but Jo Bovy's guide suggests using setup.py. Since I found his explanation was very clear, I decided to stick to it.

3) There is a feature of numpy that I'm using that is only available for version 1.22.1, so I think that for numpy we must specify a minimal version. For the rest, if we don't want to specify a version, you just give its name, right? This should be consistent with the requirements.txt file, right?

4) I specified the python version I'm using, but I guess we can test for different python versions. Can this be set up in the tests?

5) Regarding my email, I was actually concerned with what you say. Thanks for pointing this out.

MoiColl commented 2 years ago

Have you seen that tests for ubuntu-18.04, 3.7 and macos-10.15, 3.7 fail? I think it's due to the numpy version. How can this be solved? does it mean that this package will not be installable for those OS?

grahamgower commented 2 years ago

Thanks! Here are some answers to some of your comments:

  1. I didn't realize there was this simGL.egg-info directory until I committed. I believe it stems from the setuptools packaging process. I'll add it to .gitignore

Cool. Are you familiar with the idea of "squashing" commits? Its possible to remove the folder, commit the change, then "sqaush" those commits so that the commits that added and then removed the folder don't appear in the history. Don't worry about this if you don't want to spend the time figuring out the details (but it's super useful to know).

I've found this online git book to be very nice. https://git-scm.com/book/en/v2

  1. I'll change from setup.py to setup.cfg and pyproject.toml. When I started with that file, I saw that you suggested using those two, but Jo Bovy's guide suggests using setup.py. Since I found his explanation was very clear, I decided to stick to it.

No problem. It doesn't matter so much, I guess. When you go to install a package with a setup.py file, the usual way is/was to run python setup.py something, but this is deprecated and many of the subcommands are starting to break. If a package only has a setup.cfg, that command can't be used, but you can do things like pip install . instead (which should also work if there's just a setup.py file).

  1. There is a feature of numpy that I'm using that is only available for version 1.22.1, so I think that for numpy we must specify a minimal version.

Oh, ok. You've done the right thing then.

For the rest, if we don't want to specify a version, you just give its name, right?

Yep.

This should be consistent with the requirements.txt file, right?

Not necessarily. The requirements.txt file(s) aren't used when the package gets installed (e.g. with pip install). Different projects use requirements.txt files in different ways. I usually use them to mean "here is a collection of versions that are known to work for developing the package", and then this set of package versions can be used in the CI too. Users get the versions defined by the setup.py/setup.cfg, and these are usually more relaxed in terms of version constraints.

  1. I specified the python version I'm using, but I guess we can test for different python versions. Can this be set up in the tests?

Yep, we can change the tests to suit any version(s).

  1. Regarding my email, I was actually concerned with what you say. Thanks for pointing this out.

:+1:

Have you seen that tests for ubuntu-18.04, 3.7 and macos-10.15, 3.7 fail? I think it's due to the numpy version. How can this be solved? does it mean that this package will not be installable for those OS?

Numpy dropped support for Python 3.7 with numpy 1.22. So the options are: (1) simGL also drops support for Python 3.7, or (2) find an alternative for the new numpy feature(s) that simGL is using.

What is the feature that requires numpy 1.22?

MoiColl commented 2 years ago

In the function sim_allelereadcounts() I use the following statement:

arc = rng.multinomial(DP, err[gmbp])

which DP is an numpy array with shape (sites, individuals), err is a two dimension numpy array (4, 4) that corresponds to the allele probability. Example: using the convention that each base pair corresponds to an index position (A = 0, C = 1, G = 2 and T = 3) the first row, err[0, :], corresponds to when the correct allele is "A" and thus, the allele error probabilities are [1-e, e/3, e/3, e/3]. With the same logic err[1, :] = [e/3, 1-e, e/3, e/3] and so on. Finally gmbp is a numpyarray very similar to the ts.genotype_matrix(). The only difference is that 0 = ancestral and 1 = derived alleles, here I encoded the base pairs (A = 0, C = 1, G = 2 and T = 3).

So, basically, what I achieve is to sample from a multinomial distribution the allele read counts with an base-pair specific probability array. Using arrays in both arguments (n and p) for the multinomial() function is only possible in the new numpy version.

If you think there is a vectorized way that we can achieve that, maybe we can modify this function and be able to use older numpy functions.

grahamgower commented 2 years ago

That seems like very reasonable usage of numpy. Lets just drop Python 3.7.