apriha / snps

tools for reading, writing, merging, and remapping SNPs
BSD 3-Clause "New" or "Revised" License

Basic ancestry functionality #143

Closed arvkevi closed 2 years ago

arvkevi commented 3 years ago

This PR adds basic functionality to predict genetic ancestry using ezancestry. @apriha please feel free to make suggestions/direct edits as you see fit, this is just to get the concept moving forward. Here's how a user could utilize this functionality from snps.

[Screenshot of example usage: Screen Shot 2021-09-20 at 10 31 39 PM]

codecov[bot] commented 2 years ago

Codecov Report

Merging #143 (d29743f) into develop (8ca5d75) will increase coverage by 0.07%. The diff coverage is 100.00%.

@@             Coverage Diff             @@
##           develop     #143      +/-   ##
===========================================
+ Coverage    93.44%   93.52%   +0.07%     
===========================================
  Files            8        8              
  Lines         1540     1559      +19     
  Branches       273      274       +1     
===========================================
+ Hits          1439     1458      +19     
  Misses          54       54              
  Partials        47       47              

Impacted Files | Coverage Δ
src/snps/snps.py | 95.94% <100.00%> (+0.14%) ↑

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 8ca5d75...d29743f.

apriha commented 2 years ago

@arvkevi I think we're close to getting the initial tests working. However, pip is taking a long time to search for compatible packages. I can fix this via a two-step install, e.g.:

pip install ezancestry
pip install .

However, that defeats the simplicity of just pip install .[ezancestry]. Any ideas on how this can be improved?

arvkevi commented 2 years ago

Thank you for hacking on this PR, Andrew! I cut a release to ezancestry that supports 3.7, which is why I triggered the build yesterday w/ an empty commit. I am confused as to why this is taking so long to resolve dependencies. I'll spend some more time with it.

apriha commented 2 years ago

Hi Kevin, same here. FYI, I tried running the test-extras job locally via act, and dependencies were resolved quickly and without any issues...

apriha commented 2 years ago

Hey @arvkevi, turns out pip couldn't find the correct version of snps since the tag version history was not available after checkout; 4582b51 fixed it! Pretty close now... looks like there are some issues with finding the ezancestry data.

apriha commented 2 years ago

I did some more testing with act and listed the contents of the equivalent of the /home/runner/.ezancestry/data/ directory... It looks like the ezancestry Python code is looking up filenames with a different case to what's actually on the filesystem; e.g., aisnps/Kidd.AISNP.txt (Python) vs aisnps/KIDD.AISNP.txt (actual). Same for models/knn.PCA.Kidd.population.bin and models/knn.PCA.Kidd.superpopulation.bin.

Hopefully that helps speed the troubleshooting along. 🙂
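
For anyone who wants to reproduce the case mismatch described above, here is a minimal sketch in plain Python. It is not part of ezancestry or snps: the helper name `find_case_mismatches` is hypothetical, and the temporary directory only stands in for the real data directory. It walks a directory and reports expected relative paths that exist on disk only under a different case.

```python
import os
import tempfile


def find_case_mismatches(data_dir, expected_paths):
    """Return {expected: actual} for files found only under a different case.

    Hypothetical helper for illustration; not an ezancestry or snps API.
    """
    # Map the lowercased relative path of every file on disk to its real path.
    actual_by_lower = {}
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), data_dir)
            rel = rel.replace(os.sep, "/")
            actual_by_lower[rel.lower()] = rel

    mismatches = {}
    for expected in expected_paths:
        actual = actual_by_lower.get(expected.lower())
        # Report only paths that exist but differ in case from what is expected.
        if actual is not None and actual != expected:
            mismatches[expected] = actual
    return mismatches


# Stand-in for the data directory: the file on disk is upper case,
# while the (assumed) lookup name is mixed case.
with tempfile.TemporaryDirectory() as data_dir:
    os.makedirs(os.path.join(data_dir, "aisnps"))
    open(os.path.join(data_dir, "aisnps", "KIDD.AISNP.txt"), "w").close()
    result = find_case_mismatches(data_dir, ["aisnps/Kidd.AISNP.txt"])

print(result)
```

Because the comparison is done on the names returned by the directory listing, the check behaves the same on case-sensitive and case-insensitive filesystems.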

arvkevi commented 2 years ago

Thanks, Andrew. I will cut a new release this weekend with a fix for the filenames. I'll also set up my own CI in ezancestry so we don't languish on this branch. Thanks for being so patient with this.

arvkevi commented 2 years ago

I think I fixed the issue with the new release. The new errors are likely due to the newly trained models in the release. We can probably just update the assert value.

apriha commented 2 years ago

I think we're good @arvkevi! What are your thoughts on also exposing the raw predictions dataframe?

arvkevi commented 2 years ago

@apriha I think that's a good idea. I will put together some documentation with column descriptions.

arvkevi commented 2 years ago

I'll leave this here and feel free to modify and incorporate wherever you like.

Populations described below are defined here.

'component1', 'component2', 'component3': The coordinates of the sample in the dimensionality-reduced component space. These can be used as (x, y, z) coordinates for plotting in a 3D scatter plot.

'predicted_population_population': The population with the highest predicted probability for the sample.

'ACB', 'ASW', 'BEB', 'CDX', 'CEU', 'CHB', 'CHS', 'CLM', 'ESN', 'FIN', 'GBR', 'GIH', 'GWD', 'IBS', 'ITU', 'JPT', 'KHV', 'LWK', 'MSL', 'MXL', 'PEL', 'PJL', 'PUR', 'STU', 'TSI', 'YRI': Predicted probabilities for each of the populations. These sum to 1.0.

'predicted_population_superpopulation': The superpopulation (continental) with the highest predicted probability for the sample.

'AFR', 'AMR', 'EAS', 'EUR', 'SAS': Predicted probabilities for each of the superpopulations. These sum to 1.0.

'population_description', 'superpopulation_name': Descriptive names of the populations and superpopulations.
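
To make the column descriptions concrete, here is a rough sketch of how one row of the predictions could be sanity-checked. It works on a plain dict (for example, what `row.to_dict()` gives for a pandas row) so it needs no third-party packages. The helper name `check_prediction_row` and the probability values are made up for illustration, and it assumes the predicted label is simply the column with the highest probability.

```python
import math

SUPERPOPULATIONS = ["AFR", "AMR", "EAS", "EUR", "SAS"]


def check_prediction_row(row, labels, predicted_column, tol=1e-6):
    """Check one predictions row and return its highest-probability label.

    Hypothetical helper for illustration; not part of snps or ezancestry.
    """
    probs = {label: float(row[label]) for label in labels}

    # The per-label probabilities are documented to sum to 1.0.
    total = sum(probs.values())
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError(f"probabilities sum to {total}, expected 1.0")

    # Assumption: the predicted label is the label with the max probability.
    top = max(probs, key=probs.get)
    if row[predicted_column] != top:
        raise ValueError(f"{predicted_column}={row[predicted_column]!r}, but max is {top!r}")
    return top


# Illustrative values only; these are not real model output.
example_row = {
    "predicted_population_superpopulation": "EUR",
    "AFR": 0.02,
    "AMR": 0.08,
    "EAS": 0.01,
    "EUR": 0.85,
    "SAS": 0.04,
}

top_label = check_prediction_row(
    example_row, SUPERPOPULATIONS, "predicted_population_superpopulation"
)
print(top_label)
```

The same function could be pointed at the 26 population columns with 'predicted_population_population' as the predicted column.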

apriha commented 2 years ago

@arvkevi updates incorporated. Please let me know what you think... If you agree, I think it's ready to merge. Thanks again for developing this awesome capability!

arvkevi commented 2 years ago

LGTM @apriha, thank you for all your hard work on this PR!