anopheles-genomic-surveillance / anopheles-genomic-surveillance.github.io

Lecture notes for a course on genomic surveillance of malaria vectors, developed by PAMCA and MalariaGEN
https://anopheles-genomic-surveillance.github.io/
Creative Commons Attribution Share Alike 4.0 International

W3M4 - Example PCAs have changed between API versions #211

Open ahernank opened 5 months ago

ahernank commented 5 months ago

All PCAs used as examples differ slightly when the computations are run with the latest API version (v8.8.0): for example, two samples now pull the PCA instead of one, and different samples act as outliers. Although the overall structure of each PCA is similar, individual samples look quite different; see the examples below.

These PCAs were originally run with v4.2.0 & checked with v4.3.0.

As we cannot revert to those versions (due to Python incompatibility in Colab), and the mismatch makes it quite confusing to follow the video and the current explanations, I will pin the API version in this notebook to v7.15.0, where these PCAs look the same as in the videos.
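A minimal sketch of how a notebook cell could check that the pin is in effect. It assumes the API is installed as the `malariagen_data` package (the package name is not stated in this issue), and the helper names `installed_version` and `matches_pin` are hypothetical:

```python
from importlib import metadata

# Version this issue reports as reproducing the PCAs shown in the videos.
PINNED_VERSION = "7.15.0"


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(installed, pinned=PINNED_VERSION):
    """True only when an installed version exists and equals the pin exactly."""
    return installed is not None and installed == pinned


# In Colab the pin itself would typically be a cell such as:
#   %pip install -q malariagen_data==7.15.0
# (package name assumed, see above)
found = installed_version("malariagen_data")
if not matches_pin(found):
    print(f"malariagen_data version {found!r} found; this notebook expects {PINNED_VERSION}")
```

Running the check at the top of the notebook makes a silent upgrade visible before any PCA is computed.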

We should take a deeper look at which v8.x.x change has caused this. We'll have to take into account the authentication updates, if we want to check the API, before W3.
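One possible way to start that deeper look, sketched in plain Python: given per-sample PC coordinates from two API versions, rank samples by how far they moved. The sign of a principal component is arbitrary, so each PC is sign-aligned first. The function names and the toy coordinates are hypothetical; in practice the dictionaries would be built from the PCA dataframes returned by each version.

```python
import math


def align_signs(a, b):
    """Flip columns of b so that each PC points the same way as in a."""
    n_pcs = len(a[0])
    signs = []
    for j in range(n_pcs):
        dot = sum(row_a[j] * row_b[j] for row_a, row_b in zip(a, b))
        signs.append(-1.0 if dot < 0 else 1.0)
    return [[s * x for s, x in zip(signs, row)] for row in b]


def rank_by_shift(coords_a, coords_b):
    """Return (sample_id, distance) pairs, largest shift first.

    coords_a, coords_b: dict mapping sample ID to a list of PC coordinates.
    Only samples present in both runs are compared.
    """
    shared = sorted(set(coords_a) & set(coords_b))
    if not shared:
        return []
    a = [coords_a[s] for s in shared]
    b = align_signs(a, [coords_b[s] for s in shared])
    dists = [math.dist(row_a, row_b) for row_a, row_b in zip(a, b)]
    return sorted(zip(shared, dists), key=lambda pair: pair[1], reverse=True)


# Toy data (made up): PC1 is sign-flipped between runs, and s3 moves on PC2.
old = {"s1": [1.0, 0.0], "s2": [-1.0, 0.5], "s3": [0.0, -0.5]}
new = {"s1": [-1.0, 0.0], "s2": [1.0, 0.5], "s3": [0.0, 0.4]}
ranked = rank_by_shift(old, new)
print(ranked[0][0])  # sample with the largest shift
```

Samples at the top of the ranking are the ones worth inspecting first, for example to see whether they are the new outliers.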

For reference, some examples below:

region = "3L:15,000,000-41,000,000"
n_snps = 100_000
sample_sets = "AG1000G-CF"

pca_df, evr = ag3.pca(region=region, n_snps=n_snps, sample_sets=sample_sets)
[Figures: PCA plots for v8.8.0 and v7.15.0, in that order]

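region = "3L:15,000,000-41,000,000"
n_snps = 100_000
sample_sets = ["AG1000G-BF-A", "AG1000G-BF-B", "AG1000G-BF-C"]
sample_query = "taxon in ['gambiae', 'coluzzii'] and country == 'Burkina Faso'"

pca_df, evr = ag3.pca(region=region, n_snps=n_snps, sample_sets=sample_sets, sample_query=sample_query)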
[Figures: PCA plots for v8.8.0 and v7.15.0, in that order]

ahernank commented 5 months ago

@alimanfoo This patch has been tested in bespin/Colab, and all PCAs are in line with the videos. I will keep this issue open until we have a chance to take a deeper look.