Question about bagging - Githubissues

silvewheat commented 3 years ago

Hi,

I have some question about bagging process.

In the following code: res_loter = lc.loter_local_ancestry([H_ref1, H_ref2, H_ref3, H_ref4, H_ref5], H_query) It seems that res_loter[1] only contained the number of the most frequent vote. Is there any way to get the vote count for each population in the reference list? Because I want to mask some region as ambiguous when the count of the most frequent vote is close to the second vote.
If for each five reference population, the percentage of vote is 20%. What the res_loter[0] will return?
What the meaning of the parameter "rate_vote"?

best, Yudong

gdurif commented 3 years ago

Hi, Thanks for your interest in Loter.

I will try to answer all your questions.

It is not directly possible to get the detail of the vote for each reference population, but here is a chunk of code adapted from Loter source (in particular https://github.com/bcm-uga/Loter/blob/master/python-package/loter/locanc/local_ancestry.py#L212) to compute it:

# load a toy dataset from Loter `data` directory (assuming we are in Loter project root directory)
import os
import numpy as np

H_ceu = np.load(os.path.join("data", "H_ceu.npy"), allow_pickle=True)[:,:5000]
H_yri = np.load(os.path.join("data", "H_yri.npy"), allow_pickle=True)[:,:5000]
H_mex = np.load(os.path.join("data", "H_mex.npy"), allow_pickle=True)[:,:5000]

###### from here is the interesting code

# import
from loter.local_ancestry import boostrap_loter_multiple_pops, mode

# inout paramters
l_H=[H_ceu, H_yri]
h_adm=H_mex
range_lambda=np.arange(1.5, 5.5, 0.5)
nb_bagging=20
num_threads=8

# chunk of code to get bagging vote details
input_loter = (l_H, h_adm)
n, m = h_adm.shape
counts = np.zeros((len(l_H), n, m))
for l in range_lambda:
    res_boostrap = boostrap_loter_multiple_pops(
        *input_loter, lambd=l,
        counts=counts, nbrun=nb_bagging,
        num_threads=num_threads
    )

# counts is a list containing arrays of number of votes per individual (in rows) 
# and per SNP (in columns) for each reference populations
counts

# to get the majority vote
res_tmp = mode(counts)

In case of equality, the index of the first reference population will be return, so it depends on the order in the input [H_ref1, H_ref2, H_ref3, H_ref4, H_ref5]. It is not ideal, we should definitely work on this to better manage the indecisive case.
rate_vote is not relevant in this mode, I corrected the notebook https://github.com/bcm-uga/Loter/blob/master/python-package/Local_Ancestry_Example.ipynb and https://github.com/bcm-uga/Loter/blob/master/python-package/Local_Ancestry_Example.md accordingly.

You can also use the mode default=False of lc.loter_local_ancestry to enable a correction of indecisive SNP (i.e. SNP with a vote lower than rate_vote), that are corrected based on nearest SNP for which the vote is above rate_vote. In this mode, you can choose the level of indecision.

silvewheat commented 3 years ago

Hi,

Thanks for your detailed answers. It is very helpful.

best, Yudong

bcm-uga / Loter

Question about bagging #21