eriqande / rubias

identifying and reducing bias in hierarchical GSI

Z values and PofZ #30

Open ronahuel opened 3 years ago

ronahuel commented 3 years ago

Hi Eric and developers. Thanks for the useful program you have made available to us.

I am using the program to evaluate the assignment of certain Atlantic salmon individuals to reference groups.

I used the following command to estimate the mixing proportions:

mix_est <- infer_mixture(reference = referencia, mixture = libres, gen_start_col = 5, method = "PB", reps = 50000, burn_in = 5000)

The program works quite well on my computer and I have no memory problems running it. However, the results differ noticeably from the analyses I did with STRUCTURE and DAPC. My data set consists of 456 reference individuals and 80 individuals whose origin I want to evaluate. As I read in the tutorial, the Z-scores indicate how well the individuals being evaluated fit the reference individuals. This is where my question arises:

[image: distribution of the Z-scores compared with the normal curve]

As you can see, the Z-scores don't quite fit the normal curve. The lowest individual Z-scores and their PofZ values (Z-score / PofZ) were:

-22.76370825 / 1
-21.69095868 / 1
-21.65942465 / 1
-13.96500187 / 1

Should I then assume that these samples do not really come from the populations to which they were assigned? I also have other individuals with a PofZ of 1 but with less extreme Z-scores.

I would expect some difference between the Z-score distribution and a normal distribution, since the reference populations are probably purer than the commercial lines from which the analyzed samples (probably) came. However, I do not find it credible that individuals are assigned to certain collections with a PofZ = 1.

Is it possible to define a threshold based on Z-scores to decide which individuals can be confidently assigned to the reference? Perhaps Z-scores > 5?
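
A cutoff of this kind could be applied roughly as follows. This is only a sketch in base R: the data frame is a made-up stand-in for `mix_est$indiv_posteriors` (using its `indiv`, `collection`, `PofZ` and `z_score` columns), and `z_cutoff` is a hypothetical value, not a recommended one. Because the extreme Z-scores reported above are negative, the sketch flags top assignments whose Z-score falls below a negative cutoff.

```r
# Toy stand-in for mix_est$indiv_posteriors; all values are invented for illustration.
post <- data.frame(
  indiv      = rep(c("fish_A", "fish_B", "fish_C"), each = 2),
  collection = rep(c("pop_1", "pop_2"), times = 3),
  PofZ       = c(1.00, 0.00, 0.85, 0.15, 0.02, 0.98),
  z_score    = c(-22.8, -30.1, -0.4, -2.9, -6.3, -1.1),
  stringsAsFactors = FALSE
)

# For each individual, keep only the row with the highest PofZ
# (sort by individual, then by decreasing PofZ, and take the first row per individual).
ord      <- order(post$indiv, -post$PofZ)
map_rows <- post[ord, ]
map_rows <- map_rows[!duplicated(map_rows$indiv), ]

# Hypothetical cutoff: flag top assignments whose Z-score is more negative than this.
z_cutoff <- -5
map_rows$poor_fit <- map_rows$z_score < z_cutoff

print(map_rows[, c("indiv", "collection", "PofZ", "z_score", "poor_fit")])
```

With real output, `post` would be replaced by `mix_est$indiv_posteriors`. In the toy data, fish_A is the only individual flagged, even though its PofZ is 1.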

Any help or point of view would be very helpful and appreciated.

eriqande commented 3 years ago

Hi Ronahuel,

From the looks of it, and by your description, it sounds to me like the individuals you are trying to place might be admixed between different reference populations. If that is the case, then rubias, and the conditional genetic stock ID (GSI) model underlying it, would not be entirely appropriate. The GSI model assumes individuals are purely of one reference group or another, and there is no allowance for admixed individuals. If you have many admixed individuals, then STRUCTURE provides a more appropriate model to apply to the situation.
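
Because the model has to place every fish in exactly one reference group, the posterior is normalized over the reference collections only. A toy calculation can show how a fish may get a PofZ of essentially 1 even when it fits no collection well. The log-likelihoods and the equal prior below are invented for illustration, and this is a simplified sketch of the normalization step, not the actual rubias internals.

```r
# Hypothetical log-likelihoods of one mixture fish under three reference collections.
# All three are poor in absolute terms, but pop_1 is far less poor than the others.
loglik <- c(pop_1 = -310, pop_2 = -345, pop_3 = -352)

# Assumed equal mixing proportions, purely for illustration.
prior <- c(pop_1 = 1 / 3, pop_2 = 1 / 3, pop_3 = 1 / 3)

# Posterior is proportional to prior * likelihood; subtract the maximum
# on the log scale before exponentiating to avoid numerical underflow.
log_post  <- loglik + log(prior)
posterior <- exp(log_post - max(log_post))
posterior <- posterior / sum(posterior)

# pop_1 receives essentially all of the posterior mass.
print(round(posterior, 6))
```

The posterior only ranks the collections against each other, so it says nothing about whether the best one is a good fit in absolute terms. The Z-score is the quantity meant to reveal that.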

eric