GPfates bifurcation statistics on known pseudotime and posterior probabilities

Teichlab / GPfates

MIT License

19 stars 4 forks source link

GPfates bifurcation statistics on known pseudotime and posterior probabilities #6

Open koenvandenberge opened 5 years ago

koenvandenberge commented 5 years ago

Suppose that I would like to assess whether gene expression differs between two lineages of a trajectory, when I know the true pseudotime and posterior probabilities for each cell in the dataset. Is this possible within GPfates?

Thank you in advance.

zji90 commented 4 years ago

I would like to know whether GPfates can identify differential genes between lineages as well.

vals commented 4 years ago

If I understand correctly what you are asking, I think this is what we did with the 'bifurcation_statistics'. For each gene, we created one 'overlapping mixture of Gaussian processes' (OMGP) model with the fitted posterior probabilities, as well as an OMGP model where posterior probabilities were set to equal for each branch (lineage?). By comparing the likelihood between these two models we could evaluate how much better the data is modeled by including the lineage information.

The code we used for this is here: https://github.com/Teichlab/GPfates/blob/master/GPfates/gp_utils.py#L69

I guess this is what you mean by gene expression differing or being differential between lineages?

zji90 commented 4 years ago

Thank you for your response. I think this is what I meant. I wonder how to extract the differential gene results after I call m.calculate_bifurcation_statistics()?

vals commented 4 years ago

Hm for some reason that method (m.calculate_bifurcation_statistics()) doesn't seem to actually capture the output from the bifurcation_statistics() function.

If you run the function instead, as in

diff_stats = bifurcation_statistics(m.fate_model, m.e)

then the diff_stats DataFrame will have a column D which is the likelihood ratio between the alternative models. So you can rank genes according to this.

zji90 commented 4 years ago

Thank you for the helpful comment. I am able to get the likelihood ratio now. Just wondering whether it is possible to get p-values for the differential gene test as well?

vals commented 4 years ago

In theory you could pass the likelihood ratio (D) to a chi-square distribution. But it is unclear what the difference in degrees of freedom is between these models.

In the paper where we used this we created a kind of null distribution based on shuffling expression values over pseudotime. In the DataFrame returned from bifurcation_statistics() there is a column shuff_D. This column can represent a "null distribution", so if you compare every gene's D value with the distribution of shuff_D you can estimate the false discovery rate. Then you can make a threshold based on FDR similar to how people typically use a threshold on nominal p-values.

zji90 commented 4 years ago

Thank you! One last question, is it possible to identify differential genes within one branch (i.e. genes whose expressions do not change over pseudotime)?

vals commented 4 years ago

Genes were the mean is higher in one linage vs the another linage without taking the pseudotime into account? There might be a way to do this, but it will be faster and easier to just use some linear model I think.