bowang-lab / scGPT

https://scgpt.readthedocs.io/en/latest/
MIT License
919 stars · 167 forks

Finetuning/replicating reverse perturbation prediction #87

Open nmahieu-empresstx opened 9 months ago

nmahieu-empresstx commented 9 months ago

Hi!

I am working to replicate your perturbation prediction result. A little clarity as to the method would be helpful.

For this task, was a model fine-tuned on a distinct "reverse perturbation prediction" objective? Or, alternatively, was the model fine-tuned for forward perturbation prediction used, with the perturbed genes identified by Euclidean distance?

In either case, how were the KO genes masked in the RNA abundances? Since the KO gene is significantly depleted, a naïve algorithm could simply identify the most highly depleted gene(s) as the perturbed ones.

Depleted genes appear as outliers, as shown below. [image]
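For concreteness, the naïve algorithm I have in mind could be sketched like this (toy numbers and hypothetical gene names, not your actual data; it just ranks genes by depletion relative to control):

```python
import numpy as np

def most_depleted_genes(perturbed_mean, control_mean, gene_names, top_k=3):
    """Rank genes by log fold change of mean expression in the
    perturbed condition relative to control (most depleted first)."""
    # log fold change with a pseudocount; negative values = depleted
    lfc = np.log2(perturbed_mean + 1.0) - np.log2(control_mean + 1.0)
    order = np.argsort(lfc)  # ascending: strongest depletion first
    return [(gene_names[i], float(lfc[i])) for i in order[:top_k]]

# Toy example: "g2" is knocked down relative to control
genes = ["g0", "g1", "g2", "g3"]
control = np.array([10.0, 5.0, 8.0, 2.0])
perturbed = np.array([9.0, 5.0, 0.5, 2.0])
print(most_depleted_genes(perturbed, control, genes, top_k=1))  # → [("g2", ...)]
```

If the KO gene's own transcript is not masked, this kind of ranking would likely recover it directly in many cases, which is the performance-interpretation concern above.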

subercui commented 9 months ago

Hi @nmahieu-empresstx, thanks! I am trying to understand the question. Could you explain the figure you drew here in more detail?

nmahieu-empresstx commented 9 months ago

Thanks for the quick reply. It's a two-part question.

  1. Observing the genes targeted by the CRISPR guide makes it trivial to identify which gene was perturbed. This is illustrated in the graphic, where YDJC transcript abundance is on the y-axis and control transcript abundance is on the x-axis; the YDJC transcript is highlighted in red. Were these values masked when training/evaluating on the reverse prediction task? The interpretation of performance differs if this information is relied on to predict the perturbed gene. [image]

  2. Was the reverse perturbation task fine-tuned differently from the forward perturbation prediction task? (e.g., with different gene masking / gene flags, such that at prediction time the PyTorch model takes transcript changes as input rather than a perturbed gene)

subercui commented 9 months ago

Hi @nmahieu-empresstx, thank you for the explanation. To answer your questions, let me first explain the rationale behind the reverse perturbation task.

The ultimate goal of this task is to address an ambitious question: given observed cell states that developed or transformed from an origin state, is it possible to predict which stimulation drove that transition? We think this question can be super interesting. For example, imagine you observed some differentiated cell states; a perfect predictor trained on iPSC cells might tell you which genetic factors could induce that differentiation. Similar examples can be imagined for finding genetic targets for disease treatment.

Based on this goal, we used perturbation data and its experimental setting to illustrate the process. Note that the goal has a much broader scope, but perturbation is one scenario that can be demonstrated now, thanks to the existence of public datasets. In our manuscript, we described this "reverse perturbation" task in a more conservative way because we think that is more appropriate, although we did have the ultimate goal above in mind when introducing it.

So, regarding your questions:

  1. If I understand correctly, you are suggesting that one might detect the CRISPR-perturbed gene more directly by comparing expression levels between conditions. I think that specific approach can be challenging, since there can be many more differentially expressed genes after perturbation, especially when perturbing driver genes involved in important processes such as differentiation and the cell cycle. It would therefore be difficult to tell which of the many differentially expressed genes were the original CRISPR targets. That said, even though I question that particular strategy, I can imagine that other straightforward prediction methods might exist. The more important point I want to highlight is, again, the goal of this task as stated above, which is expected to go beyond CRISPR-based or other perturbations in the long run. That is why I think the approach we proposed, which trains a model to generally capture the transition process, can be quite meaningful.

  2. The model was fine-tuned in the same way as for the "forward" perturbation prediction, using a subset of the dataset as illustrated in Figure 3F. The reverse perturbation task then uses the model differently: the resulting cell states of all possible perturbations are predicted by the fine-tuned model, and an actual sequenced cell state queries all the predicted cell states via nearest-neighbor search, so that the retrieved neighbors indicate the likely origin perturbations.
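To sketch that retrieval step (the `model_predict` callable here is just a stand-in for the fine-tuned forward model, not the actual scGPT API, and the gene/perturbation names are made up):

```python
import numpy as np

def query_origin_perturbations(model_predict, candidates, ctrl_expr, observed_expr, k=3):
    """For each candidate perturbation, predict the resulting cell state
    with the fine-tuned forward model, then retrieve the k predicted
    states nearest (Euclidean) to the observed cell state. The
    perturbations behind the retrieved neighbors are the candidate origins."""
    predicted = np.stack([model_predict(ctrl_expr, p) for p in candidates])
    dists = np.linalg.norm(predicted - observed_expr, axis=1)
    nearest = np.argsort(dists)[:k]
    return [candidates[i] for i in nearest]

# Toy stand-in for the forward model: a knockout zeroes out the target gene
def toy_predict(ctrl_expr, perturbation):
    out = ctrl_expr.copy()
    out[int(perturbation.split("_g")[1])] = 0.0
    return out

candidates = ["KO_g0", "KO_g1", "KO_g2"]
ctrl = np.ones(4)
observed = np.array([1.0, 0.1, 1.0, 1.0])  # looks like g1 was knocked out
print(query_origin_perturbations(toy_predict, candidates, ctrl, observed, k=1))
# → ['KO_g1']
```

In the real pipeline the forward model would be the fine-tuned scGPT and the candidate set would cover the perturbation space of interest; the NN query is the same idea.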

I hope these explanations make sense to you. In the meantime, we have recently upgraded the reverse perturbation pipeline, and we will release a tutorial with the updated pipeline and results soon.

nmahieu-empresstx commented 9 months ago

@subercui - First, thank you very much for taking the time to clarify and answer my questions; it's quite helpful. Second, I want to emphasize that I love the work: the model is very exciting and I have enjoyed digging in. No criticism was implied.

I do agree that discovering the stimulus driving a transition in general is an ambitious and laudable goal! Hence my question: how good are we at reverse perturbation prediction when there is no large, obvious change pointing to the perturbation? Consider the question a compliment to your work!

So on point 1 we agree: the goal is to find the stimulus independently of an obvious gene knockdown. In combination with point 2, you have clarified my question; we condition on the observed knockdown. Still, your results from perturbation prediction have much more signal to match against. Exciting!

Also on point 1, could you clarify "which tries to train a model to generally capture the transition process, can be quite meaningful"? Do you mean the foundation model serves as the general model here?

Point 2. Perfect! Thanks for clarifying.

Nathaniel