dms-vep / dms-vep-pipeline-3

Pipeline for analyzing deep mutational scanning (DMS) of viral entry proteins (VEPs)

Issue with multiDMS v0.3.3 #158

Closed · bblarsen-sci closed this issue 3 days ago

bblarsen-sci commented 1 week ago

@Haddox @jgallowa07 Hi Hugh and Jared,

@jbloom asked me to place any correspondence about multiDMS here so others can track the issues as well.

I'm attaching the email I sent you both the other day. If you have any additional thoughts since we discussed this on the video call, please respond in this thread. Thanks!


Hi Jared and Hugh,

Thanks again for the multiDMS implementation that we are using for our DMS work in the Bloom lab. While working on paper revisions, I needed to examine the multiDMS implementation of the global epistasis models and have a few questions.

From the CHANGELOG on GitHub and discussions in 'Issues', I understand there were compatibility issues between pandas v2.1 and v2.2, which were fixed in multiDMS v0.3.3. I have been using v0.3.3 for all of my global epistasis fitting.

I noticed that mutations with very low functional scores are sometimes assigned much higher functional effects following global epistasis fitting. I'm attaching a slide comparing the average functional score of single mutations with their functional effects after global epistasis fitting. You'll see several mutations with low functional scores (-4) that have much higher effects inferred from the global epistasis models. For example, I514D is only represented by single mutant variants, all with low functional scores. However, after global epistasis fitting, its functional effect becomes positive, which seems like an unintended problem.

To investigate this issue, I ran the same functional scores through global epistasis fitting using multiDMS v0.3.3, v0.4, and v1.1. While the functional effects are largely consistent between versions, some sites show significant differences. For instance, I514D has a functional effect of 0.14 in multiDMS v0.3.3, but -3.4 in multiDMS v0.4.
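(As a concrete illustration of this kind of cross-version comparison, here is a minimal sketch; the CSV file names and the `mutation`/`effect` column names are hypothetical, not the actual pipeline outputs.)

```python
# Minimal sketch (hypothetical file and column names) of comparing global-epistasis
# effects fit with two different multiDMS versions.
import pandas as pd

v033 = pd.read_csv("effects_multidms_0.3.3.csv")   # columns: mutation, effect
v040 = pd.read_csv("effects_multidms_0.4.0.csv")   # columns: mutation, effect

merged = v033.merge(v040, on="mutation", suffixes=("_v033", "_v040"))
merged["abs_diff"] = (merged["effect_v040"] - merged["effect_v033"]).abs()

# Mutations whose inferred effects differ most between versions (e.g. I514D):
print(merged.sort_values("abs_diff", ascending=False).head(10))
```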

The multiDMS CHANGELOG notes that v0.4 included a bug fix "having to do with pandas groupby.apply 2.2.0," and v1.0.0 "fixes a bug where the phenotype predictions for single mutants did not correctly include the bundle effects."

Could you clarify whether these bugs would impact the global epistasis fitting results and potentially explain my results? The extent of their impact isn't clear to me. If these bugs significantly affect the results, should we be using the most recent version of multiDMS?

Thanks, Brendan

[Attached image: func_score_vs_global_epistasis_example]

jbloom commented 1 week ago

@Haddox @jgallowa07, is there any reason I should not just address all of this by switching our pipeline to use multidms 1.1.0 rather than the 0.X versions?

Can you let me know if there are any concerns or things I should be aware of with that approach? Otherwise I will just do that.

jgallowa07 commented 6 days ago

I'm almost certain the reason 0.4.0 is working better for @bblarsen-sci is the softplus clipping we were discussing last time, but I will need to dig into this more.

I don't necessarily see any reason we couldn't update to 1.1.0. The interface has changed slightly, but the docs should be up-to-date. I'm busy today, but should have time to help with this transition this week before break.

@Haddox Do you have any thoughts on switching to the latest model?

Haddox commented 6 days ago

Thanks, Jared. Switching to 1.1.0 seems fine to me.

One thing that the Bloom lab should be aware of is that the 1.X versions use a custom framework for optimizing model parameters that Will and Jared developed. This update resulted in improved convergence, which is good. But you should also know that Will and Jared are in the process of pivoting to a new framework that is simpler and faster.

So, the only concern I see is that the 1.1.0 optimization framework is somewhat complex and we will be shifting away from it soon. We validated that it works on simulated data and the spike data, and have no reason to believe that it wouldn't work well for you all. But, if 0.4.0 solves Brendan's problems, then that would avoid switching to a temporary framework.

To update the Bloom lab on timelines, Will and Jared's progress on pivoting to the new optimization framework has been stalled recently while they wrap up the replay manuscript. Once that is done, they plan to resume work on multidms. My understanding is that this will happen soon.

Jared -- I'd be curious to hear more about your thoughts on why the softplus clipping is fixing the problem Brendan observed. Let's have a video chat about that this week, and then we can update this thread.

jbloom commented 6 days ago

So @Haddox @jgallowa07, it sounds like for now you would recommend switching to v0.4.0, and then you would let us know when the optimization framework is updated in later versions. Is that a good approach?

jgallowa07 commented 6 days ago

> recommend switching to v0.4.0, and then you would let us know when the optimization framework is updated in later versions. Is that a good approach?

I think I would agree, yes. While in some sense they're all temporary frameworks, 0.4.0 would be the simplest update until we get back on top of this project. 0.3.3 is very similar to 0.4.0; the update was mainly cleanup and patching of pandas warnings.
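(As a small, hypothetical sketch of how the pipeline could guard against running with an unintended multidms version while we settle on a pin; this is not part of the actual pipeline code.)

```python
# Hypothetical sketch: fail fast if the installed multidms version does not match the pin.
from importlib.metadata import version

expected = "0.4.0"  # the version recommended in this thread
installed = version("multidms")
if installed != expected:
    raise RuntimeError(f"multidms {installed} is installed, but the pipeline expects {expected}")
```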

@bblarsen-sci Hugh and I looked into the differences that might have caused the issues in your original email. We believe the cause may be a change in 0.4.0 that removed 64-bit precision from the internal parameters being optimized; in other words, multidms 0.4.0 now uses jax's default 32-bit precision for optimization. Our best guess is that the 32-bit models converge faster, so you're seeing a better fit with the same number of optimization iterations. Without further testing it's hard to say for sure, but we've found no other plausible changes that would have such a stark impact on the fitting results.

If you think it's worth investigating further, it may be interesting to try different parameter-initialization seeds (with 0.3.3) to see whether the set of mis-estimated mutations changes. If it does, that would indicate the model has not fully converged, which could potentially be solved by simply increasing the number of optimization iterations.
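(For concreteness, a sketch of the two knobs mentioned above; the JAX config flag and PRNG keys are real JAX APIs, while the commented-out `fit_global_epistasis` call is just a hypothetical placeholder, not the actual multidms interface.)

```python
# Sketch only: the 64-bit precision flag and seed variation discussed above.
import jax

# JAX defaults to 32-bit floats; multidms <= 0.3.3 reportedly enabled 64-bit precision,
# and 0.4.0 reverted to the 32-bit default. This is the global JAX switch:
jax.config.update("jax_enable_x64", True)

# Trying different parameter-initialization seeds just means handing each fit a different key:
keys = [jax.random.PRNGKey(seed) for seed in range(5)]

# (Hypothetical) refit with each key and compare which mutations look mis-estimated, e.g.:
# for key in keys:
#     effects = fit_global_epistasis(func_scores, prng_key=key)  # placeholder, not multidms API
# If the set of outliers (e.g. I514D) shifts with the seed, the fits likely have not converged.
```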

The new framework Will and I will be working on should solve the convergence issues and generally be faster than the current versions of multidms. We'll be sure to keep the dms-vep-pipeline folks in the loop as soon as we resume that work.

jbloom commented 3 days ago

I merged changes that hopefully address this issue in #162, and am closing the issue for now.