Open amackey opened 2 years ago
Hmm, interesting edge case; I hadn't considered that. I guess, for now, you could just use `n_subsample = nrow(data)` to avoid this issue, right?
> I'd guess the right "fix" would be to recalc the dropout curve parameters "one last time" after the model has converged, using all the data
This sounds reasonable to me. Would you be interested in making a PR that checks that we are indeed in this specific case (i.e., no `NA`s in the first `n` rows, but `NA`s in the remaining rows) and then calculates the dropout curves using the full dataset?
Yes, if you use all the data (no sub-sampling), this problem does not occur.
I think another, more consistent alternative would be to re-work the dropout curve parameters as random effects, rather than fixed ML estimates, as I did in my simulation notebook: https://rpubs.com/ajmackey/proDA
Because even if you do not observe any dropouts in a given set of peptides (or proteins), it doesn't mean that the sample lacks some underlying dropout curve that you would observe were you to keep "sampling" more peptides/proteins.
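The underlying-dropout-curve point can be sketched in a few lines of base R. This is a hedged illustration in my own notation (a probit-shaped curve with hypothetical location `rho` and scale `zeta`), not proDA's internal parameterisation:

```r
# Per-sample dropout probability as an inverse-probit curve: the lower the
# observed intensity x, the more likely the value is missing.
dropout_prob <- function(x, rho, zeta) pnorm((rho - x) / zeta)

# A sample whose observed peptides all sit at high intensity may simply never
# show a dropout in a finite subset of rows:
intensities <- c(22, 23, 25, 26)
p <- dropout_prob(intensities, rho = 15, zeta = 1)
# p is essentially zero for every peptide, yet the curve still exists --
# zero observed dropouts leaves rho/zeta unidentified, not absent.
```

Treating `rho`/`zeta` as random effects would shrink such unidentified per-sample curves toward a population-level curve, instead of leaving them as `NA` fixed ML estimates.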
Thanks for your consideration, -Aaron
If the proDA fit is performed using a subset of rows (`n_subsample`), and if that subset (the first `n_subsample` rows of the dataset) has no missing values for a particular sample (column), then `dropout_curves()` can't generate parameter estimates and returns `NA` for the curve coefficients of that sample. When those dropout curve parameters (including the `NA`s) are later used to predict features for the rest of the dataset, and dropouts are then encountered in the samples with `NA` curve coefficients, the `objective_fcn` call seems to go "boom". I'd guess the right "fix" would be to recalculate the dropout curve parameters "one last time" after the model has converged, using all the data (at least in this edge case).
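For concreteness, here is a minimal base-R sketch of the edge case with hypothetical toy data (proDA itself is not called, this only shows the data shape that triggers the problem):

```r
# Column 2 has NAs only beyond the first 50 rows, so a fit restricted to the
# first n_subsample rows sees zero dropouts for that sample.
set.seed(1)
data <- matrix(rnorm(100 * 3, mean = 20), nrow = 100, ncol = 3)
data[81:100, 2] <- NA          # dropouts only in the *later* rows
n_subsample <- 50

subset_rows <- data[seq_len(n_subsample), ]
n_missing_subset <- colSums(is.na(subset_rows))  # column 2: 0 missing values
n_missing_full   <- colSums(is.na(data))         # column 2: 20 missing values
```

With zero missing values in the subset, that sample's dropout curve is unidentifiable from the subsampled rows, which matches the `NA` coefficients described above; using `n_subsample = nrow(data)` sidesteps this because the curves are then always estimated from the full matrix.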