Open amackey opened 2 years ago
Hmm, interesting edge case; I hadn't considered that. I guess, for now, you could just use `n_subsample = nrow(data)` to avoid this issue, right?
> I'd guess the right "fix" would be to recalc the dropout curve parameters "one last time" after the model has converged, using all the data
This sounds reasonable to me. Would you be interested in making a PR that checks that we are indeed in this specific case (i.e., no `NA`s in the first `n` rows, but `NA`s in the remaining rows) and then calculates the dropout curves using the full dataset?
Yes, if you use all the data (no sub-sampling), this problem does not occur.
I think another, more consistent alternative would be to re-work the dropout curve parameters as random effects, rather than fixed ML estimates, as I did in my simulation notebook: https://rpubs.com/ajmackey/proDA
Because even if you do not observe any dropouts in a given set of peptides (or proteins), it doesn't mean that the sample lacks some underlying dropout curve that you would observe were you to keep "sampling" more peptides/proteins.
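The underlying-dropout-curve point can be sketched in a few lines of base R. This is a hedged illustration in my own notation (a probit-shaped curve with hypothetical location `rho` and scale `zeta`), not proDA's internal parameterisation:

```r
# Per-sample dropout probability as an inverse-probit curve: the lower the
# observed intensity x, the more likely the value is missing.
dropout_prob <- function(x, rho, zeta) pnorm((rho - x) / zeta)

# A sample whose observed peptides all sit at high intensity may simply never
# show a dropout in a finite subset of rows:
intensities <- c(22, 23, 25, 26)
p <- dropout_prob(intensities, rho = 15, zeta = 1)
# p is essentially zero for every peptide, yet the curve still exists --
# zero observed dropouts leaves rho/zeta unidentified, not absent.
```

Treating `rho`/`zeta` as random effects would shrink such unidentified per-sample curves toward a population-level curve, instead of leaving them as `NA` fixed ML estimates.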
Thanks for your consideration, -Aaron
If the proDA fit is performed using a subset of rows (`n_subsample`), and if that subset (the first `n_subsample` rows of the dataset) has no missing values for a particular sample (column), then `dropout_curves()` can't generate parameter estimates and returns `NA` for the curve coefficients of that sample. When those dropout curve parameters (including the `NA`s) are later used to predict features for the rest of the dataset, and dropouts are then encountered in the samples with `NA` curve coefficients, the `objective_fcn` call seems to go "boom". I'd guess the right "fix" would be to recalculate the dropout curve parameters "one last time" after the model has converged, using all the data (at least in this edge case).
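For concreteness, here is a minimal base-R sketch of the edge case with hypothetical toy data (proDA itself is not called, this only shows the data shape that triggers the problem):

```r
# Column 2 has NAs only beyond the first 50 rows, so a fit restricted to the
# first n_subsample rows sees zero dropouts for that sample.
set.seed(1)
data <- matrix(rnorm(100 * 3, mean = 20), nrow = 100, ncol = 3)
data[81:100, 2] <- NA          # dropouts only in the *later* rows
n_subsample <- 50

subset_rows <- data[seq_len(n_subsample), ]
n_missing_subset <- colSums(is.na(subset_rows))  # column 2: 0 missing values
n_missing_full   <- colSums(is.na(data))         # column 2: 20 missing values
```

With zero missing values in the subset, that sample's dropout curve is unidentifiable from the subsampled rows, which matches the `NA` coefficients described above; using `n_subsample = nrow(data)` sidesteps this because the curves are then always estimated from the full matrix.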