Pre-filtering steps and requirements per sample

joannawolthuis commented 2 years ago

Hi there! I'm faced with a dataset where I can have up to 90% missing values for a given feature (m/z value but equivalent to a protein readout), and that means there are many cases where some samples have all NAs for all of their 3 replicates. What would you recommend in terms of input filtering? Would you make sure that a feature has at least 1 replicate present for every sample? Or set some kind of maximum missing value threshold (i.e. don't include features that have more than X% of samples missing)?

Currently I remove all m/z that have more than 90% of samples missing, but that leaves me with approx. 60 000 features and approx. 1500 samples.

I'm noticing that I can run the algorithm but it's not converging (at least not after 48 hours). Thanks :)

joannawolthuis commented 2 years ago

Some extra info: I have 460 samples with 3 replicates each. Multiple samples (and their replicates) can belong to a given condition, so that does make it slightly different compared to your example synthetic datasets (where it seems 1 sample belongs to 1 condition?).

const-ae commented 2 years ago

Hi Joanna,

Oh wow, that sounds like an impressively big dataset! I have never tested the package with more than a few dozen samples.

To estimate how long proDA will run, subset your data (e.g., to 1,000 features and 10 samples) and time the run. Then double the size of the data and measure the duration again. This will allow you to extrapolate how long proDA might run on the full data. It might be that proDA's runtime will be unacceptably long. In that case, I recommend switching to an imputation-based approach like DEP, which should be orders of magnitude faster.
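The extrapolation could be sketched roughly as follows. This is a generic Python illustration and not part of proDA: it assumes the runtime follows a power law in the number of matrix cells (features × samples), which may not hold if features and samples scale differently. The names `fit_power_law` and `extrapolate_runtime` are made up for this example, and the timings have to come from your own subset runs.

```python
import math


def fit_power_law(sizes, seconds):
    """Least-squares fit of log(seconds) = log(a) + b * log(size).

    sizes   : problem sizes of the timed subsets (e.g. features * samples)
    seconds : measured runtimes for those subsets
    Returns (a, b) such that seconds is approximately a * size ** b.
    """
    if len(sizes) != len(seconds) or len(sizes) < 2:
        raise ValueError("need at least two (size, seconds) measurements")
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in seconds]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # scaling exponent
    a = math.exp(mean_y - b * mean_x)   # prefactor
    return a, b


def extrapolate_runtime(sizes, seconds, target_size):
    """Predict the runtime at target_size from the fitted power law."""
    a, b = fit_power_law(sizes, seconds)
    return a * target_size ** b
```

With two measurements (the subset and the doubled subset) the fit is exact, so a third or fourth doubling gives a better idea of whether the power-law assumption is reasonable before trusting the extrapolated number.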

> What would you recommend in terms of input filtering? Would you make sure that a feature has at least 1 replicate present for every sample? Or set some kind of maximum missing value threshold [...]?

I find it difficult to give generic advice on this without a better understanding of your data and your research question. In general, it is not necessary to filter out samples where the protein is missing in all replicates; proDA can handle such cases. In fact, these could be particularly interesting because completely missing a protein in one condition and completely observing it in another could indicate a large fold-change. However, if a protein is missing in most samples, it will be difficult to reliably establish whether it changes between conditions, so you can save time by not considering it.
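As a rough illustration of that kind of filter, here is a minimal NumPy sketch. It assumes a features × samples matrix with `NaN` for missing values; the function names and the `max_missing` cutoff are placeholders for this example, not proDA functions or recommended values.

```python
import numpy as np


def missing_fraction(x):
    """Fraction of samples in which each feature (row) is missing."""
    return np.isnan(x).mean(axis=1)


def filter_features(x, max_missing):
    """Keep features whose overall missing fraction is <= max_missing.

    Returns the filtered matrix and the boolean mask of kept rows.
    """
    keep = missing_fraction(x) <= max_missing
    return x[keep], keep


def observed_fraction_by_condition(x, conditions):
    """Fraction of observed values per feature within each condition.

    conditions : one label per sample (column of x)
    Returns (labels, frac), where frac has shape (n_features, n_labels).
    A row with values like [1.0, 0.0] marks a feature that is fully
    observed in one condition and fully missing in another.
    """
    conditions = np.asarray(conditions)
    labels = np.unique(conditions)
    frac = np.column_stack(
        [(~np.isnan(x[:, conditions == c])).mean(axis=1) for c in labels]
    )
    return labels, frac
```

Filtering on the overall missing fraction, rather than requiring an observation in every sample, keeps the features that are absent in one condition but present in another, which are the potentially interesting cases mentioned above.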

Best, Constantin

joannawolthuis commented 2 years ago

Hi Constantin, thanks for your reply and time :) I discovered that with over 90% missing values for a given m/z, the algorithm won't run (messages such as 'system is computationally singular') but at 80% it will work fine. I can take a look at DEP but I really like how proDA finds meaning in missing values.

I have quite some more questions and I'm wondering if the GitHub issues tab is the most suitable place to discuss them, haha. Is it ok if I ask them in this thread? If not, please let me know how I should best contact you.

(asked earlier, you can close that issue if you want) Is it possible to calculate the fold-change between conditions, similar to how you calculate the 't-test' using the posteriors(?) for a given protein in each condition?
If I wanted to train a classifier on the completed matrix, would I need to run proDA separately on training and testing sets to prevent information leakage?
Would you think it's safe to train a classifier on the completed matrix at all? Since you did not recommend univariate testing on the completed matrix.

const-ae / proDA

Pre-filtering steps and requirements per sample #17