andreyshabalin / MatrixEQTL

Matrix eQTL: Ultra fast eQTL analysis via large matrix operations

Excluding NA instead of imputing #24

Closed nienkevanunen closed 1 year ago

nienkevanunen commented 1 year ago

Hi, if I understand correctly, missing values are imputed using the mean. In my data, the same genes are not measured for each person, so I have a lot of missing data. I do not want to impute these values; I simply want to exclude the people for whom a given gene was not measured. Right now, my only solution is to run MatrixEQTL separately per gene, so that I can exclude the specific people who were not measured for that gene. By doing this, however, I lose much of the speed benefit of MatrixEQTL for large data, and it makes my whole pipeline take much longer.

andreyshabalin commented 1 year ago

Hi Nienke,

I have only one suggestion. You may get better performance if you run genes in groups with the same missingness pattern.

Namely, you can run all genes without missing values in one batch, then all the genes with just sample 1 missing in another batch, and so on.

Depending on the pattern of missingness in your data, this may give you a great improvement in performance.
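A rough sketch of what I mean, assuming an expression matrix expr_mat (genes x samples) and a genotype matrix snp_mat over the same samples (the names and surrounding setup are illustrative, not part of Matrix eQTL):

```r
library(MatrixEQTL)

## Group genes by their pattern of missing samples.
## expr_mat: genes x samples; snp_mat: SNPs x samples (illustrative names).
patterns    <- apply(is.na(expr_mat), 1, function(x) paste(which(x), collapse = ","))
gene_groups <- split(rownames(expr_mat), patterns)

## Run one Matrix eQTL analysis per missingness pattern,
## keeping only the samples observed for that group of genes.
results <- lapply(gene_groups, function(gene_ids) {
  keep <- !is.na(expr_mat[gene_ids[1], ])

  gene <- SlicedData$new()
  gene$CreateFromMatrix(as.matrix(expr_mat[gene_ids, keep, drop = FALSE]))

  snps <- SlicedData$new()
  snps$CreateFromMatrix(as.matrix(snp_mat[, keep, drop = FALSE]))

  Matrix_eQTL_main(
    snps = snps,
    gene = gene,
    output_file_name  = tempfile(),
    pvOutputThreshold = 1e-5,
    useModel          = modelLINEAR,
    verbose           = FALSE)
})
```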

Andrey

nienkevanunen commented 1 year ago

Thank you for your suggestion. Unfortunately, that won't really work for my specific dataset. The majority of genes have different missingness patterns, not just because they weren't measured but also because "out-of-detection-limit" measurements were filtered out. So in the end nearly every gene (I'm not actually using genes, but for the sake of simplicity I'm calling it that 😛) ends up having a unique group of samples.

andreyshabalin commented 1 year ago

Hi Nienke,

Sorry about that. I can only offer help with fast subsetting of the columns using the SlicedData class (http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/html/SlicedData-class.html) and with parallelizing across CPU cores using the parallel package.
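For example, a minimal sketch of subsetting the columns (samples) of a SlicedData object with its ColumnSubsample() method; the data objects here are illustrative:

```r
library(MatrixEQTL)

## snp_mat (SNPs x samples) and expr_mat (genes x samples) are illustrative.
snps <- SlicedData$new()
snps$CreateFromMatrix(as.matrix(snp_mat))

## Samples that were actually measured for one particular gene.
keep <- which(!is.na(expr_mat["gene1", ]))

snps_sub <- snps$Clone()         # work on a copy; the full object stays intact
snps_sub$ColumnSubsample(keep)   # keep only the selected samples, in place
```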

Andrey

nienkevanunen commented 1 year ago

Hi Andrey,

I see. Right now what I do is loop through each row of my dataframe, exclude the missing columns, and then turn that single row into a SlicedData object, since that is required for the mapping. But you're saying it is better to do this the other way around: make my whole dataframe into a SlicedData object and then loop through each slice with e.g. getSlice()? If so, how do I then exclude the columns with missing values for each slice?
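For reference, the per-gene loop I have now looks roughly like this (illustrative names, not my exact pipeline):

```r
library(MatrixEQTL)

## expr_df (genes x samples) and snp_mat (SNPs x samples) stand in for my real data.
for (g in rownames(expr_df)) {
  keep <- which(!is.na(expr_df[g, ]))   # samples measured for this gene

  gene <- SlicedData$new()
  gene$CreateFromMatrix(as.matrix(expr_df[g, keep, drop = FALSE]))

  snps <- SlicedData$new()
  snps$CreateFromMatrix(as.matrix(snp_mat[, keep, drop = FALSE]))

  res <- Matrix_eQTL_main(
    snps = snps,
    gene = gene,
    output_file_name  = tempfile(),
    pvOutputThreshold = 1e-5,
    useModel          = modelLINEAR,
    verbose           = FALSE)
}
```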

andreyshabalin commented 1 year ago

Hi Nienke,

My idea was to use snps$RowReorder(ordr) to subset the snps SlicedData object to the selected set of samples. In any case, I'm glad you are doing this relatively efficiently, without temporary files.

If I were running this analysis on my computer, I would focus on parallelizing the analysis across 64 CPU cores.
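Something along these lines with the parallel package; gene_groups and run_one_group() are placeholders for the batching logic sketched above, not Matrix eQTL functions:

```r
library(parallel)

## run_one_group(gene_ids) would build the SlicedData objects for one
## missingness pattern and call Matrix_eQTL_main(), as in the earlier sketch.
n_cores <- min(64, detectCores())
results <- mclapply(gene_groups, run_one_group, mc.cores = n_cores)
## On Windows, use makeCluster() + parLapply() instead of mclapply().
```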

A side question: what are the dimensions of your data (numbers of samples, 'snps', 'genes', and covariates)?

Andrey