General RefFreeEWAS Questions: Removing Confounding Probes and Proportions Not Summing to 1

bcm-uga / medepir

MEthylation DEconvolution PIpeline in R

https://bcm-uga.github.io/medepir/


General RefFreeEWAS Questions: Removing Confounding Probes and Proportions Not Summing to 1 #3

Closed danphillips28 closed 3 years ago

danphillips28 commented 3 years ago

Dear medepir developers, Thank you for this lovely tool. It has made learning how to deconvolute my Infinium EPIC data much easier than it would otherwise have been, and the paper was a very pleasant read too. I have been playing around with deconvoluting my data, largely following the pipeline at https://bcm-uga.github.io/medepir/articles/deconvolution.html. I have some quick questions.

1) When removing confounding probes, am I correct in thinking that we remove those that correlate with anything in our metadata, including our factors of interest (e.g. tissue type, in my case)? I have seen this instruction written differently in different places, which has left me slightly confused. For example, in the pipeline I've been following, the entire metadata file is given to the CF_detection function:

    D_CF = medepir::CF_detection(D, exp_grp, threshold = 0.15, ncores = 2)

This will therefore include the factors of interest. However, elsewhere (e.g. https://rpubs.com/paujedynak/reffree_cell_mix_tutorial) specific confounders are selected, and these do not seem to include the factors of interest:

    # Choose factors that are affecting methylation but not influencing cell proportions
    convariates_dataset <- convariates_dataset %>% dplyr::select(factor_1, factor_2, factor_3, factor_4)

Whilst the above does not explicitly exclude factors of interest, the factors usually mentioned for inclusion in this function, such as in the original article, are e.g. age and sex, which are typically nuisance factors.

2) After running RefFreeEWAS through medepir with the code below:

    convariates_dataset <- targets %>% dplyr::select(Slide, age.at.Biopsy, Depot, Sex)
    selected_probes <- medepir::CF_detection(D = bVals, exp_grp = convariates_dataset)
    medepir::plot_k(selected_probes)
    K_optimal_scree <- 4
    cell_mix <- medepir::RFE(selected_probes, nbcell = K_optimal_scree)
    deconvolution_proportions <- (cell_mix$A)

I get the following proportions:

[Figure: Deconvoluting Infinium-page-002]

I would like to know why the proportions do not sum to 1 for each sample for RefFreeEWAS. This is also the case in the example results in the pipeline I shared in the first paragraph, but doesn't seem to be an issue for EDec.

3) Finally, I would like to ask for some pointers on how to progress with these results. Downstream I am running straightforward differentially methylated probe, region, and maybe block analyses, primarily with limma and DMRcate. Importantly, we are not exactly interested in the cell type complexity, but instead in the methylation differences between conditions for one cell type only. Would the solution be to include the proportions of different cell types in each sample as covariates in the limma model matrix? Any pointers would be greatly appreciated.

Best wishes, Daniel

paujedynak commented 3 years ago

Dear Daniel, thanks for your query. I am not a developer of the medepir package but I used to use the medepir pipeline and Houseman's methods in my research, so perhaps my experience may be helpful. Just to clarify, I worked on DNA methylation in placenta in association with some exposure of the mother earlier during pregnancy (e.g., https://doi.org/10.1016/j.envpol.2021.118024). The nature of my work is different from cancer research and this may explain some of the methodological discrepancies.

I will try to address your questions point by point.

1) My understanding of the probe selection proposed by Decamps et al. is that you filter out probes (linearly) associated with factors that may affect methylation but not the cell mix, so that these factors are accounted for when the methylation data are used to estimate cell proportions. In the example provided in the Decamps et al. paper, the entire metadata file is given to the CF_detection function, and the confounding factors are maternal age and child sex (the authors also suggest adding information on smoking status and technical factors, if available).

Since "affect methylation but not the cell mix" may mean different things, in my project I considered two scenarios (and maybe the medepir developers could also comment on my approach): (a) variables affecting methylation but not associated in 'real life' with the cell mix (e.g., I would consider only variables that occur after the cell mix is established, in my case delivery mode, which happens after the placental cell proportions are set); (b) all variables associated with methylation (ignoring the 'real life' chronology explained above). I tested both scenarios, also using different thresholds for significant associations (0.05 and 0.15, the latter as recommended by Decamps et al.). Regardless of the scenario used, I obtained a relatively similar number of retained probes (between 2,000 and 10,000 out of ~400,000) and the same K. Taking this into account, I chose scenario (a), as it is more logical and follows the sparsity rule. This is why, in the example provided in the tutorial, I selected potentially confounding variables and did not use the entire metadata.
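To make the idea of the probe filter concrete, here is a minimal sketch in base R on simulated data. It is not medepir's implementation of CF_detection; the per-probe F-test, the covariate names and the simulated matrix are all assumptions made for illustration. It only shows the principle of dropping probes that are linearly associated with a chosen set of covariates at a given threshold.

```r
# Simulated beta values: 200 probes x 20 samples (all data here is synthetic).
set.seed(1)
n_probes <- 200
n_samples <- 20
D <- matrix(rnorm(n_probes * n_samples, mean = 0.5, sd = 0.02),
            nrow = n_probes,
            dimnames = list(paste0("cg", seq_len(n_probes)), NULL))

# Hypothetical covariates assumed to affect methylation but not the cell mix.
covariates <- data.frame(
  sex   = factor(rep(c("F", "M"), each = n_samples / 2)),
  slide = factor(rep(c("s1", "s2"), times = n_samples / 2))
)

# Make the first five probes strongly sex-associated.
D[1:5, covariates$sex == "M"] <- D[1:5, covariates$sex == "M"] + 0.4

# p-value of the overall F-test for one probe regressed on all covariates.
probe_pvalue <- function(y, covariates) {
  fit <- lm(y ~ ., data = covariates)
  f <- summary(fit)$fstatistic
  unname(pf(f[1], f[2], f[3], lower.tail = FALSE))
}

# Keep only probes that are NOT associated with the covariates.
threshold <- 0.15
pvals <- apply(D, 1, probe_pvalue, covariates = covariates)
D_filtered <- D[pvals > threshold, , drop = FALSE]
dim(D_filtered)
```

Changing the columns of `covariates` is the equivalent of switching between the two scenarios described above: the more variables are included, the fewer probes survive the filter.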

2) The cell proportions may not sum to one. This is normal and results from the constraints that the RefFreeEWAS convergence algorithm applies to the estimates. I cannot find the reference now, but I am fairly sure I saw it in either Houseman et al. 2016 or the supplementary material for that paper.
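A quick way to inspect this is to compute the per-sample totals. The sketch below uses a simulated matrix as a stand-in for `cell_mix$A` and assumes that rows are estimated components and columns are samples; it is worth checking `dim()` on real output before relying on that orientation. The rescaling step is optional and shown only for illustration.

```r
# Simulated stand-in for cell_mix$A: 4 estimated components x 6 samples.
set.seed(42)
A <- matrix(runif(4 * 6, min = 0, max = 0.4), nrow = 4,
            dimnames = list(paste0("component_", 1:4), paste0("sample_", 1:6)))

# Per-sample totals; under a sum-to-one constraint these would all equal 1.
sample_totals <- colSums(A)
round(sample_totals, 3)

# Optional: rescale so that each sample's estimates sum to 1. This changes the
# absolute values but preserves each sample's relative composition.
A_rescaled <- sweep(A, 2, sample_totals, "/")
colSums(A_rescaled)
```

Whether rescaling is appropriate depends on how the estimates are used downstream; as covariates in a regression model, the unscaled values may be just as suitable.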

3) Regarding your final question, you said that you are not exactly interested in the cell type complexity, but instead in the methylation differences between conditions for one cell type only. This is a bit confusing to me: since you are using a reference-free method for your cell type estimation (and not a reference-based one), I think you cannot make biologically relevant inferences about methylation changes in a particular cell type. This is because the reference-free estimates are surrogate variables for cell types and do not necessarily represent the 'real' cell composition of the tissue, neither in the number of cell types nor in their proportions. In my understanding, by estimating the cell mix you just want to capture the part of the methylation variability that is explained by factors other than your predictor variable(s), so that you can add your cell proportion estimates as confounders in your downstream analysis (e.g., regression models), as you would with other confounders. I am not sure if this is clear, so please do not hesitate to contact me directly if you would like to discuss it further!
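As a rough sketch of what "adding the estimates as confounders" could look like, the base R example below builds a design matrix for a paired two-depot comparison with estimated components as extra covariates. All names and values are hypothetical and simulated. One component is left out because, if the proportions sum to 1, all K columns together with the intercept would be linearly dependent.

```r
# Simulated paired design: 6 donors, two depots each (all names hypothetical).
set.seed(7)
targets <- data.frame(
  donor = factor(rep(paste0("d", 1:6), each = 2)),
  depot = factor(rep(c("depotA", "depotB"), times = 6))
)

# Stand-in for t(cell_mix$A): samples in rows, K = 3 estimated components.
props <- matrix(runif(12 * 3), nrow = 12,
                dimnames = list(NULL, paste0("comp", 1:3)))
props <- props / rowSums(props)  # rescaled here so that rows sum to 1

# Add all but one component to the sample sheet (comp1 is dropped).
targets$comp2 <- props[, "comp2"]
targets$comp3 <- props[, "comp3"]

# Depot effect adjusted for donor (pairing) and the estimated components.
design <- model.matrix(~ depot + donor + comp2 + comp3, data = targets)
colnames(design)

# The design should be of full column rank before it is used for model fitting.
qr(design)$rank == ncol(design)
```

A design like this could then be passed to `limma::lmFit()`, with the `depotdepotB` coefficient carrying the depot comparison. With few samples, each added covariate costs a residual degree of freedom, so the number of components included matters.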

Good luck with your research!

danphillips28 commented 3 years ago

Dear Paulina, Thank you very much for your prompt and detailed response, I really appreciate it! I hope you don't mind me following up with more questions right away. I also prefer to post here rather than contacting you directly, so that others with similar questions might find this and benefit, as I have on GitHub many times before.

I will reply to your last point first, to give context to the other two. In this project I have methylation data for two different fat tissues belonging to a number of people (paired design). The aim is to compare the methylome of the fat in these two tissues, then later to overlap the methylome results with RNA-Seq data from the same samples. In fact, I am already at the point of overlapping the two datasets, but given the samples come from human biopsies, we decided cell type mixtures ought to be controlled-for when modelling differential methylation (and transcription). Given that no methylation reference material seems to be available for one of our tissue types, we have opted for a reference-free method.

1) Unfortunately I am still struggling with the precise meaning of "affect methylation but not the cell mix". For my call to CF_detection I have included (array) slide, age.at.Biopsy, tissue.type, and sex. It would seem that slide, as well as any other typical "batch" factors such as date and so on, should be included. However, from your explanation I would also suppose that tissue.type, sex, and probably even age should be removed, because they are likely associated with real and important differences in both methylation and the cell mixture. Have I understood this correctly?

2) This is clear enough, and much what I expected. Thank you.

3) I just wanted to clarify one final thing in response to something you said. You said "the reference-free estimates are surrogate variables for cell types and they do not necessarily represent a 'real' cell composition of the tissue, neither in the number of cell types nor in their proportions." This has caught me a little off-guard, which is without doubt due to my naivety when it comes to statistics, as I am very much beginning my bioinformatics journey. Simply put, does this statement mean that, e.g. when looking at my cell mixture plots above, the proportions of each colour do not actually reflect how much of a predicted cell type is estimated to be present in the mixture, however crudely? I had interpreted the plots as showing that there is a lot more heterogeneity in the biopsies than expected, given that we were trying to sample only one cell type.

Thanks again for your detailed response and many apologies for my simplistic questions!

paujedynak commented 3 years ago

Hi Daniel, I guess I have misunderstood what you meant by "differences between conditions for one cell type only". After your explanation I guess that by 'cell type' you meant your experimental group 1 (fat tissue 1) vs. experimental group 2 (fat tissue 2), and not one of the reference-free estimated cell types. Is this correct?

1) Yes, you understood correctly. I tried both options. In the first configuration, I included all variables suspected to affect methylation and removed probes that were significantly associated with any of them, based on the p-value. In the second, I did the same but a priori restricted the set of variables to those that cannot influence cell composition due to timing (e.g., delivery mode or batch effects). The cell estimates were comparable (the downstream analyses less so, although they shared a common pattern), and I decided to use the restricted set, as explained before. I think there is no consensus about this, and different authors use different confounder sets (see also https://glint-epigenetics.readthedocs.io/en/latest/tissueheterogeneity.html#refactor, another commonly used reference-free method for cell proportion estimation).

I think it can be helpful to think about your research question in light of a trade-off. Using too many variables will leave you with fewer probes, and you risk losing some variability in the methylation data that you would actually like to preserve because it is associated with your predictor of interest. This matters especially because RefFreeEWAS is considered a conservative method that captures most of the methylation variability and thus reduces the number of hits associated with the variables of interest, while ReFACTor is believed to be less stringent. Using too few variables (or no probe selection at all) may result in noise and, in the downstream analysis, false positive associations that reflect cell composition rather than your predictor of interest. In the past, people (including Houseman) also used the 10,000 most variable probes to estimate the cell proportions, so you may also try this approach to see how the cell mix differs and to have a point of reference.

At the end of the day, my experience is that the factor connected to the cell mix that most influences the downstream analyses is K (the number of cell types), which in my case was rather stable across the different approaches. I guess @magrichard or @CDecamps could add some valuable input on this topic.
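For the "most variable probes" alternative mentioned above, a minimal base R sketch on a simulated matrix could look like this. The matrix, its dimensions and the cut-off of 100 are assumptions for the toy example; with real data the cut-off would be 10,000.

```r
# Simulated beta matrix: 1,000 probes x 8 samples (synthetic data).
set.seed(3)
D <- matrix(runif(1000 * 8), nrow = 1000,
            dimnames = list(paste0("cg", 1:1000), paste0("s", 1:8)))

# Number of probes to keep; 100 here only because the toy matrix is small.
n_top <- 100

# Rank probes by their variance across samples and keep the most variable ones.
probe_var <- apply(D, 1, var)
top_idx <- order(probe_var, decreasing = TRUE)[seq_len(min(n_top, nrow(D)))]
D_top <- D[top_idx, , drop = FALSE]
dim(D_top)
```

The reduced matrix can then be used in place of the confounder-filtered one when estimating K and the cell mix, which gives a point of comparison between the two probe selection strategies.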

Since reference-free methods calculate linear transformations of the cell type composition rather than estimating absolute cell counts, I would be very careful about assigning any biological meaning to the estimated proportions, unless you can obtain some reference-based estimates for comparison. We compared cell mixes obtained using RefFreeEWAS and reference-based methods (planet and methylCC) on placental tissue, and the reference-free estimates were not comparable to the reference-based ones. Note: although reference-free and reference-based cell proportion estimates may differ substantially, this does not mean that the downstream analyses will yield very different results after adding these cell proportions to your regression equation; the results may still be comparable if the reference-free cell mix (especially K) was estimated correctly.

magrichard commented 3 years ago

Dear Daniel,

Thank you for your interest in the pipeline. And thank you @paujedynak for your detailed answers!

Please find below point-by-point comments on the ongoing discussion:

1) To follow up on Paulina's comment: defining which confounders you want to include in your analysis is really up to you, as you are the data analyst and the expert on your specific scientific question. There is always a risk of removing probes that are also associated with the biological processes you are interested in. This is why it is very important to bring your own expertise to identify the relevant biological priors you need in your analysis framework.

2) Yes indeed, whether the proportions sum to 1 is determined by the constraints built into each deconvolution algorithm, which differ between RefFreeEWAS, EDec and MeDeCom.

3) The difficulty in using reference-free methods lies in the biological interpretation of the identified factors. I would be less categorical than Paulina on these aspects. I think it is of interest to look for biological meaning. However, you need to keep in mind that you do not per se identify cell types, but more probably a combination of cell functions. A classic way to start the biological interpretation would be to run an enrichment analysis on the T matrix (i.e. the factor reference profiles). You can also look at some markers of interest.
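As one possible starting point for such an enrichment analysis, the sketch below selects, for each factor, the probes whose profile differs most from the other factors. The simulated T matrix, the ranking criterion (absolute difference from the mean of the other factors) and the cut-off of 20 probes are all assumptions for illustration, not part of medepir.

```r
# Simulated stand-in for the T matrix: 500 probes x 3 factors (synthetic data).
set.seed(11)
T_mat <- matrix(runif(500 * 3), nrow = 500,
                dimnames = list(paste0("cg", 1:500), paste0("factor_", 1:3)))

# Number of probes to report per factor (arbitrary for this example).
n_top <- 20

# For each factor, rank probes by how far its profile lies from the mean
# profile of the remaining factors, and keep the names of the top probes.
top_probes <- lapply(seq_len(ncol(T_mat)), function(k) {
  delta <- T_mat[, k] - rowMeans(T_mat[, -k, drop = FALSE])
  names(sort(abs(delta), decreasing = TRUE))[seq_len(n_top)]
})
names(top_probes) <- colnames(T_mat)
str(top_probes)
```

Each probe list could then be mapped to genes and passed to an enrichment tool of your choice, or checked against known marker regions.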

danphillips28 commented 3 years ago

Thank you both very much for your time. I have learnt a lot from your responses and I'm sure others will down the line as well.

I have decided to keep only those covariates I am sure are totally unimportant for my biological question, namely ChipID and slide.

I will play around with more methods but feel I will ultimately end up returning to this one, and will hopefully feel more sure about what I've done once I properly include the estimations in my modelling and see the results. Until then... Thanks again! Daniel