Working with annotations

igarcia17 commented 10 months ago

Hello Dr. Yang

I am trying to add annotations to my fine-mapping analysis using your tool CARMA, which has been extremely useful for our research. However, there are some issues that I'd like to discuss with you, just to be sure.

How do you handle missing data? Many of my SNPs don't have complete records for all the annotations that I use; for instance, many of them are intronic and don't have SIFT or PolyPhen scores. I see that in the example files of the vignette many entries have a value of '0' for some annotations: should I treat 'N/A' as 0 in my data?

I'd also like to ask you about your sources for the annotations. For the moment, I have been able to get data from GTEx eQTLs and CADD scores, including their annotations, which makes a total of about 300 features. Unfortunately, I am having trouble accessing some of those used in the publication, such as the PolyFun ones. In your experience, which database provides the most informative features?

Thank you in advance for reading. Kind regards, Inés

ZikunY commented 10 months ago

Hi Ines,

For the annotation part, if the numeric values of the annotations reflect the strength or the credibility of SNPs being functional, then it is OK to set the missing annotations to 0.
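As a minimal sketch of that zero-filling step, assuming a small made-up annotation table (the column names and values below are hypothetical, not taken from CARMA or its vignette): the table is converted to a numeric matrix first and the missing entries are then set to 0. Note that the condition above matters for scores that run the other way. SIFT, for example, assigns lower values to more damaging variants, so the sketch flips it to 1 - SIFT before filling; that transformation is an illustrative assumption, not a CARMA requirement.

```r
# Hypothetical annotation table: two intronic SNPs lack SIFT and PolyPhen scores.
annot <- data.frame(
  CADD     = c(12.3, 4.1, 25.0),
  SIFT     = c(0.02, NA, NA),
  PolyPhen = c(0.95, NA, NA)
)

# Assumption for illustration: flip SIFT so that larger values mean
# "more likely functional", matching the direction of the other columns.
annot$SIFT_strength <- 1 - annot$SIFT
annot$SIFT <- NULL

# Convert to a numeric matrix, then replace the missing entries with 0.
annot_mat <- as.matrix(annot)
annot_mat[is.na(annot_mat)] <- 0

# Sanity checks before passing the matrix on.
stopifnot(is.matrix(annot_mat), is.numeric(annot_mat), !anyNA(annot_mat))
```

With this direction, a filled-in 0 reads as "no evidence of function" in every column.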

Loosely speaking, the selection of annotations is quite arbitrary and depends very much on the data that you have, such as the features of the phenotypes. Besides the common annotations, such as CADD, it would be better to check the genetics literature to get an idea of which annotations are most effective or widely recognized in your field, or related to the phenotypes in your data.

Please let me know if there are any other questions. Thanks.

Best, Zikun

igarcia17 commented 9 months ago

Many thanks for your valuable advice. I have been able to build a custom annotation matrix that I can be proud of. However, I tried what you suggested about the missing values and the outcome is quite unexpected.

When I run CARMA with annotations in a data frame that has missing values, I get some warnings related to glmnet(), but it manages to go on and I get my results. If I instead transform the missing values to 0, using the command:

    annot[is.na(annot)] <- 0

I get the following error after the first 'burning time', and the function is not able to progress:

    Error in matrix(NA, nrow = 0, ncol = ncol(w.list[[1]])) : non-numeric matrix extent

I also find it strange that, after a couple of months, I have run CARMA on the same data with no annotations and the results have changed drastically: before there were 6 SNPs with a PIP above 0.99, whereas now the highest PIP is 0.33.

Do you have any suggestion about what is happening? Thank you very much for your collaboration. Kind regards, Ines

ZikunY commented 9 months ago

Hi Ines,

Sorry for the late response; it has been crazy for the past couple of weeks.

First of all, if there were 6 SNPs with a PIP over 0.99, that almost surely meant data inconsistencies or the existence of outliers. The first thing to do is to check the Z-scores of these 6 SNPs, or of whichever SNPs have a PIP of 0.99, and see whether those Z-scores make sense. The current results, i.e. a highest PIP of 0.33, make much more sense, given that there is always LD forcing the SNPs to share the PIP.
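A rough sketch of that check, assuming the Z-scores and PIPs have been collected in one data frame (the data frame, its column names, and its values are all made up for illustration; they are not CARMA output objects):

```r
# Hypothetical summary statistics with the PIPs from an earlier run attached.
sumstat <- data.frame(
  SNP = paste0("rs", 1:5),
  Z   = c(1.2, 9.8, -0.4, 10.1, 2.3),
  PIP = c(0.01, 0.995, 0.00, 0.999, 0.02),
  stringsAsFactors = FALSE
)

# Pull out the SNPs with a near-certain PIP and look at their Z-scores.
flagged <- sumstat[sumstat$PIP > 0.99, ]

# Two-sided p-value implied by each Z-score, as a quick plausibility check.
flagged$P <- 2 * pnorm(-abs(flagged$Z))

print(flagged)
```

If a flagged Z-score looks out of line with its neighbours in strong LD, that SNP is a candidate for the inconsistency described above.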

Second, for the annotation part, I think it is more about the data itself than about CARMA. I would recommend checking the status of your functional data before running CARMA, for example with "is.numeric(w.list[[1]])". If it returns FALSE, then double-check the annotation input to see if there is any mistake; this can happen sometimes.
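A small sketch of such a pre-flight check follows. The helper name check_annotation and the example inputs are hypothetical. Two base R facts are relevant here: is.numeric() returns FALSE for a data frame (it is a list internally), even if every column is numeric, and ncol() returns NULL for an object without dimensions, which matrix() rejects with the same message as the error reported above. Whether either is the actual cause in this case is an assumption to verify against your own input.

```r
# Hypothetical helper: summarise whether an annotation object looks usable.
check_annotation <- function(w) {
  c(
    has_dim    = !is.null(dim(w)),   # matrix or data frame, not a bare vector
    is_numeric = is.numeric(w),      # FALSE for data frames and text matrices
    no_missing = !any(is.na(w))      # no NA entries left
  )
}

# A data frame where 'N/A' was read as text: fails the numeric check.
bad <- data.frame(CADD = c("12.3", "N/A"), stringsAsFactors = FALSE)

# A numeric matrix with missing values already set to 0: passes all checks.
good <- as.matrix(data.frame(CADD = c(12.3, 0)))

print(check_annotation(bad))
print(check_annotation(good))

# ncol() of a plain vector is NULL; passing NULL as ncol makes matrix() fail.
msg <- tryCatch(
  matrix(NA, nrow = 0, ncol = ncol(c(1, 2, 3))),
  error = function(e) conditionMessage(e)
)
print(msg)
```

Converting the annotation table with as.matrix() after fixing any text columns, and then re-running the check, is one way to confirm the input is a numeric matrix before calling CARMA.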

Best, Zikun

ZikunY / CARMA

Working with annotations #18