ajaynadig / bhr

Suite of heritability and genetic correlation estimation tools for exome-sequencing data
MIT License

How to change the gene name properly from Genebass #15

Closed KazutoshiM closed 1 year ago

KazutoshiM commented 1 year ago

Dear BHR team, thanks for developing such a useful tool! I have some questions about the summary statistics files retrieved from Genebass. The gene names in the downloaded files are gene symbols, and I would like to convert them to Ensembl IDs so that I can align them with my local summary statistics files and run the analysis. However, I have run into some problems.

I have tried several methods in R, such as 'biomaRt' and 'org.Hs.eg.db'. However, some gene names in Genebass turn out to be aliases of the official symbols (HGNC, I suppose), so I get missing values after the conversion. I then tried querying a database that contains both the alias symbols and the official symbols, but this raises a new problem: the same alias can correspond to several different official symbols, which makes the whole mapping process complicated.

Thus, I would like to ask whether there is a neat and quick way to do this mapping. I notice that in your 'Height&BMI' example the gene names are Ensembl IDs, and I would like to know how you produced them. Also, do missing values for some genes meaningfully affect the final results? If not, I would simply omit them, since some gene names may still be unmapped after the conversion.

Thanks a lot! Kazutoshi

danjweiner commented 1 year ago

Hi Kazutoshi,

Thanks for reaching out!

You're right that mapping gene symbols to Ensembl IDs is an imperfect science, as there is occasionally ambiguity when converting between the naming conventions. I've uploaded a mapping file that we created to the reference_files directory of this repository. You'll notice that each gene name maps to no more than one Ensembl ID, while a single Ensembl ID often has multiple gene names. Feel free to use that conversion if you'd like.
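As a rough illustration of how a many-symbols-to-one-ID mapping like this could be applied, here is a minimal Python sketch. It is not the BHR authors' code. The column names (`gene_symbol`, `ensembl_id`), the comma-delimited layout, and the toy rows (including the placeholder IDs) are all assumptions made for the example; the actual file in reference_files may be laid out differently.

```python
import csv
import io

# Toy stand-in for the mapping file. Column names and IDs are hypothetical
# placeholders, not real Ensembl identifiers.
MAPPING_CSV = """gene_symbol,ensembl_id
APOB,ENSG_TOY_0001
GENE_X,ENSG_TOY_0002
GENE_X_ALIAS,ENSG_TOY_0002
"""

# Toy stand-in for a per-gene summary statistics file keyed by gene symbol.
SUMSTATS_CSV = """gene_symbol,burden_score
APOB,1.2
GENE_X_ALIAS,0.4
NOT_IN_MAPPING,0.1
"""


def load_mapping(handle, symbol_col="gene_symbol", id_col="ensembl_id"):
    """Read a symbol -> ID table, checking each symbol has at most one ID."""
    mapping = {}
    for row in csv.DictReader(handle):
        symbol, ensembl_id = row[symbol_col], row[id_col]
        # Several symbols may share one ID, but one symbol must not map to two IDs.
        if mapping.get(symbol, ensembl_id) != ensembl_id:
            raise ValueError(f"{symbol} maps to more than one ID")
        mapping[symbol] = ensembl_id
    return mapping


def annotate(handle, mapping, symbol_col="gene_symbol"):
    """Attach an ID to each row; collect symbols that have no mapping."""
    mapped, unmapped = [], []
    for row in csv.DictReader(handle):
        ensembl_id = mapping.get(row[symbol_col])
        if ensembl_id is None:
            unmapped.append(row[symbol_col])
        else:
            mapped.append({**row, "ensembl_id": ensembl_id})
    return mapped, unmapped


mapping = load_mapping(io.StringIO(MAPPING_CSV))
mapped_rows, unmapped_symbols = annotate(io.StringIO(SUMSTATS_CSV), mapping)
print(f"mapped {len(mapped_rows)} genes, could not map {unmapped_symbols}")
```

Because the lookup is keyed on the symbol, an alias and its official symbol both resolve to the same ID, and anything absent from the table is reported rather than silently dropped.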

To your second question about whether gene dropout from ambiguity is problematic for BHR: the short answer is probably not. BHR is a regression across genes, so dropping a few genes from the regression won't change the output meaningfully. It would matter if you lost an important gene (e.g., APOB for LDL levels), but it is unlikely that important genes are affected by ambiguous naming schemes (something you can check on your own for your traits).
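One way to run that check yourself is sketched below in Python: compute the fraction of genes lost in the conversion and see whether any gene you consider important for the trait is among them. The function name `dropout_report`, the toy symbols, and the choice of key genes are assumptions for illustration; which genes count as important depends on your trait.

```python
def dropout_report(all_symbols, unmapped_symbols, key_genes):
    """Return (fraction of genes dropped, sorted list of key genes that were dropped)."""
    all_set = set(all_symbols)
    # Only count unmapped symbols that actually occur in the input gene list.
    dropped = set(unmapped_symbols) & all_set
    fraction = len(dropped) / len(all_set) if all_set else 0.0
    lost_key = sorted(set(key_genes) & dropped)
    return fraction, lost_key


# Toy inputs: symbols in the downloaded file, symbols that failed to map,
# and genes assumed to matter for the trait (APOB is the example from the reply).
all_symbols = ["APOB", "GENE_X", "GENE_Y", "OLD_ALIAS"]
unmapped_symbols = ["OLD_ALIAS"]
key_genes = ["APOB"]

fraction_dropped, lost_key_genes = dropout_report(all_symbols, unmapped_symbols, key_genes)
print(f"fraction dropped: {fraction_dropped:.2%}; key genes lost: {lost_key_genes}")
```

If the dropped fraction is small and the list of lost key genes is empty, omitting the unmapped genes is consistent with the reasoning above.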

Hope this is helpful, Dan

ajaynadig commented 1 year ago

Closing this issue. Please let us know if you have any additional questions.

KazutoshiM commented 1 year ago

Thanks so much for your prompt reply! I am now using your file for the conversion.