KechrisLab / multiMiR

Development repository for the multiMiR database's R API

Entrez IDs in results table match to multiple genes #55

Closed mattbcvs closed 2 months ago

mattbcvs commented 10 months ago

Hi there,

I'm reporting what looks like an error: several gene symbols in my predicted targets match the same Entrez ID.

Entrez ID 801 matches CALM1 (as expected: https://www.ncbi.nlm.nih.gov/gene/?term=801%5Buid%5D) but also CALM2 (which looks like an error, since CALM2 is Entrez ID 805: https://www.ncbi.nlm.nih.gov/gene/?term=805%5Buid%5D) and CALM3 (also an apparent error, since CALM3 is Entrez ID 808: https://www.ncbi.nlm.nih.gov/gene/808).
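
For anyone who wants to check their own results for the same pattern, here is a minimal Python sketch (it is not part of multiMiR). It assumes the results table has been exported as rows with `target_entrez` and `target_symbol` fields; those field names and the helper `symbols_per_entrez` are assumptions for illustration. The sample rows mirror the CALM case above, plus one unambiguous gene for contrast.

```python
from collections import defaultdict

def symbols_per_entrez(rows):
    """Return {entrez_id: sorted symbols} for IDs linked to more than one symbol."""
    by_id = defaultdict(set)
    for row in rows:
        entrez = row.get("target_entrez")
        symbol = row.get("target_symbol")
        # Skip rows where either field is missing or empty.
        if entrez and symbol:
            by_id[entrez].add(symbol)
    return {e: sorted(s) for e, s in by_id.items() if len(s) > 1}

# Assumed shape of an exported results table (field names are hypothetical).
rows = [
    {"target_entrez": "801", "target_symbol": "CALM1"},
    {"target_entrez": "801", "target_symbol": "CALM2"},
    {"target_entrez": "801", "target_symbol": "CALM3"},
    {"target_entrez": "7157", "target_symbol": "TP53"},
]

conflicts = symbols_per_entrez(rows)
print(conflicts)
```

Any ID that shows up in `conflicts` is a candidate for the kind of mismatch described here.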

Seems a possible bug to fix?

Thanks, Matt

smahaffey commented 8 months ago

@mattbcvs Thank you. Yes, I have struggled with IDs in each update. Each source may be using a different version of the IDs, some of which are quite out of date, so in some cases multiple older IDs have since been merged, and some IDs no longer exist in current annotations.

In the next version, once each source is loaded and the sources that aren't being updated are copied over, I will attempt to find and resolve these types of errors.

If anyone is willing to help beta test and find some of these instances, it would be extremely helpful. As you did, please mention specific genes that match but should not. We can track these down and then look for similar instances of the same problem to resolve.

I have been able to develop scripts for cases where the changes are more straightforward, but there may be many more that require manual review before the appropriate changes can be made.

smahaffey commented 2 months ago

I have done what I can to resolve these types of issues in version 2.4 of the database. I expect to make the updated database public in the next day or two.

There are still related issues where Gene Symbols link to multiple Entrez (NCBI) IDs or Ensembl IDs. Where possible, I have consolidated targets with missing information into the same target entry in the database. The IDs that I have not been able to resolve usually involve Gene Symbols that link to multiple Entrez and Ensembl IDs. This generally occurs because old IDs have been discontinued in one of those databases, have been split into multiple IDs, etc.

Without time to manually curate these IDs, there are few options. We can work to build a pipeline that looks up each ID in the source database and tries to replace it with the current ID. I think in most cases this can't be automated well, because it isn't as straightforward as following linear links to a current ID.
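
To illustrate why this is hard to automate, here is a minimal Python sketch of the lookup step. The `history` mapping, the `resolve_id` helper, and every ID in the example are hypothetical placeholders (not real Entrez or Ensembl IDs). Each discontinued ID maps to a list of replacement IDs, and an empty list means the ID was withdrawn with no replacement. Only a simple linear chain resolves automatically; splits, withdrawals, and loops are returned for manual review.

```python
def resolve_id(gene_id, history, max_hops=20):
    """Follow replacement links from gene_id.

    Returns (status, ids), where status is one of
    "resolved", "split", "withdrawn", "cycle", or "unresolved".
    """
    current = gene_id
    seen = {current}
    for _ in range(max_hops):
        if current not in history:
            # No further replacement recorded: treat as the current ID.
            return ("resolved", [current])
        replacements = history[current]
        if not replacements:
            return ("withdrawn", [])
        if len(replacements) > 1:
            # One old ID became several: needs a human decision.
            return ("split", list(replacements))
        current = replacements[0]
        if current in seen:
            return ("cycle", [current])
        seen.add(current)
    return ("unresolved", [current])

# Hypothetical history table, for illustration only.
history = {
    "OLD_A": ["OLD_B"],           # linear chain: OLD_A -> OLD_B -> NEW_1
    "OLD_B": ["NEW_1"],
    "OLD_C": ["NEW_2", "NEW_3"],  # split into two IDs
    "OLD_D": [],                  # withdrawn, no replacement
}

for gene_id in ["OLD_A", "OLD_C", "OLD_D", "NEW_1"]:
    print(gene_id, resolve_id(gene_id, history))
```

In this sketch only `OLD_A` and `NEW_1` come back as "resolved"; the other cases are the ones that would still need manual curation.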

I will work on modifying the R package, either by flagging results that may be partial or overly inclusive depending on which ID is used to look up a gene, or by adding an option to expand or limit the results based on the target IDs used.
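
As a rough sketch of what such a flag could look like (written in Python for illustration, and not the R package's actual implementation), the function below marks every result row whose gene symbol links to more than one Entrez ID. The field names `target_symbol` and `target_entrez`, the `ambiguous_id` flag, and the sample symbols and IDs are all assumptions.

```python
from collections import defaultdict

def flag_ambiguous(rows):
    """Return copies of rows with a boolean 'ambiguous_id' field added."""
    ids_by_symbol = defaultdict(set)
    for row in rows:
        symbol = row.get("target_symbol")
        entrez = row.get("target_entrez")
        if symbol and entrez:
            ids_by_symbol[symbol].add(entrez)
    flagged = []
    for row in rows:
        linked = ids_by_symbol.get(row.get("target_symbol"), set())
        # Flag the row when its symbol is linked to several distinct IDs.
        flagged.append(dict(row, ambiguous_id=len(linked) > 1))
    return flagged

# Hypothetical rows: GENE_X links to two IDs, GENE_Y to one.
rows = [
    {"target_symbol": "GENE_X", "target_entrez": "ID_1"},
    {"target_symbol": "GENE_X", "target_entrez": "ID_2"},
    {"target_symbol": "GENE_Y", "target_entrez": "ID_3"},
]

flagged = flag_ambiguous(rows)
for row in flagged:
    print(row)
```

A caller could then filter on `ambiguous_id` to either limit results to unambiguous targets or expand a search to every ID linked to the symbol.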