Open jjgao opened 5 years ago
Current documentation of how we handle miRNA: https://docs.google.com/document/d/1GeN2GHjSDFIc8hhl6v72JRoQbBw6szTRv7IDzcjXu5g/edit
(copied from doc to here)
The Gene table in our database currently contains the miRNA to alias mapping maintained initially by JJ. This should not be updated with the new gene data.
The data sets contain the new MIRXXX naming. This is mapped to the pseudo miRNA that JJ maintains. Example:
Therefore, as long as our data files have the aliases (MIRXXX) and the gene table has the older order, the code will import correctly and duplicate the entries that need to be duplicated.
Since the gene table has been reverted to older miRNAs, we just need to reimport all the studies (especially expression data).
Comment: I was looking at the documentation for cBioPortal setup and tracing through the code for the microRNA importing with @yichaoS just now. The functionality described above (in ImportMicroRNAIDs.java) is called by ImportGeneData.java. But I don't see any reference in the current documentation telling users who are installing the portal to run the ImportGeneData functionality. Instead, users are told to initialize their database using the seed database, importing it directly with the mysql command line tool. That import contains the positively numbered microRNA gene entries from NCBI but does not contain the negatively numbered microRNAs. So if we want remote users to be able to install and configure a gene database that contains the negatively numbered microRNA gene entries (and aliases), we should probably update the documentation to show how to run this code. There is a perl script for calling ImportMicroRNAIDs.java as a standalone module, but the main function is currently disconnected and it would not function properly. So a little tuning of the code may also be needed.
The fake negative Entrez gene IDs give us a lot of headaches. Let me document how we generate them here and see if we can solve the issue.
The miRNA ID mapping resides in the id-mapping-mirbase.txt file under reference-data in our main mercurial repo. Also uploaded here.
This is the Java code to import this file.
Following the code, it uses DaoGene.addGeneWithoutEntrezGeneId.
Following the code again, you’ll see getNextFakeEntrezId() is called to generate a negative ID:
Note that we also have negative Entrez gene IDs for phosphoproteins, so the IDs differ depending on which is imported first. So basically these IDs are effectively random: they depend on import order and are not stable from one database to another. :(
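To make the ordering problem concrete, here is a minimal sketch of a counter-based allocator. It is not the actual DaoGene code: the class name, the starting value of -1, the in-memory map, and the placeholder symbols (MIRNA_A, PHOSPHO_X) are all assumptions for illustration. It only shows why a shared decrementing counter hands out different IDs when miRNAs and phosphoproteins are imported in a different order.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the behaviour described above; NOT the real DaoGene.
class FakeEntrezIdDemo {
    // Assumption: fake IDs start at -1 and count downwards.
    private long nextFakeId = -1;

    // Symbol -> fake Entrez ID, in insertion order.
    private final Map<String, Long> ids = new LinkedHashMap<>();

    // Loosely mirrors the idea of addGeneWithoutEntrezGeneId: take the next free negative ID.
    long addGeneWithoutEntrezGeneId(String symbol) {
        Long existing = ids.get(symbol);
        if (existing != null) {
            return existing;
        }
        long id = nextFakeId--;
        ids.put(symbol, id);
        return id;
    }

    // Imports the given batches in order into a fresh "database" and returns the assigned IDs.
    static Map<String, Long> importInOrder(List<List<String>> batches) {
        FakeEntrezIdDemo dao = new FakeEntrezIdDemo();
        for (List<String> batch : batches) {
            for (String symbol : batch) {
                dao.addGeneWithoutEntrezGeneId(symbol);
            }
        }
        return dao.ids;
    }

    public static void main(String[] args) {
        // Placeholder symbols, not real gene names.
        List<String> mirnas = Arrays.asList("MIRNA_A", "MIRNA_B");
        List<String> phospho = Arrays.asList("PHOSPHO_X");

        // Same genes, different import order, different IDs.
        System.out.println(importInOrder(Arrays.asList(mirnas, phospho)));
        System.out.println(importInOrder(Arrays.asList(phospho, mirnas)));
    }
}
```

In this sketch MIRNA_A gets -1 when miRNAs go first and -2 when the phosphoprotein goes first, which is the kind of instability described above.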
One way to solve it is to separate the miRNA IDs into a separate table and connect them with the entity table (the same for phosphoproteins).
Another, easier fix is to assign the negative Entrez gene IDs in the id-mapping-mirbase.txt file itself, e.g. starting with -10000001.
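A rough sketch of what the second option could look like on the import side. The file layout here is an assumption: I am pretending id-mapping-mirbase.txt gains a leading tab-separated column holding the fixed negative ID, followed by the symbol, which is not the current format. The class and method names are made up for illustration as well. The point is only that the importer would read the ID from the file instead of asking a counter for the next one, so the result no longer depends on import order.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical parser; assumes lines of the form "<negative id>\t<symbol>[\t...]".
class FixedMirnaIdDemo {

    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> idBySymbol = new LinkedHashMap<>();
        Set<Long> seenIds = new HashSet<>();
        for (String line : lines) {
            // Skip blank lines and comment/header lines (assumed to start with '#').
            if (line.trim().isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] cols = line.split("\t");
            if (cols.length < 2) {
                throw new IllegalArgumentException("Expected at least 2 columns: " + line);
            }
            long id = Long.parseLong(cols[0].trim());
            if (id >= 0) {
                throw new IllegalArgumentException("Fake Entrez ID must be negative: " + line);
            }
            if (!seenIds.add(id)) {
                throw new IllegalArgumentException("Duplicate fake Entrez ID: " + id);
            }
            idBySymbol.put(cols[1].trim(), id);
        }
        return idBySymbol;
    }

    public static void main(String[] args) {
        // Placeholder rows, not taken from the real mapping file.
        List<String> lines = Arrays.asList(
                "#fake_entrez_id\tsymbol",
                "-10000001\tMIRNA_A",
                "-10000002\tMIRNA_B");
        System.out.println(parse(lines));
    }
}
```

Rejecting non-negative and duplicate IDs at parse time keeps the fake IDs from colliding with real NCBI entries or with each other. Phosphoproteins would still need their own reserved range for this to fully remove the ordering dependence.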
@cbioportalpipelines