mirna: how to better handle it.

cBioPortal / icebox

very low priority issues


mirna: how to better handle it. #57

Open jjgao opened 5 years ago

jjgao commented 5 years ago

The fake negative Entrez gene IDs give us a lot of headaches. Let me document how we generate them here and see if we can solve the issue.

The miRNA ID mapping resides in the id-mapping-mirbase.txt file under reference-data in our main Mercurial repo. Also uploaded here.

This is the Java code to import this file.

Following the code, the importer uses DaoGene.addGeneWithoutEntrezGeneId.

Following the code again, you'll see getNextFakeEntrezId() is called to generate a negative ID.

Note that we also have negative Entrez gene IDs for phosphoproteins, so the IDs differ depending on which gets imported first. So basically these IDs are arbitrary: they depend on import order and are not stable across installations. :(
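To make the order dependence concrete, here is a minimal sketch. It is not the actual DaoGene code: the class `FakeIdOrderDemo`, the method `assignFakeId`, the counter starting at -1, and the `PHOSPHO_EXAMPLE` entry are all assumptions made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a gene table that gives each gene lacking a real
// Entrez gene ID the next unused negative ID. Not the actual DaoGene code.
class FakeIdOrderDemo {
    private int nextFakeId = -1; // assumption: counter starts at -1 and counts down
    private final Map<String, Integer> idBySymbol = new LinkedHashMap<>();

    // Returns the existing fake ID for a symbol, or assigns the next one.
    int assignFakeId(String symbol) {
        Integer existing = idBySymbol.get(symbol);
        if (existing != null) {
            return existing;
        }
        int id = nextFakeId--;
        idBySymbol.put(symbol, id);
        return id;
    }

    // Imports symbols in the given order into a fresh table; returns symbol -> ID.
    static Map<String, Integer> importInOrder(List<String> symbols) {
        FakeIdOrderDemo table = new FakeIdOrderDemo();
        for (String symbol : symbols) {
            table.assignFakeId(symbol);
        }
        return table.idBySymbol;
    }

    public static void main(String[] args) {
        // "PHOSPHO_EXAMPLE" is a made-up placeholder for a phosphoprotein entry.
        System.out.println(importInOrder(Arrays.asList("hsa-mir-454", "PHOSPHO_EXAMPLE")));
        System.out.println(importInOrder(Arrays.asList("PHOSPHO_EXAMPLE", "hsa-mir-454")));
    }
}
```

In this sketch the same miRNA gets -1 when it is imported first and -2 when it is imported second, which is the instability described above.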

One way to solve it is to move the miRNA IDs into a separate table and connect them with the entity table (the same for phosphoproteins).

Another, easier fix is to assign negative Entrez gene IDs in the id-mapping-mirbase.txt file itself, e.g. starting with -10000001.
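A rough sketch of the second option, assuming the mapping file has one entry per tab-delimited line (the real column layout is not shown in this thread) and that IDs count down from -10000001. The class `FixedMirnaIdAssigner`, the method `prependFixedIds`, and the second sample row are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: prepends a fixed negative ID column to each mapping row so the
// ID no longer depends on import order.
class FixedMirnaIdAssigner {
    static final int FIRST_FIXED_ID = -10000001; // starting value suggested above

    // Assumption: one mapping entry per non-blank line; blank lines are kept as-is.
    static List<String> prependFixedIds(List<String> rows) {
        List<String> out = new ArrayList<>();
        int nextId = FIRST_FIXED_ID;
        for (String row : rows) {
            if (row.trim().isEmpty()) {
                out.add(row);
                continue;
            }
            out.add(nextId + "\t" + row);
            nextId--; // assumption: IDs decrease by one per entry
        }
        return out;
    }

    public static void main(String[] args) {
        // The second row is a made-up placeholder, not a real mapping entry.
        List<String> rows = Arrays.asList(
                "mir-454/454\thsa-mir-454",
                "mir-example/0\thsa-mir-example");
        for (String line : prependFixedIds(rows)) {
            System.out.println(line);
        }
    }
}
```

Once the IDs are written into the file, they should only ever be appended to, never renumbered, so that existing databases stay consistent.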

@cbioportalpipelines

ritikakundra commented 5 years ago

Current documentation of how we handle miRNA: https://docs.google.com/document/d/1GeN2GHjSDFIc8hhl6v72JRoQbBw6szTRv7IDzcjXu5g/edit

jjgao commented 5 years ago

(copied from doc to here)

The Gene table in our database currently contains the miRNA to alias mapping maintained initially by JJ. This should not be updated with the new gene data.

The data sets contain the new MIRXXX naming. This is mapped to the pseudo miRNA that JJ maintains. Example:

MIR454 is present in our data files
This is one of the aliases mapped to the pseudo miRNA mir-454/454 in the database.
In the mapping table in reference-data/id-mapping-mirbase.txt, mir-454/454 is mapped to hsa-mir-454 mature and precursor miRNA.

Therefore, as long as our data files use the aliases (MIRXXX) and the gene table has the older miRNA entries, the code will import correctly and duplicate the entries that need to be duplicated.

We are duplicating for expression data and CNA data.
We are NOT duplicating for mutations.
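The lookup chain and the duplication rule above can be sketched roughly as follows. This is an illustration, not the importer's code: the class `MirnaDuplicationSketch`, the method `targetsFor`, and the labels "hsa-mir-454 (precursor)" and "hsa-mir-454 (mature)" are made up, and returning the single pseudo miRNA for mutations is an assumption (the doc only says mutations are not duplicated).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MirnaDuplicationSketch {
    enum DataType { EXPRESSION, CNA, MUTATION }

    // Alias in the data files -> pseudo miRNA in the gene table.
    static final Map<String, String> ALIAS_TO_PSEUDO = new HashMap<>();
    // Pseudo miRNA -> miRBase entries (stand-in for id-mapping-mirbase.txt).
    static final Map<String, List<String>> PSEUDO_TO_MIRBASE = new HashMap<>();
    static {
        ALIAS_TO_PSEUDO.put("MIR454", "mir-454/454");
        // Placeholder labels; the real file's identifiers are not shown here.
        PSEUDO_TO_MIRBASE.put("mir-454/454",
                Arrays.asList("hsa-mir-454 (precursor)", "hsa-mir-454 (mature)"));
    }

    // Returns the entries a data row for `symbol` would be stored under.
    static List<String> targetsFor(String symbol, DataType type) {
        String pseudo = ALIAS_TO_PSEUDO.get(symbol);
        if (pseudo == null) {
            return Collections.singletonList(symbol); // not a known miRNA alias
        }
        if (type == DataType.MUTATION) {
            // Assumption: mutations stay on the single pseudo miRNA entry.
            return Collections.singletonList(pseudo);
        }
        // Expression and CNA rows are duplicated across the mapped entries.
        return PSEUDO_TO_MIRBASE.getOrDefault(pseudo, Collections.singletonList(pseudo));
    }

    public static void main(String[] args) {
        for (DataType type : DataType.values()) {
            System.out.println(type + " -> " + targetsFor("MIR454", type));
        }
    }
}
```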

Since the gene table has been reverted to older miRNAs, we just need to reimport all the studies (especially expression data).

sheridancbio commented 5 years ago

Comment: I was looking at the documentation for cBioPortal setup and tracing through the code for the microRNA importing with @yichaoS just now. The functionality described above (in ImportMicroRNAIDs.java) is called by ImportGeneData.java. But I don't see any reference in the current documentation telling users who are installing the portal to run ImportGeneData. Instead, users are told to initialize their database from the seed database, importing it directly with the mysql command line tool. That import does contain the positively numbered microRNA gene entries from NCBI but does not contain the negatively numbered microRNAs. So if we want remote users to be able to install/configure a gene database that contains the negatively numbered microRNA gene entries (and aliases), we should probably update the documentation to show how to run this code. There is a Perl script for calling ImportMicroRNAIDs.java as a standalone module, but the main function is currently disconnected, so it would not function properly. So a little tuning of the code may also be needed.