cBioPortal / icebox

very low priority issues

Study loader warns user if Entrez_Id is not found, even for non-coding genes #552

Open forus opened 9 months ago

forus commented 9 months ago

I loaded the chol_tcga_pan_can_atlas_2018 study into cBioPortal (hg19_hg38_v2.13.0 seed data).

I got a lot of warnings like the following:

Warnings / Errors:
-------------------
0.  Entrez_Id 100033819 not found. Record will be skipped for this gene.; 1x
1.  Entrez_Id 100093698 not found. Record will be skipped for this gene.; 1x
2.  Entrez_Id 100101148 not found. Record will be skipped for this gene.; 1x
...

I investigated these genes further. 2130 Entrez IDs in the chol_tcga_pan_can_atlas_2018 data could not be found in the current cBioPortal database.

See this table for a detailed list of the Entrez IDs and their classification: chol_tcga_pan_can_atlas_2018_missing_entrez_ids_classified.txt

For non-coding genes, I doubt this information should be reported as warnings, and certainly not this verbosely (one line per Entrez ID per file). Doing so risks devaluing warnings as a concept: people start ignoring them altogether.

Filtering out this data seems like the straightforward thing to do.

As a user, I would like a short informational message (not classified as a warning or error) stating how many records were skipped because they were associated with non-coding genes.
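To illustrate the kind of short summary meant here, below is a minimal sketch that collapses the per-ID warning lines into a single message. It assumes the warning format quoted above; the function name `summarize_missing` and the wording of the summary line are hypothetical and not part of the cBioPortal loader.

```python
import re
from collections import Counter

# Matches the warning format quoted in this issue; the trailing "Nx" is the repeat count.
WARNING_RE = re.compile(
    r"Entrez_Id (\d+) not found\. Record will be skipped for this gene\.; (\d+)x"
)

def summarize_missing(lines):
    """Return (distinct_ids, total_records) for 'Entrez_Id not found' warnings."""
    counts = Counter()
    for line in lines:
        match = WARNING_RE.search(line)
        if match:
            counts[int(match.group(1))] += int(match.group(2))
    return len(counts), sum(counts.values())

sample = [
    "0.  Entrez_Id 100033819 not found. Record will be skipped for this gene.; 1x",
    "1.  Entrez_Id 100093698 not found. Record will be skipped for this gene.; 1x",
    "2.  Entrez_Id 100101148 not found. Record will be skipped for this gene.; 1x",
]
distinct, total = summarize_missing(sample)
print(f"Skipped {total} records for {distinct} Entrez IDs not found in the database.")
```

This only counts the skipped records; restricting the count to non-coding genes would additionally need a gene-type lookup.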

forus commented 9 months ago

A few ideas for making this possible:

The cBioPortal study loader should know which genes are non-coding. One way to achieve that is to query Genome Nexus. The gene list should also be more complete and include deprecated genes, uncharacterized ncRNAs, and pseudogenes, so that we know which records can be safely skipped.
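A rough sketch of how the loader could use such a gene list: split the missing IDs into those that can be skipped silently and those that still deserve a warning. Everything here is an assumption for illustration: the `gene_types` mapping stands in for whatever Genome Nexus or an extended gene table would return, the type labels in `NON_CODING_TYPES` are placeholders, and the Entrez IDs are synthetic.

```python
from collections import Counter

# Assumed labels for gene types that are safe to skip silently (hypothetical set).
NON_CODING_TYPES = {"ncRNA", "pseudogene", "deprecated"}

def partition_missing(missing_ids, gene_types):
    """Split missing IDs into per-type skip counts and IDs that still need a warning."""
    skipped = Counter()
    warn = []
    for entrez_id in missing_ids:
        gene_type = gene_types.get(entrez_id)  # None if the gene is unknown
        if gene_type in NON_CODING_TYPES:
            skipped[gene_type] += 1
        else:
            warn.append(entrez_id)
    return skipped, warn

# Synthetic IDs and made-up classifications, for illustration only.
gene_types = {900000001: "ncRNA", 900000002: "pseudogene"}
skipped, warn = partition_missing([900000001, 900000002, 900000003], gene_types)

print(f"Skipped {sum(skipped.values())} records for non-coding genes: {dict(skipped)}")
for entrez_id in warn:
    print(f"Warning: Entrez_Id {entrez_id} not found.")
```

With this split, unknown IDs still produce warnings, while records for known non-coding genes are only counted.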