CRI-iAtlas / iatlas-data

MOVED TO GITLAB -> https://gitlab.com/cri-iatlas/iatlas-data.git

Redo gene ids #76

Closed andrewelamb closed 4 years ago

andrewelamb commented 4 years ago

The database will take its Entrez-to-HUGO mapping from this file:

https://www.synapse.org/#!Synapse:syn21788372

All other data sources that make use of genes should be mapped to Entrez IDs with a mapping appropriate to the project they came from.

All non-driver-mutation TCGA files should use the mapping provided in the TCGA expression file:

https://www.synapse.org/#!Synapse:syn4976369


One gene in the expression file has two entries with different Entrez IDs:

  hugo     entrez  TCGA-OR-A5J1-01A-11R-A29S-07
1 SLC35E2  728661  3293.2000
2 SLC35E2    9906    35.3314

In cases where another TCGA file makes use of SLC35E2 but does not have an associated Entrez ID, use 9906, which will map to SLC35E2A in the future.
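The lookup rule described above could be sketched roughly as follows. This is only an illustration, not the code used in iatlas-data: the function names, the `AMBIGUOUS_DEFAULTS` table, and the in-memory `EXPRESSION_ROWS` sample (including the TP53 row) are assumptions made for the example; in practice the rows would come from the expression file on Synapse.

```python
from collections import defaultdict

# Illustrative (hugo, entrez) pairs standing in for the first two columns
# of the TCGA expression file. Only the SLC35E2 rows come from this issue.
EXPRESSION_ROWS = [
    ("SLC35E2", 728661),
    ("SLC35E2", 9906),
    ("TP53", 7157),
]

# Hypothetical table of defaults for symbols that have more than one
# Entrez ID in the expression file.
AMBIGUOUS_DEFAULTS = {"SLC35E2": 9906}


def build_hugo_to_entrez(rows):
    """Group Entrez IDs by HUGO symbol so ambiguous symbols are visible."""
    mapping = defaultdict(set)
    for hugo, entrez in rows:
        mapping[hugo].add(entrez)
    return mapping


def resolve_entrez(hugo, mapping, entrez=None):
    """Return the Entrez ID for a gene in a non-driver TCGA file.

    An Entrez ID already present in the file wins. Otherwise the symbol is
    looked up in the expression-file mapping, falling back to the agreed
    default when the symbol is ambiguous. Returns None if nothing matches.
    """
    if entrez is not None:
        return entrez
    candidates = mapping.get(hugo, set())
    if len(candidates) == 1:
        return next(iter(candidates))
    return AMBIGUOUS_DEFAULTS.get(hugo)


mapping = build_hugo_to_entrez(EXPRESSION_ROWS)
print(resolve_entrez("SLC35E2", mapping))          # no ID in the file
print(resolve_entrez("SLC35E2", mapping, 728661))  # ID supplied by the file
```

Keeping the set of candidate IDs per symbol, rather than a plain dictionary, means a second ambiguous symbol would return None instead of silently picking one ID.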


For driver mutations, these 14 genes do not appear in the expression file:

   entrez hgnc
    <int> <chr>
 1  84962 AJUBA
 2 139285 AMER1
 3   2909 ARHGAP35
 4   3125 HLA-DRB3
 5   3126 HLA-DRB4
 6 284058 KANSL1
 7   3803 KIR2DL2
 8   4297 KMT2A
 9   9757 KMT2B
10  58508 KMT2C
11   8085 KMT2D
12  57466 SCAF4
13   6427 SRSF2
14   7114 TMSB4X

For these 14 genes, use the Entrez IDs that Shane found (listed above). This may result in Entrez IDs with more than one mutation. This is OK if the mutations have different mutation codes. If they have identical mutation codes, we won't be able to include both, so drop the mutation that corresponds to one of the above genes.
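A minimal sketch of that de-duplication rule, assuming each driver mutation is a `(hugo, entrez, mutation_code)` tuple. The function name, the record layout, the `GENE_X` symbol, and the mutation codes in the usage lines are hypothetical; only the 14 symbol/ID pairs come from the list above.

```python
# Entrez IDs for the 14 driver-mutation genes missing from the expression file.
DRIVER_ONLY_ENTREZ = {
    "AJUBA": 84962, "AMER1": 139285, "ARHGAP35": 2909, "HLA-DRB3": 3125,
    "HLA-DRB4": 3126, "KANSL1": 284058, "KIR2DL2": 3803, "KMT2A": 4297,
    "KMT2B": 9757, "KMT2C": 58508, "KMT2D": 8085, "SCAF4": 57466,
    "SRSF2": 6427, "TMSB4X": 7114,
}


def dedupe_driver_mutations(records):
    """Assign Entrez IDs to the listed genes and drop colliding mutations.

    records: iterable of (hugo, entrez, mutation_code); entrez may be None
    for the listed genes. Two records collide when they share both Entrez ID
    and mutation code; in that case the record from a listed gene is dropped.
    """
    resolved = []
    for hugo, entrez, code in records:
        listed = hugo in DRIVER_ONLY_ENTREZ
        if listed:
            entrez = DRIVER_ONLY_ENTREZ[hugo]
        resolved.append((hugo, entrez, code, listed))

    # (entrez, code) pairs already claimed by genes from the expression file.
    taken = {(e, c) for _, e, c, listed in resolved if not listed}

    kept = []
    for hugo, entrez, code, listed in resolved:
        if listed:
            if (entrez, code) in taken:
                continue  # identical mutation code: drop the listed gene's record
            taken.add((entrez, code))
        kept.append((hugo, entrez, code))
    return kept


# GENE_X and the mutation codes are made up for illustration.
records = [
    ("GENE_X", 4297, "A1B"),
    ("KMT2A", None, "A1B"),  # same Entrez ID and code as GENE_X: dropped
    ("KMT2A", None, "C2D"),  # same Entrez ID, different code: kept
]
print(dedupe_driver_mutations(records))
```

Records from genes that are already in the expression file are never dropped in this sketch; whether collisions between two such records also need handling is not covered by the issue.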