bmeg / bmeg-etl

ETL configuration for BMEG

graph connectivity results #212

Closed adamstruck closed 6 years ago

adamstruck commented 6 years ago
| # Unconnected | Edge Label |
|---|---|
| 79956 | "AlleleIn" |
| 444 | "CallsetFor" |
| 27825 | "DrugResponseIn" |
| 421326 | "ExonFor" |
| 511 | "ExpressionOf" |
| 30061 | "GeneOntologyAnnotation" |
| 607 | "HasAlleleFeature" |
| 13 | "HasGeneFeature" |
| 28693 | "HasSupportingReference" |
| 2 | "MinimalAlleleIn" |
| 5996 | "PFAMClanMember" |
| 34405 | "ProteinFor" |
| 138232 | "StructureFor" |
| 2719 | "TranscriptFor" |

adamstruck commented 6 years ago

770,790 / 13,564,352 edges reference a vertex that does not exist
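
As a quick sanity check, the total can be reproduced from the per-label counts posted at the top of this issue. This is only a sketch: the dictionary is copied by hand from that listing, and `unconnected_summary` is a hypothetical helper name, not part of bmeg-etl.

```python
# Per-label counts copied by hand from the first comment in this issue.
unconnected = {
    "AlleleIn": 79956,
    "CallsetFor": 444,
    "DrugResponseIn": 27825,
    "ExonFor": 421326,
    "ExpressionOf": 511,
    "GeneOntologyAnnotation": 30061,
    "HasAlleleFeature": 607,
    "HasGeneFeature": 13,
    "HasSupportingReference": 28693,
    "MinimalAlleleIn": 2,
    "PFAMClanMember": 5996,
    "ProteinFor": 34405,
    "StructureFor": 138232,
    "TranscriptFor": 2719,
}
TOTAL_EDGES = 13_564_352  # total edge count quoted in the comment above


def unconnected_summary(counts, total_edges):
    """Return (number of unconnected edges, fraction of all edges)."""
    missing = sum(counts.values())
    return missing, missing / total_edges


missing, fraction = unconnected_summary(unconnected, TOTAL_EDGES)
print(f"{missing:,} / {TOTAL_EDGES:,} edges ({fraction:.2%}) reference a missing vertex")
```

The per-label counts add up to the 770,790 quoted above, which is a bit under 6% of all edges.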

bwalsh commented 6 years ago

Re. "HasAlleleFeature" - the target allele should be present in the new vertex file:

{"From":"G2PAssociation:6801ec648545fc2e6a6a9af1370ee46249e12256","Gid":"(G2PAssociation:6801ec648545fc2e6a6a9af1370ee46249e12256)--HasAlleleFeature-\u003e(Allele:b386107073e53ba055c55b25f9d3b2eba4189f74)","Label":"HasAlleleFeature","To":"Allele:b386107073e53ba055c55b25f9d3b2eba4189f74","level":"error","msg":"To does not exist","time":"2018-09-26T21:45:35Z"}

grep b386107073e53ba055c55b25f9d3b2eba4189f74 Allele.Vertex.json
{"_id": "Allele:b386107073e53ba055c55b25f9d3b2eba4189f74", "gid": "Allele:b386107073e53ba055c55b25f9d3b2eba4189f74", "label": "Allele", "data": {"genome": "GRCh37", "chromosome": "12", "start": 57865493, "end": 57865493, "reference_bases": "A", "alternate_bases": "G", "annotations": {"maf": null, "mc3": null, "ccle": null, "myvariantinfo": null}}}
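
The same check can be done for a whole error log rather than one grep at a time. The sketch below assumes the vertex file has one JSON object per line with a `gid` field, and that the checker output has one JSON object per line with `To` and `msg` fields, as in the records quoted above. `load_gids` and `still_missing` are hypothetical helper names, and the sample records are trimmed copies of the two lines above.

```python
import json


def load_gids(vertex_lines):
    """Collect the gid of every vertex record (one JSON object per line)."""
    return {json.loads(line)["gid"] for line in vertex_lines if line.strip()}


def still_missing(error_lines, known_gids):
    """Return the `To` gids from 'To does not exist' errors that are absent from known_gids."""
    missing = []
    for line in error_lines:
        if not line.strip():
            continue
        record = json.loads(line)
        if record.get("msg") == "To does not exist" and record["To"] not in known_gids:
            missing.append(record["To"])
    return missing


# Trimmed copies of the vertex and error records quoted above.
vertex_lines = [
    '{"gid": "Allele:b386107073e53ba055c55b25f9d3b2eba4189f74", "label": "Allele"}',
]
error_lines = [
    '{"Label": "HasAlleleFeature", '
    '"To": "Allele:b386107073e53ba055c55b25f9d3b2eba4189f74", '
    '"msg": "To does not exist"}',
]

missing = still_missing(error_lines, load_gids(vertex_lines))
print(missing)  # an empty list means every reported target exists in the vertex file
```

In practice the two lists would be replaced by open file handles, for example `open("Allele.Vertex.json")` and the checker's output file.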
bwalsh commented 6 years ago

Re. "HasGeneFeature" - unclear why it doesn't exist

{"From":"G2PAssociation:56a202bd7a674a4f8d36620b2c6fcdc869adba8e","Gid":"(G2PAssociation:56a202bd7a674a4f8d36620b2c6fcdc869adba8e)--HasGeneFeature-\u003e(Gene:ENSG00000130600)","Label":"HasGeneFeature","To":"Gene:ENSG00000130600","level":"error","msg":"To does not exist","time":"2018-09-26T21:45:35Z"}

http://uswest.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000130600;r=11:1995176-2001470

shows it exists in build 37

spot checking

{"From":"G2PAssociation:1b93d3c9fd981f432456a061db01129a8ec5735d","Gid":"(G2PAssociation:1b93d3c9fd981f432456a061db01129a8ec5735d)--HasGeneFeature-\u003e(Gene:ENSG00000270141)","Label":"HasGeneFeature","To":"Gene:ENSG00000270141","level":"error","msg":"To does not exist","time":"2018-09-26T21:45:35Z"} 

https://uswest.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000270141;r=3:169764520-169765060;t=ENST00000602385

shows the same result

bwalsh commented 6 years ago

Re. DrugResponseIn

{"From":"DrugResponse:(5Z)-7-Oxozeaenol:A253_UPPER_AERODIGESTIVE_TRACT","Gid":"(DrugResponse:(5Z)-7-Oxozeaenol:A253_UPPER_AERODIGESTIVE_TRACT)--DrugResponseIn-\u003e(Biosample:A253_UPPER_AERODIGESTIVE_TRACT)","Label":"DrugResponseIn","To":"Biosample:A253_UPPER_AERODIGESTIVE_TRACT","level":"error","msg":"To does not exist","time":"2018-09-26T21:45:38Z"}

This looks similar to #203

adamstruck commented 6 years ago

Updated results from today (includes data generated by #213)

| # Unconnected | Edge Label |
|---|---|
| 11345 | "AlleleIn" |
| 444 | "CallsetFor" |
| 27825 | "DrugResponseIn" |
| 511 | "ExpressionOf" |
| 28641 | "GeneOntologyAnnotation" |
| 10 | "HasGeneFeature" |
| 28693 | "HasSupportingReference" |
| 2 | "MinimalAlleleIn" |
| 5996 | "PFAMClanMember" |
| 21500 | "ProteinFor" |
| 138232 | "StructureFor" |

adamstruck commented 6 years ago

Re: "AlleleIn" edges

These 11,345 edges reference 256 gene vertices that do not exist
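
A figure like this (edges versus distinct missing targets) can be derived from the checker's error records. The following is a minimal sketch, assuming each record is a dict with `Label` and `To` keys as in the log lines quoted earlier in this thread. `missing_target_summary` is a hypothetical name, and the sample records use placeholder gids rather than real data.

```python
from collections import defaultdict


def missing_target_summary(records):
    """Map each edge label to (number of edges, number of distinct missing `To` gids)."""
    edges = defaultdict(int)
    targets = defaultdict(set)
    for record in records:
        edges[record["Label"]] += 1
        targets[record["Label"]].add(record["To"])
    return {label: (edges[label], len(targets[label])) for label in edges}


# Placeholder records for illustration only; these gids are made up.
records = [
    {"Label": "AlleleIn", "To": "Gene:PLACEHOLDER_1"},
    {"Label": "AlleleIn", "To": "Gene:PLACEHOLDER_1"},
    {"Label": "AlleleIn", "To": "Gene:PLACEHOLDER_2"},
]

summary = missing_target_summary(records)
print(summary["AlleleIn"])  # (edge count, distinct missing vertices)
```

Many edges pointing at few vertices, as reported above, suggests fixing a small set of vertices would reconnect most of the edges.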

bwalsh commented 6 years ago

Re. MinimalAlleleIn: these edges are also pointing at genes without a GRCh37 equivalent.

bwalsh commented 6 years ago
| # Unconnected (before) | Edge Label | # Unconnected (after) | Improvement (neg. is good) | Comment |
|---|---|---|---|---|
| 79956 | "AlleleIn" | 11898 | -68058 | genes |
| 444 | "CallsetFor" | 115 | -329 | ccle_callset |
| 27825 | "DrugResponseIn" | 1325 | -26500 | ccle_callset ? |
| 421326 | "ExonFor" | 0 | -421326 | done! |
| 511 | "ExpressionOf" | 337 | -174 | ccle_callset ? |
| 30061 | "GeneOntologyAnnotation" | 28641 | -1420 | |
| 607 | "HasAlleleFeature" | 0 | -607 | done! |
| 13 | "HasGeneFeature" | 10 | -3 | |
| 28693 | "HasSupportingReference" | 28693 | 0 | |
| 2 | "MinimalAlleleIn" | 2 | 0 | |
| 5996 | "PFAMClanMember" | 5996 | 0 | |
| | PhenotypeOf | 7 | 7 | |
| 34405 | "ProteinFor" | 21500 | -12905 | |
| 138232 | "StructureFor" | 138232 | 0 | |
| 2719 | "TranscriptFor" | 0 | -2719 | done! |

bwalsh commented 6 years ago

grep "does not exist" check-graph.out | jq -c 'select(.level == "error" and .msg == "To does not exist")' | jq .Label | sort | uniq -c
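
For anyone without jq at hand, the pipeline above can be approximated in Python. This is a sketch rather than a drop-in replacement: `count_unconnected` is a hypothetical name, non-JSON lines are skipped silently (jq would stop with a parse error instead), and the sample lines are trimmed copies of records quoted earlier in this thread.

```python
import json
from collections import Counter


def count_unconnected(log_lines):
    """Count 'To does not exist' errors per edge label, like `jq .Label | sort | uniq -c`."""
    counts = Counter()
    for line in log_lines:
        if "does not exist" not in line:  # same pre-filter as the grep
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON log line; skipped in this sketch
        if not isinstance(record, dict):
            continue
        if record.get("level") == "error" and record.get("msg") == "To does not exist":
            counts[record["Label"]] += 1
    return counts


# Trimmed copies of records quoted earlier in this thread.
log_lines = [
    '{"Label": "HasGeneFeature", "To": "Gene:ENSG00000130600", "level": "error", "msg": "To does not exist"}',
    '{"Label": "HasGeneFeature", "To": "Gene:ENSG00000270141", "level": "error", "msg": "To does not exist"}',
    '{"Label": "DrugResponseIn", "To": "Biosample:A253_UPPER_AERODIGESTIVE_TRACT", "level": "error", "msg": "To does not exist"}',
    "some unrelated log line",
]

counts = count_unconnected(log_lines)
for label, n in sorted(counts.items()):
    print(n, label)
```

Passing `open("check-graph.out")` instead of the sample list would process the real checker output line by line.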

 Unconnected Edges

| new_count | new_label | old_count | delta | Comment |
|---|---|---|---|---|
| 9759 | COCAClusterFor | | 9759 | Individual |
| 11898 | AlleleIn | 11345 | 553 | Gene |
| 29178 | HasSupportingReference | 28693 | 485 | Pubs |
| 10 | HasGeneFeature | 10 | 0 | Gene |
| 2 | MinimalAlleleIn | 2 | 0 | Gene |
| 21500 | ProteinFor | 21500 | 0 | Transcript |
| | "CallsetFor" | 444 | | |
| | "DrugResponseIn" | 27825 | | |
| | "ExpressionOf" | 511 | | |
| 28565 | GeneOntologyAnnotation | 28641 | -76 | Gene |
| | "PFAMClanMember" | 5996 | | |
| | "StructureFor" | 138232 | | |

comments

$ grep TCGA-ZJ-AAXT outputs/gdc/Individual.Vertex.json
{"_id": "Individual:5916af71-0262-42cf-af4b-eac830a8b419", "gid": "Individual:5916af71-0262-42cf-af4b-eac830a8b419", "label": "Individual", "data": {"individual_id": "5916af71-0262-42cf-af4b-eac830a8b419", "gdc_attributes": {"diagnoses": [{"classification_of_tumor": "not


### summary

| aliquot | biosample | individual | project_id |
|---|---|---|---|
| 1 | 1 | 1 | CCLE:ADRENAL_CORTEX |
| 33 | 33 | 33 | CCLE:AUTONOMIC_GANGLIA |
| 14 | 14 | 14 | CCLE:BILIARY_TRACT |
| 2 | 2 | 2 | CCLE:BLADDER |
| 78 | 78 | 78 | CCLE:BONE |
| 93 | 93 | 93 | CCLE:BREAST |
| 1 | 1 | 1 | CCLE:BUCCAL |
| 91 | 91 | 91 | CCLE:CENTRAL_NERVOUS_SYSTEM |
| 16 | 16 | 16 | CCLE:CERVIX |
| 30 | 30 | 30 | CCLE:ENDOMETRIUM |
| 4 | 4 | 4 | CCLE:ENGINEERED |
| 5 | 5 | 5 | CCLE:ESOPHAGUS |
| 2 | 2 | 2 | CCLE:FIBROBLAST |
| 1 | 1 | 1 | CCLE:GASTROINTESTINAL_TRACT |
| 269 | 269 | 269 | CCLE:HAEMATOPOIETIC_AND_LYMPHOID_TISSUE |
| 17 | 17 | 17 | CCLE:HEAD_AND_NECK |
| 57 | 57 | 57 | CCLE:KIDNEY |
| 77 | 77 | 77 | CCLE:LARGE_INTESTINE |
| 28 | 28 | 28 | CCLE:LIVER |
| 272 | 272 | 272 | CCLE:LUNG |
| 40 | 40 | 40 | CCLE:MATCHED_NORMAL_TISSUE |
| 26 | 26 | 26 | CCLE:NERVOUS_SYSTEM |
| 33 | 33 | 33 | CCLE:OESOPHAGUS |
| 68 | 68 | 68 | CCLE:OVARY |
| 57 | 57 | 57 | CCLE:PANCREAS |
| 2 | 2 | 2 | CCLE:PLACENTA |
| 11 | 11 | 11 | CCLE:PLEURA |
| 10 | 10 | 10 | CCLE:PROSTATE |
| 2 | 2 | 2 | CCLE:SALIVARY_GLAND |
| 108 | 108 | 108 | CCLE:SKIN |
| 1 | 1 | 1 | CCLE:SMALL_INTESTINE |
| 69 | 69 | 69 | CCLE:SOFT_TISSUE |
| 45 | 45 | 45 | CCLE:STOMACH |
| 3 | 3 | 3 | CCLE:TESTIS |
| 19 | 19 | 19 | CCLE:THYROID |
| 45 | 45 | 45 | CCLE:UPPER_AERODIGESTIVE_TRACT |
| 38 | 38 | 38 | CCLE:URINARY_TRACT |
| 4 | 4 | 4 | CCLE:UVEA |
| 1 | 1 | 1 | CCLE:VULVA |
| 18004 | 18004 | 18004 | FM-AD |
| 2525 | 1892 | 897 | TARGET-AML |
| 40 | 27 | 13 | TARGET-CCSK |
| 1854 | 1686 | 832 | TARGET-NBL |
| 542 | 452 | 265 | TARGET-OS |
| 255 | 134 | 75 | TARGET-RT |
| 998 | 785 | 651 | TARGET-WT |
| 793 | 184 | 92 | TCGA-ACC |
| 4082 | 845 | 412 | TCGA-BLCA |
| 14483 | 2293 | 1098 | TCGA-BRCA |
| 3055 | 619 | 307 | TCGA-CESC |
| 335 | 115 | 51 | TCGA-CHOL |
| 8442 | 990 | 461 | TCGA-COAD |
| 560 | 116 | 58 | TCGA-DLBC |
| 2053 | 377 | 185 | TCGA-ESCA |
| 12169 | 1181 | 617 | TCGA-GBM |
| 5696 | 1123 | 528 | TCGA-HNSC |
| 808 | 226 | 113 | TCGA-KICH |
| 9175 | 1100 | 537 | TCGA-KIRC |
| 3494 | 614 | 291 | TCGA-KIRP |
| 1774 | 697 | 200 | TCGA-LAML |
| 5401 | 1050 | 516 | TCGA-LGG |
| 3818 | 796 | 377 | TCGA-LIHC |
| 7113 | 1301 | 585 | TCGA-LUAD |
| 7401 | 1082 | 504 | TCGA-LUSC |
| 749 | 174 | 87 | TCGA-MESO |
| 14530 | 1210 | 605 | TCGA-OV |
| 2114 | 377 | 185 | TCGA-PAAD |
| 1659 | 366 | 179 | TCGA-PCPG |
| 5013 | 1063 | 500 | TCGA-PRAD |
| 3159 | 351 | 172 | TCGA-READ |
| 2606 | 530 | 261 | TCGA-SARC |
| 4735 | 945 | 470 | TCGA-SKCM |
| 5072 | 940 | 443 | TCGA-STAD |
| 1326 | 306 | 150 | TCGA-TGCT |
| 5475 | 1053 | 507 | TCGA-THCA |
| 883 | 250 | 124 | TCGA-THYM |
| 7306 | 1128 | 560 | TCGA-UCEC |
| 570 | 114 | 57 | TCGA-UCS |
| 700 | 160 | 80 | TCGA-UVM |
| 15598 | 15598 | 752 | gtex |

(80 rows)

adamstruck commented 6 years ago

The COCACluster vertices/edges should probably not be loaded into the graph. That was a data source we were thinking of using for CEDAR.

adamstruck commented 6 years ago

I think the disconnected ProteinFor edges are pointing to Transcript records on scaffolds. If that is the case, this issue could be resolved by changing the source GFF3 for the ensembl transform to ftp://ftp.ensembl.org/pub/grch37/update/gff3/homo_sapiens/Homo_sapiens.GRCh37.87.chr_patch_hapl_scaff.gff3.gz. I will run this later today to confirm.

adamstruck commented 6 years ago

Switching to Homo_sapiens.GRCh37.87.chr_patch_hapl_scaff.gff3.gz reduced the number of unconnected edges from 21500 to 17665. It looks like the issue here is that the source used to generate ProteinFor edges is for GRCh38.

Here is one example:

{"From":"Protein:ENSP00000483438","Gid":"(Protein:ENSP00000483438)--ProteinFor->(Transcript:ENST00000619796)","Label":"ProteinFor","To":"Transcript:ENST00000619796","level":"error","msg":"To does not exist","time":"2018-10-11T11:27:52-07:00"}
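
One way to confirm which ProteinFor targets are absent from a given GFF3 is to compare transcript IDs directly. The sketch below assumes the Ensembl GFF3 convention of an `ID=transcript:ENST...` attribute in column 9 and edge targets of the form `Transcript:ENST...` as in the record above. The helper names are hypothetical, and the sample GFF3 line is a made-up placeholder, not a real annotation.

```python
def transcript_ids_from_gff3(lines):
    """Collect transcript IDs from `ID=transcript:...` attributes in GFF3 column 9."""
    ids = set()
    for line in lines:
        if line.startswith("#"):
            continue  # header and directive lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # not a feature line
        for attribute in columns[8].split(";"):
            key, _, value = attribute.partition("=")
            if key == "ID" and value.startswith("transcript:"):
                ids.add(value[len("transcript:"):])
    return ids


def targets_not_in_gff3(edge_targets, transcript_ids):
    """Return `Transcript:<id>` targets whose id is absent from the GFF3."""
    prefix = "Transcript:"
    return [t for t in edge_targets if t[len(prefix):] not in transcript_ids]


# Placeholder feature line for illustration only (made-up coordinates and ID).
gff3_lines = [
    "##gff-version 3\n",
    "1\tensembl\tmRNA\t100\t200\t.\t+\t.\tID=transcript:ENST_PLACEHOLDER;Parent=gene:ENSG_PLACEHOLDER\n",
]
edge_targets = ["Transcript:ENST_PLACEHOLDER", "Transcript:ENST00000619796"]

absent = targets_not_in_gff3(edge_targets, transcript_ids_from_gff3(gff3_lines))
print(absent)
```

For the real file, `gzip.open(path, "rt")` from the standard library would yield the lines of the `.gff3.gz` download.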
bwalsh commented 6 years ago

Drugs and diseases that have no official ontology

Compounds

$ cat  outputs/compound/normalized.Compound.Vertex.json | jq -r .gid | grep ONTOLOGY | sed s/Compound:NO_ONTOLOGY~//  | sort
681640
81C6
81c6
9 immunoamino camptnetecin
9AC 9 Aminocamplotecian
A202171 Protocol
A4QN
ACT PEP3 KLH
AE 788
ATTAC
AZD
Anti necplatens
BCG
BX796
CAI (NABTT 9712)
CAI (NABTT 97212)
CAI NABIT 9712
CCNG
CMK
Chemo, Multi-Agent, NOS
Chemo, NOS
FMK
GOG 218
GOG182
Genentech Cpd 10
Gliadle Wafer
HG-5-113-01
HG-5-88-01
High dose ara-c
ICT-107
IL 12
IL 13
IOX2
JQ1 (1)
JQ1 (2)
JQ12
JW-7-24-1
JW-7-52-1
KIN001-236
KIN001-244
KIN001-260
KIN001-266
KIN001-270
Lymphocyte Infusion
MAB I 131
MAB I-131
MAB I131
MABI131
MAGI131-81c6
MAb I-131
MEDT 575
MPS-1-IN-1
MU81C6
Mayo 425-20
NPK76-II-72-1
Not otherwise specified
Not specified
O6B6
O6BG
PC2
PCB
PEP3 KLH
PLFE
Poly LCLC
QL-VIII-58
QL-X-138
QL-XI-92
QL-XII-47
QL-XII-61
QS11
R04929097
RBBX 01
Ras Inhibitor
Rhumab/Ind
SB52334
SCH 58500
SCH63666
SCH6636
Sovatenib
TL-1-85
TL-2-105
Temozolomoide
VNLG/124
Vamydex
WZ-1-84
WZ3105
XMD11-85h
XMD13-2
XMD14-99
XMD15-27
XMD8-85
ZG-10
anti neopastons
ch81c6
dcVax
mu81c6
rec MAGE 3-AS + AS15 ACS1 / Placebo Vaccine
rec MAGE3-AS+AS15 ASCI vs Placebo
recMAGE3-AS+AS15 ASCI/Placebo vaccine
recPRAME+AS15
recPRAME+AS15 ASCI

Phenotypes

$ cat  outputs/phenotype/normalized.Phenotype.Vertex.json | jq -r .gid | grep ONTOLOGY | sed s/Phenotype:NO_ONTOLOGY~//  | sort
Ewings_sarcoma-peripheral_primitive_neuroectodermal_tumour
giant_cell_tumour
immortalized_epithelial
immortalized_fibroblast
rhabdoid_tumour
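
The two shell pipelines above can also be written as one small Python function. This is a sketch under the assumption that each vertex file holds one JSON object per line with a `gid` field. `names_without_ontology` is a hypothetical name, the sample gid `Compound:PLACEHOLDER:1` is made up, and matching on the full prefix is slightly stricter than `grep ONTOLOGY`.

```python
import json


def names_without_ontology(vertex_lines, prefix):
    """Return sorted names from gids that start with the given NO_ONTOLOGY prefix."""
    names = []
    for line in vertex_lines:
        if not line.strip():
            continue
        gid = json.loads(line)["gid"]
        if gid.startswith(prefix):
            names.append(gid[len(prefix):])
    # Note: Python sorts by code point; shell `sort` order depends on the locale.
    return sorted(names)


# Two gids built from names in the compound list above, plus one made-up placeholder.
vertex_lines = [
    '{"gid": "Compound:NO_ONTOLOGY~BCG"}',
    '{"gid": "Compound:PLACEHOLDER:1"}',
    '{"gid": "Compound:NO_ONTOLOGY~81C6"}',
]

compounds = names_without_ontology(vertex_lines, "Compound:NO_ONTOLOGY~")
print(compounds)
```

Calling it with the prefix `Phenotype:NO_ONTOLOGY~` on the phenotype vertex file would give the second list.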
adamstruck commented 6 years ago

Remaining items are being tracked in specific issues.