RTXteam / RTX-KG2

Build system for the RTX-KG2 biomedical knowledge graph, part of the ARAX reasoning system (https://github.com/RTXTeam/RTX)
MIT License

ETL CTD gene-disease relationships into KG2 #39

Open saramsey opened 4 years ago

saramsey commented 4 years ago

@zheng-liu reports that there are 77.7 million (!) gene-disease relationships in CTD.

zheng-liu commented 4 years ago

This astronomical number arises because the CTD database includes both curated and inferred gene-disease relations. Please refer to the following documentation from their official website:

Gene–Disease Associations

Gene-disease associations may be inferred via curated chemical-gene and chemical-disease associations.

CTD contains curated and inferred gene–disease associations. Curated gene–disease associations are extracted from the published literature by CTD biocurators, or are derived from the OMIM database using the mim2gene file from the NCBI Gene database. Inferred associations (see figure) are established via CTD–curated chemical–gene interactions (e.g., gene A is associated with disease B because gene A has a curated interaction with chemical C, and chemical C has a curated association with disease B). Curated and inferred associations are identified, and help users develop hypotheses about mechanisms underlying environmental diseases.
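To make the inference rule in the quoted documentation concrete, here is a minimal sketch of it on toy data. The function name `infer_gene_disease` and the input pairs are hypothetical, and this is not CTD's actual pipeline; it only shows how joining curated chemical-gene pairs with curated chemical-disease pairs on the shared chemical produces inferred gene-disease pairs.

```python
from collections import defaultdict


def infer_gene_disease(chem_gene, chem_disease):
    """Join (chemical, gene) and (chemical, disease) pairs on the chemical.

    Returns a dict mapping each inferred (gene, disease) pair to the set of
    chemicals that support the inference. Hypothetical helper for illustration.
    """
    # Index the diseases associated with each chemical.
    diseases_by_chem = defaultdict(set)
    for chem, disease in chem_disease:
        diseases_by_chem[chem].add(disease)

    # Every gene that interacts with a chemical inherits that chemical's diseases.
    inferred = defaultdict(set)
    for chem, gene in chem_gene:
        for disease in diseases_by_chem.get(chem, ()):
            inferred[(gene, disease)].add(chem)
    return dict(inferred)


# Toy data using the names from the quoted example (not real CTD records).
chem_gene = [("chemical C", "gene A"), ("chemical D", "gene A")]
chem_disease = [
    ("chemical C", "disease B"),
    ("chemical D", "disease B"),
    ("chemical D", "disease E"),
]

inferred = infer_gene_disease(chem_gene, chem_disease)
for (gene, disease), chems in sorted(inferred.items()):
    print(gene, "->", disease, "via", sorted(chems))
```

Because every gene linked to a chemical picks up every disease linked to that chemical, the number of inferred pairs can grow much faster than the number of curated pairs, which is consistent with the very large row count discussed in this thread.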

zheng-liu commented 4 years ago

My understanding is:

ecwood commented 4 years ago

@zheng-liu Can you please post a link to the source with the 77.7 million gene-disease relationships? On http://ctdbase.org/about/dataStatus.go, I found:

[image]

Also, I noticed you recommended adding inferred edges with an inference score greater than or equal to 8.0. Could you please explain the significance of 8.0?

Finally, below is the header and first association in the TSV download:

[image]

I noticed there is no predicate label. What sort of relation would this fall under? (biolink:gene_to_disease_association_subject?)

Thank you!

saramsey commented 3 years ago

Hi @zheng-liu, could you please respond to Erica's question when you have time? Thanks.

zheng-liu commented 3 years ago

@ericawood, yes, I noticed that the gene-disease count reported by the CTD database is about 27MM; however, the number of entries in the file CTD_genes_diseases.tsv is now 83MM (previously 77MM). They may apply extra criteria to filter the gene-disease relations.

Because of the massive number of gene-disease relations, I tried to filter and select entries based on the following principles:

As for the question of what sort of relation this falls under: are you asking about gene--[which edge type?]--disease? Since this file contains so many gene-disease relation entries, I guess they may have multiple edge types. The DirectEvidence column might give us some clue.
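For discussion, here is a rough sketch of how such a filter could be streamed over the file without loading all rows into memory. It rests on several assumptions: the column order in `CTD_FIELDS` is assumed rather than taken from this thread, and should be checked against the header of the actual download; comment lines are assumed to start with `#`; a non-empty DirectEvidence value is treated as marking a curated row; and the 8.0 cutoff is simply the threshold mentioned earlier in this thread, not a value with an established justification. The sample rows are made up.

```python
import csv
import io

# Threshold mentioned earlier in this thread; its significance is still an open question.
MIN_INFERENCE_SCORE = 8.0

# ASSUMED column order for CTD_genes_diseases.tsv; verify against the real file header.
CTD_FIELDS = [
    "GeneSymbol", "GeneID", "DiseaseName", "DiseaseID", "DirectEvidence",
    "InferenceChemicalName", "InferenceScore", "OmimIDs", "PubMedIDs",
]


def filter_ctd_rows(lines, fieldnames=CTD_FIELDS, min_score=MIN_INFERENCE_SCORE):
    """Yield curated rows, plus inferred rows scoring at least min_score."""
    # Skip blank lines and '#' comment/header lines (assumed file convention).
    data_lines = (ln for ln in lines if ln.strip() and not ln.startswith("#"))
    reader = csv.DictReader(
        data_lines, fieldnames=fieldnames, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        if row["DirectEvidence"]:
            # Assumption: a non-empty DirectEvidence marks a curated association.
            yield row
            continue
        try:
            score = float(row["InferenceScore"])
        except (TypeError, ValueError):
            continue  # missing or unparseable score: drop the row
        if score >= min_score:
            yield row


# Made-up sample rows for illustration only (not real CTD records).
sample = io.StringIO(
    "# Fields:\n"
    "# GeneSymbol\tGeneID\tDiseaseName\tDiseaseID\tDirectEvidence\t"
    "InferenceChemicalName\tInferenceScore\tOmimIDs\tPubMedIDs\n"
    "GENE1\t1\tdisease X\tMESH:D000001\tmarker/mechanism\t\t\t\t12345\n"
    "GENE2\t2\tdisease Y\tMESH:D000002\t\tchemical Z\t9.5\t\t111|222\n"
    "GENE3\t3\tdisease Z\tMESH:D000003\t\tchemical Z\t3.2\t\t333\n"
)

kept = list(filter_ctd_rows(sample))
for row in kept:
    print(row["GeneSymbol"], row["DiseaseID"], row["DirectEvidence"] or row["InferenceScore"])
```

On the real file, the same generator could be fed an open file handle for CTD_genes_diseases.tsv in place of the `io.StringIO` sample. Keeping the DirectEvidence value on each retained row would also leave room to map different evidence types to different edge types later.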