ETL CTD gene-disease relationships into KG2

saramsey commented 4 years ago

@zheng-liu reports that there are 77.7 million (!) gene-disease relationships in CTD.

zheng-liu commented 4 years ago

This astronomical number comes from the fact that the CTD database collects both curated and inferred gene-disease relations. Please refer to the following documentation from their official website:

Gene–Disease Associations

Gene-disease associations may be inferred via curated chemical-gene and chemical-disease associations. CTD contains curated and inferred gene–disease associations. Curated gene–disease associations are extracted from the published literature by CTD biocurators, or are derived from the OMIM database using the mim2gene file from the NCBI Gene database. Inferred associations (see figure) are established via CTD–curated chemical–gene interactions (e.g., gene A is associated with disease B because gene A has a curated interaction with chemical C, and chemical C has a curated association with disease B). Curated and inferred associations are identified, and help users develop hypotheses about mechanisms underlying environmental diseases.

zheng-liu commented 4 years ago

My understanding is:

We should definitely include all the curated gene-disease relations, since they have been verified by other researchers.
For the inferred links, we select cautiously by the inference score of each entry (e.g. score >= 8.0), so that we end up with a moderately sized gene-disease relation set that balances accuracy against computational cost.

ecwood commented 4 years ago

@zheng-liu Can you please post a link to the source with 77.7 million gene-to-disease relationships? On http://ctdbase.org/about/dataStatus.go, I found:

Also, I noticed you recommended adding inferred edges with an inference score greater than or equal to 8.0. Could you please explain the significance of 8.0?

Finally, below is the header and first association in the TSV download:

I noticed there is no predicate label. What sort of relation would this fall under? (biolink:gene_to_disease_association_subject?)

Thank you!

saramsey commented 3 years ago

Hi @zheng-liu could you please respond to Erica's question when you have time? Thanks.

zheng-liu commented 3 years ago

@ericawood, yes, I noticed that the gene-disease statistic reported by the CTD database is about 27MM, whereas the file CTD_genes_diseases.tsv currently has 83MM entries (previously 77MM). They may apply extra criteria to filter the gene-disease relations before reporting that number.

Because the set of gene-disease relations is so large, I tried to filter and select entries based on the following principles:

The curated ones must be included, since they have been verified.
We include a manageable number of inferred gene-disease relations by filtering on the score (InferenceScore column), keeping as many relations as we want. As I recall, 8.0 was the threshold I felt was suitable for this task. We may change this filter as the database is updated. So the score of 8.0 is not a gold standard, but a temporary filtering threshold chosen to make this massive dataset manageable.
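
A minimal sketch of the two principles above, for illustration only. It assumes that a row counts as curated when its DirectEvidence field is non-empty, that the file has nine tab-separated columns in the order listed in FIELDS, and that comment lines start with "#". Only DirectEvidence and InferenceScore are named in this thread; the other column names, the function names, and the sample rows are hypothetical.

```python
import csv
import io

# Assumed column order; only DirectEvidence and InferenceScore are named in the thread.
FIELDS = ["GeneSymbol", "GeneID", "DiseaseName", "DiseaseID",
          "DirectEvidence", "InferenceChemicalName", "InferenceScore",
          "OmimIDs", "PubMedIDs"]


def keep_relation(row, min_score=8.0):
    """Keep curated rows; keep inferred rows only at or above min_score."""
    if row["DirectEvidence"].strip():  # assumption: non-empty means curated
        return True
    try:
        return float(row["InferenceScore"]) >= min_score
    except ValueError:
        return False  # inferred row with a missing or unparsable score


def filter_relations(lines, min_score=8.0):
    """Yield kept rows from an iterable of TSV lines, skipping '#' comments."""
    data = (ln for ln in lines if ln.strip() and not ln.startswith("#"))
    reader = csv.DictReader(data, fieldnames=FIELDS, delimiter="\t",
                            restval="", quoting=csv.QUOTE_NONE)
    for row in reader:
        if keep_relation(row, min_score):
            yield row


# Made-up sample rows: one curated, one inferred above 8.0, one inferred below.
SAMPLE = (
    "# comment line\n"
    "GENE1\t1\tDisease A\tMESH:D000001\tmarker/mechanism\t\t\t\t123\n"
    "GENE2\t2\tDisease B\tMESH:D000002\t\tChemical X\t9.5\t\t456\n"
    "GENE3\t3\tDisease C\tMESH:D000003\t\tChemical Y\t3.2\t\t789\n"
)
kept = list(filter_relations(io.StringIO(SAMPLE)))
```

Because the rows are streamed one at a time, the same function could be pointed at an open file handle for CTD_genes_diseases.tsv without loading all 83MM entries into memory, and min_score can be raised or lowered to tune the size of the result.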

As for the question of what sort of relation this falls under: are you asking about gene--[which edge type?]--disease? Since this file contains so many gene-disease relation entries, I would guess it covers multiple edge types. The DirectEvidence column might give us some clues.
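
One way to look for those clues is to tabulate the distinct DirectEvidence values. The sketch below is an assumption-laden illustration: it takes DirectEvidence to be the fifth tab-separated column, treats "|" as the separator when a field holds several values, and labels rows with an empty field "(inferred only)". The function name, the column index, and the sample rows are hypothetical.

```python
import csv
from collections import Counter

DIRECT_EVIDENCE_INDEX = 4  # assumed 0-based position of the DirectEvidence column


def direct_evidence_counts(lines, sep="|"):
    """Count DirectEvidence labels over an iterable of TSV lines."""
    counts = Counter()
    data = (ln for ln in lines if ln.strip() and not ln.startswith("#"))
    for fields in csv.reader(data, delimiter="\t", quoting=csv.QUOTE_NONE):
        value = fields[DIRECT_EVIDENCE_INDEX] if len(fields) > DIRECT_EVIDENCE_INDEX else ""
        # A field may hold several labels; an empty field means no direct evidence.
        labels = [v for v in value.split(sep) if v] or ["(inferred only)"]
        counts.update(labels)
    return counts


# Made-up sample rows for illustration.
sample_lines = [
    "# comment line\n",
    "G1\t1\tD1\tX:1\tmarker/mechanism\t\t\t\t1\n",
    "G2\t2\tD2\tX:2\ttherapeutic\t\t\t\t2\n",
    "G3\t3\tD3\tX:3\t\tChem\t9.0\t\t3\n",
    "G4\t4\tD4\tX:4\tmarker/mechanism|therapeutic\t\t\t\t4\n",
]
evidence_counts = direct_evidence_counts(sample_lines)
```

Each distinct label found this way would be a candidate for its own edge type, while the "(inferred only)" rows would need a separate decision about which predicate, if any, applies.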

RTXteam / RTX-KG2

ETL CTD gene-disease relationships into KG2 #39