Open saramsey opened 4 years ago
This astronomical number comes from the fact that the CTD database collects both curated and inferred gene-disease relations. Please refer to the following documentation from their official website:
Gene–Disease Associations

[Figure: Gene-disease associations may be inferred via curated chemical-gene and chemical-disease associations.]

CTD contains curated and inferred gene–disease associations. Curated gene–disease associations are extracted from the published literature by CTD biocurators, or are derived from the OMIM database using the mim2gene file from the NCBI Gene database. Inferred associations (see figure) are established via CTD–curated chemical–gene interactions (e.g., gene A is associated with disease B because gene A has a curated interaction with chemical C, and chemical C has a curated association with disease B). Curated and inferred associations are identified, and help users develop hypotheses about mechanisms underlying environmental diseases.
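The inference rule quoted above amounts to a join on the shared chemical, which is also why the inferred set grows so quickly. Below is a minimal sketch of that join, not CTD's actual pipeline: the function name `infer_gene_disease` and the input format (plain `(chemical, gene)` and `(chemical, disease)` pairs) are assumptions made for illustration, and it does not compute CTD's inference score.

```python
from collections import defaultdict


def infer_gene_disease(chem_gene, chem_disease):
    """Join curated chemical-gene and chemical-disease pairs on the chemical.

    Hypothetical helper for illustration only. Returns a dict mapping each
    inferred (gene, disease) pair to the set of chemicals supporting it.
    """
    # Index genes by the chemical they have a curated interaction with.
    genes_by_chem = defaultdict(set)
    for chem, gene in chem_gene:
        genes_by_chem[chem].add(gene)

    # Every disease associated with a chemical is linked to every gene
    # that interacts with that same chemical.
    inferred = defaultdict(set)
    for chem, disease in chem_disease:
        for gene in genes_by_chem.get(chem, ()):
            inferred[(gene, disease)].add(chem)
    return dict(inferred)


# The example from the quoted text: gene A -- chemical C -- disease B.
example = infer_gene_disease(
    [("chemical C", "gene A")],
    [("chemical C", "disease B")],
)
print(example)
```

Because each chemical contributes (number of its genes) × (number of its diseases) pairs, a handful of heavily studied chemicals can account for a very large number of inferred associations.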
My understanding is:
@zheng-liu Can you please post a link to the source with the 77.7 million gene-to-disease relationships? On http://ctdbase.org/about/dataStatus.go, I found:

Also, I noticed you recommended adding inferred edges with an inference score greater than or equal to 8.0. Could you please explain the significance of 8.0?

Finally, below is the header and first association in the TSV download:

I noticed there is no predicate label. What sort of relation would this fall under? (biolink:gene_to_disease_association_subject?)
Thank you!
Hi @zheng-liu could you please respond to Erica's question when you have time? Thanks.
@ericawood, yes, I noticed that the gene-disease statistic reported by the CTD database is about 27MM, whereas the file `CTD_genes_diseases.tsv` currently has 83MM entries (previously 77MM). They may apply extra criteria to filter the gene-disease relations.
Because of the massive size of the gene-disease relation set, I tried to filter and select entries on the following principles: filter by inference score (the `InferenceScore` column) and keep the number of relations at a size we can work with. As I remember, 8.0 was the threshold I felt was suitable for this task. We may change this filter as the database is updated. So basically, the score 8.0 is not a gold standard, but a temporary filtering threshold we chose to accommodate this massive dataset.

As for what sort of relation this falls under, are you asking which edge type applies in gene--[which edge type?]--disease? I guess that since this file contains so many gene-disease relation entries, it may have multiple edge types. The `DirectEvidence` column might give us some clue.
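The filtering described above could look roughly like the following. This is only a sketch under stated assumptions: `keep_row` and `filter_rows` are hypothetical helper names, the rows are assumed to be dicts as produced by `csv.DictReader` on a tab-separated file, and treating a non-empty `DirectEvidence` value as "curated, always keep" is my assumption rather than something confirmed in this thread. The sample data is made up, and its `GeneSymbol` and `DiseaseName` columns are placeholders.

```python
import csv
import io

SCORE_THRESHOLD = 8.0  # working cutoff from this thread, not a gold standard


def keep_row(row, threshold=SCORE_THRESHOLD):
    """Return True for rows that should become edges."""
    # Assumption: a non-empty DirectEvidence marks a curated association.
    if (row.get("DirectEvidence") or "").strip():
        return True
    # Otherwise treat the row as inferred and apply the score cutoff.
    score = (row.get("InferenceScore") or "").strip()
    try:
        return float(score) >= threshold
    except ValueError:
        return False  # missing or non-numeric score


def filter_rows(rows, threshold=SCORE_THRESHOLD):
    """Lazily filter rows so a very large file never has to sit in memory."""
    return (row for row in rows if keep_row(row, threshold))


# Made-up sample standing in for the real download.
SAMPLE = (
    "GeneSymbol\tDiseaseName\tDirectEvidence\tInferenceScore\n"
    "G1\tD1\tmarker/mechanism\t\n"
    "G2\tD2\t\t8.0\n"
    "G3\tD3\t\t7.99\n"
    "G4\tD4\t\t\n"
)

reader = csv.DictReader(io.StringIO(SAMPLE), delimiter="\t")
kept = [row["GeneSymbol"] for row in filter_rows(reader)]
print(kept)
```

Keeping the threshold as a parameter makes it easy to revisit the cutoff when the database is updated. The real download may also include comment lines before the header, which would need to be skipped before handing the file to `csv.DictReader`.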
@zheng-liu reports that there are 77.7 million (!) gene-disease relationships in CTD.