Open anubane opened 4 months ago
I still do not have clarity on multiple aspects:
- The data available on DisGeNet seems different from the `all_gene_disease_associations.tsv` file; how was it created?

Columns I extracted --> from disgenet_2020.db | all_gene_disease_associations.tsv | disease_associations.tsv | gene_associations.tsv | Columns I extracted |
---|---|---|---|---|
geneId | geneId | geneID | ||
geneSymbol | geneSymbol | |||
geneName | ||||
geneDescription | ||||
DSI | DSI | DSI | ||
DPI | DPI | DPI | ||
diseaseId | diseaseId | diseaseId | ||
diseaseName | diseaseName | diseaseName | ||
diseaseType | diseaseType | diseaseType | ||
diseaseClass | diseaseClass | diseaseClass | ||
diseaseSemanticType | ||||
diseaseClassName | ||||
association | ||||
associationType | ||||
score | score | |||
EI | EI | |||
EL | ||||
year | ||||
YearInitial | ||||
YearFinal | ||||
pmid | ||||
NofPmids | NofPmids | NofPmids | ||
NofSnps | ||||
source | source | |||
NofGenes | ||||
diseaseSemanticType | ||||
PLI | ||||
protein_class_name | ||||
protein_class | ||||
NofDiseases |
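
A column comparison like the one in the table can be produced by reading only the header row of each file. The snippet below is a minimal sketch of that idea using the Python standard library. The helper names `tsv_columns` and `compare_columns` are hypothetical, and the toy headers are invented for illustration; they are not the real layouts of the DisGeNET files.

```python
import csv
import io

def tsv_columns(handle):
    """Return the header row of a tab-separated file as a list of column names."""
    return next(csv.reader(handle, delimiter="\t"))

def compare_columns(headers):
    """Map each column name to the sorted list of sources that contain it."""
    presence = {}
    for source, columns in headers.items():
        for column in columns:
            presence.setdefault(column, []).append(source)
    return {column: sorted(sources) for column, sources in presence.items()}

# Toy headers standing in for the real files (assumption: NOT the actual layouts).
sample_files = {
    "all_gene_disease_associations.tsv": "geneId\tgeneSymbol\tdiseaseId\tscore\n",
    "disease_associations.tsv": "diseaseId\tdiseaseName\tNofGenes\n",
}

headers = {name: tsv_columns(io.StringIO(text)) for name, text in sample_files.items()}
column_presence = compare_columns(headers)

for column, sources in sorted(column_presence.items()):
    print(column, "->", ", ".join(sources))
```

With real files, `io.StringIO(text)` would be replaced by `open(path, newline="")`.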
Thank you for reaching out!
Regarding the data gathering, GDAs for the ten diseases were gathered from the NIAPU repo: https://github.com/AndMastro/NIAPU/.
I still do not have clarity on multiple aspects:
- The data available on DisGeNet seems different from the `all_gene_disease_associations.tsv` file; how was it created?
The `all_gene_disease_associations.tsv` file was downloaded from the DisGeNET website at a time when it offered several file formats. The data DisGeNET currently provides differs from what we used.
- The features for the input data to the NN, how are they created; in which code file?
The features were also gathered from the NIAPU repository.
To generate these features for additional diseases, use the code in that repository and follow the instructions it provides.
[ ] Data collection and preprocessing
- [ ] Collect GDA from DisGeNet
- [ ] Perform pre-processing to separate 10 diseases, retain diseases with high seed (human) gene count that are also in BioGRID,
The majority of the data we used were gathered from the NIAPU repository. The seed gene files for the 10 diseases were obtained by preprocessing the file `curated_gene_disease_associations.tsv` (found in the `Datasets` folder). Like the `all_gene_disease_associations.tsv` file, this seems to be no longer available from DisGeNET. The preprocessing of the BioGRID PPI is performed by the script `CreateGraph.py`. We are currently working on making this step simpler and more straightforward.
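For readers who want a feel for this kind of preprocessing, here is a minimal sketch of selecting seed genes per disease from a GDA table while keeping only genes that also appear in a PPI network. It is not the code from `CreateGraph.py` or from the NIAPU repository. The function name `seed_genes_by_disease`, the `min_seed_genes` threshold, and the toy gene and disease identifiers are all hypothetical; the column names `geneSymbol` and `diseaseId` are assumed to match the file header.

```python
import csv
import io
from collections import defaultdict

def seed_genes_by_disease(gda_handle, ppi_genes, min_seed_genes):
    """Group genes by disease, keep only genes found in the PPI,
    and drop diseases with fewer than `min_seed_genes` remaining genes."""
    genes = defaultdict(set)
    for row in csv.DictReader(gda_handle, delimiter="\t"):
        if row["geneSymbol"] in ppi_genes:
            genes[row["diseaseId"]].add(row["geneSymbol"])
    return {
        disease: sorted(symbols)
        for disease, symbols in genes.items()
        if len(symbols) >= min_seed_genes
    }

# Toy GDA table and PPI gene set (assumption: invented identifiers, not real data).
toy_gda = (
    "geneSymbol\tdiseaseId\n"
    "GENE_A\tDISEASE_1\n"
    "GENE_B\tDISEASE_1\n"
    "GENE_C\tDISEASE_1\n"
    "GENE_A\tDISEASE_2\n"
)
toy_ppi_genes = {"GENE_A", "GENE_B"}

seeds = seed_genes_by_disease(io.StringIO(toy_gda), toy_ppi_genes, min_seed_genes=2)
print(seeds)
```

In this toy run, `GENE_C` is dropped because it is absent from the PPI set, and `DISEASE_2` is dropped because only one of its genes remains.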
[ ] NIAPU pseudo label assignment
- [ ] construct gene similarity matrix
- [ ] remove edges with weak connections
- [ ] assign initial probabilities
- [ ] Markovian diffusion
- [ ] stationary distribution of remaining pseudo labels
These steps can be performed using the code provided in the NIAPU repository. If you want to replicate our experiments, it is enough to use the provided feature files. Otherwise, you can use NIAPU to generate features for your diseases.
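To make the listed steps more concrete, the following is a generic sketch of diffusion on a thresholded similarity graph. It is not the NIAPU implementation: it uses a plain random walk with restart as a stand-in for the Markovian diffusion step, and the function names, the `cutoff` and `restart` values, and the 4-gene similarity matrix are all assumptions made for illustration.

```python
def threshold_graph(similarity, cutoff):
    """Zero out weak connections (and self-loops) in a square similarity matrix."""
    n = len(similarity)
    return [
        [similarity[i][j] if i != j and similarity[i][j] >= cutoff else 0.0
         for j in range(n)]
        for i in range(n)
    ]

def row_normalize(weights):
    """Turn edge weights into transition probabilities; isolated nodes stay put."""
    transition = []
    for i, row in enumerate(weights):
        total = sum(row)
        if total == 0.0:
            transition.append([1.0 if j == i else 0.0 for j in range(len(row))])
        else:
            transition.append([w / total for w in row])
    return transition

def diffuse(transition, initial, restart=0.3, tol=1e-10, max_iter=1000):
    """Random walk with restart: iterate until the distribution stops changing."""
    p = list(initial)
    n = len(p)
    for _ in range(max_iter):
        nxt = [
            restart * initial[j]
            + (1 - restart) * sum(p[i] * transition[i][j] for i in range(n))
            for j in range(n)
        ]
        if max(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p

# Toy similarity matrix for 4 genes (assumption: invented values).
similarity = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.8, 0.05],
    [0.1, 0.8, 1.0, 0.7],
    [0.0, 0.05, 0.7, 1.0],
]
initial = [1.0, 0.0, 0.0, 0.0]  # all starting probability on the single seed gene

transition = row_normalize(threshold_graph(similarity, cutoff=0.5))
scores = diffuse(transition, initial)
print([round(s, 4) for s in scores])
```

With these toy values the thresholded graph is a chain, so the converged scores decrease with distance from the seed gene. Such scores could then be binned into pseudo-label classes, but how NIAPU does this should be taken from its repository.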
For the rest of the code/workflow, we are currently working on simplifying the codebase. Stay tuned 😎
As I set about independently implementing the concept presented in the paper titled "XGDAG: explainable gene–disease associations via graph neural networks", I realised that the official codebase has too many experimental setups spread across various files.
To independently re-implement the paper and recreate its results, I am currently following the step-by-step workflow below:
For this implementation workflow, I need to map the files, classes, functions in the actual source code.
Thus, I request you to add the necessary documentation.