GiDeCarlo / XGDAG


[Feature Request]: Request for addition of implementation documentation #1

Open anubane opened 4 months ago

anubane commented 4 months ago

While working on an independent implementation of the method presented in the paper "XGDAG: explainable gene–disease associations via graph neural networks", I realised that the official codebase contains many experimental setups spread across various files.

To re-implement the paper independently and reproduce its results, I am currently following the step-by-step workflow below:

To follow this workflow, I need to map each step to the files, classes, and functions in the actual source code.

Thus, I request you to add the necessary documentation.

anubane commented 3 months ago

I still do not have clarity on multiple aspects:

  1. The data available on DisGeNET seems different from the all_gene_disease_associations.tsv file; how was it created?
  2. The features for the input data to the NN, how are they created; in which code file?

anubane commented 3 months ago

Columns I extracted --> from disgenet_2020.db all_gene_disease_associations.tsv disease_associations.tsv gene_associations.tsv Columns I extracted
geneId geneId geneID
geneSymbol geneSymbol
geneName
geneDescription
DSI DSI DSI
DPI DPI DPI
diseaseId diseaseId diseaseId
diseaseName diseaseName diseaseName
diseaseType diseaseType diseaseType
diseaseClass diseaseClass diseaseClass
diseaseSemanticType
diseaseClassName
association
associationType
score score
EI EI
EL
year
YearInitial
YearFinal
pmid
NofPmids NofPmids NofPmids
NofSnps
source source
NofGenes
diseaseSemanticType
PLI
protein_class_name
protein_class
NofDiseases

AndMastro commented 3 months ago

Thank you for reaching out!

Regarding the data gathering, GDAs for the ten diseases were gathered from the NIAPU repo: https://github.com/AndMastro/NIAPU/.

I still do not have clarity on multiple aspects:

  1. The data available on DisGeNET seems different from the all_gene_disease_associations.tsv file; how was it created?

The all_gene_disease_associations.tsv file was downloaded from the DisGeNET website when they were offering different file formats. The data they currently provide differ from the data we used.

  2. The features for the input data to the NN, how are they created; in which code file?

The features were also gathered from the NIAPU repository.

To generate such features for additional diseases, use the code in that repository and follow the instructions provided there.

  • [ ] Data collection and preprocessing

    • [ ] Collect GDAs from DisGeNET
    • [ ] Perform pre-processing to separate 10 diseases, retain diseases with high seed (human) gene count that are also in BioGRID,

Most of the data we used came from the NIAPU repository. The seed gene files for the 10 diseases were obtained by preprocessing the file curated_gene_disease_associations.tsv (found in the Datasets folder). Like all_gene_disease_associations.tsv, this file no longer seems to be available from DisGeNET. The BioGRID PPI is preprocessed using the script CreateGraph.py. We are currently working on making this step simpler and more straightforward.
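As a rough illustration of the seed-gene step described above, here is a minimal sketch of filtering a DisGeNET-style TSV for one disease. It is not the preprocessing actually used for XGDAG: the function `seed_genes`, the `min_score` cut-off, the `ppi_genes` filter, and the sample rows are all hypothetical, and the column names (`geneSymbol`, `diseaseId`, `score`) are assumed from the column list posted earlier in this thread.

```python
import csv
import io


def seed_genes(tsv_file, disease_id, min_score=0.0, ppi_genes=None):
    """Return the sorted gene symbols associated with one disease.

    tsv_file   : open text file in a DisGeNET-style TSV layout (assumed columns).
    disease_id : value to match in the 'diseaseId' column.
    min_score  : hypothetical cut-off on the 'score' column.
    ppi_genes  : optional set of gene symbols present in the PPI network;
                 when given, genes outside it are dropped.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    genes = set()
    for row in reader:
        if row["diseaseId"] != disease_id:
            continue
        if float(row["score"]) < min_score:
            continue
        if ppi_genes is not None and row["geneSymbol"] not in ppi_genes:
            continue
        genes.add(row["geneSymbol"])
    return sorted(genes)


# Made-up rows standing in for curated_gene_disease_associations.tsv.
sample = io.StringIO(
    "geneId\tgeneSymbol\tdiseaseId\tdiseaseName\tscore\n"
    "1\tGENE_A\tC0000001\tDisease one\t0.40\n"
    "2\tGENE_B\tC0000001\tDisease one\t0.10\n"
    "3\tGENE_C\tC0000002\tDisease two\t0.70\n"
)
print(seed_genes(sample, "C0000001", min_score=0.3))
```

With a real file, you would pass `open(path, newline="")` instead of the in-memory sample, and build `ppi_genes` from whatever gene identifiers your BioGRID graph uses.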

  • [ ] NIAPU pseudo label assignment

    • [ ] construct gene similarity matrix
    • [ ] remove edges with weak connections
    • [ ] assign initial probabilities
    • [ ] Markovian diffusion
    • [ ] stationary distribution of remaining pseudo labels

These steps can be performed using the code provided in the NIAPU repository. If you want to replicate our experiments, it is enough to use the provided feature files. Otherwise, you can use NIAPU to generate features for your diseases.
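For readers who only want a feel for what the checklist steps above involve, the following is a generic, pure-Python sketch of similarity-graph diffusion. It is not the NIAPU implementation: the cosine similarity, the `threshold` and `restart` values, the random-walk-with-restart update, and the function names are all assumptions chosen for illustration, and the thread does not describe how NIAPU turns the resulting scores into pseudo-labels.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def diffuse(features, seeds, threshold=0.5, restart=0.3, tol=1e-10, max_iter=1000):
    """Return a stationary score per gene from a random walk with restart.

    features : list of per-gene feature vectors (hypothetical input).
    seeds    : set of indices of known disease genes.
    """
    if not seeds:
        raise ValueError("at least one seed gene is required")
    n = len(features)

    # 1. Gene similarity matrix (no self-similarity).
    sim = [[cosine(features[i], features[j]) if i != j else 0.0
            for j in range(n)] for i in range(n)]

    # 2. Remove weak connections.
    sim = [[s if s >= threshold else 0.0 for s in row] for row in sim]

    # Row-normalise into a transition matrix; isolated genes keep their mass.
    trans = []
    for i, row in enumerate(sim):
        total = sum(row)
        if total:
            trans.append([s / total for s in row])
        else:
            trans.append([1.0 if j == i else 0.0 for j in range(n)])

    # 3. Initial probabilities: uniform over the seed genes.
    p0 = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]

    # 4. Markovian diffusion with restart, iterated until the scores settle.
    p = p0[:]
    for _ in range(max_iter):
        nxt = [restart * p0[j]
               + (1.0 - restart) * sum(p[i] * trans[i][j] for i in range(n))
               for j in range(n)]
        converged = max(abs(a - b) for a, b in zip(nxt, p)) < tol
        p = nxt
        if converged:
            break

    # 5. The converged vector approximates the stationary distribution.
    return p


# Four toy genes: 0 and 1 are similar, 2 and 3 are similar; gene 0 is the seed.
scores = diffuse([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]], seeds={0})
print(scores)
```

In this toy case the weak cross-group edges fall below the threshold, so all probability mass stays on genes 0 and 1. Ranking the non-seed genes by score is one plausible way to derive graded labels, but for the actual procedure please refer to the NIAPU repository.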

For the rest of the code/workflow, we are currently working on simplifying the codebase. Stay tuned 😎