[Feature Request]: Request for addition of implementation documentation

anubane commented 4 months ago

As I set about independently implementing the concept presented in the paper "XGDAG: explainable gene–disease associations via graph neural networks", I realised that the official codebase has too many experimental setups spread across various files.

To re-implement the paper independently and reproduce its results, I am currently following the step-by-step workflow below:

- [ ] Data collection and preprocessing
  - [ ] Collect GDA from DisGeNet
  - [ ] Perform pre-processing to select 10 diseases, retaining those with a high count of seed (human) genes that are also in BioGRID
- [ ] NIAPU pseudo label assignment
  - [ ] construct gene similarity matrix
  - [ ] remove edges with weak connections
  - [ ] assign initial probabilities
  - [ ] Markovian diffusion
  - [ ] stationary distribution of remaining pseudo labels
- [ ] Train a Graph Neural Network
  - [ ] Set up dataloader (tensor batches)
  - [ ] prepare loss function, optimizer, hyperparameters
  - [ ] Design the model architecture (7 layer GraphSAGE, no sampling)
- [ ] Explaining the learnt graph
  - [ ] for each positive label gene, extract a subgraph
  - [ ] only retain likely positive genes in these subgraphs
  - [ ] calculate (gene occurrence count across candidate subgraphs, cumulative importance score of gene across subgraphs) -> $$(M_i, S_i)$$
  - [ ] sort and rank genes
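
A minimal sketch of the last two checklist items, assuming each explained subgraph has already been reduced to its likely positive genes and is available as a mapping from gene to importance score. The function name `rank_candidates`, the input layout, and the tie-breaking order (count first, then cumulative score) are assumptions made for illustration, not something taken from the XGDAG code.

```python
from collections import defaultdict

def rank_candidates(subgraphs):
    """Rank genes by (M_i, S_i).

    subgraphs: list of dicts mapping gene -> importance score, one dict per
               explained positive gene (hypothetical layout).
    """
    counts = defaultdict(int)     # M_i: number of subgraphs containing gene i
    scores = defaultdict(float)   # S_i: cumulative importance of gene i
    for sub in subgraphs:
        for gene, importance in sub.items():
            counts[gene] += 1
            scores[gene] += importance
    # Sort by occurrence count, breaking ties by cumulative importance,
    # both in descending order (an assumed ordering).
    return sorted(counts, key=lambda g: (counts[g], scores[g]), reverse=True)

# Toy input with made-up gene names and scores.
subgraphs = [
    {"G1": 0.9, "G2": 0.4},
    {"G2": 0.5, "G3": 0.7},
    {"G2": 0.1, "G3": 0.2},
]
ranking = rank_candidates(subgraphs)
print(ranking)  # ['G2', 'G3', 'G1']
```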

To follow this workflow, I need to map each step to the files, classes, and functions in the actual source code.

Could you please add the necessary documentation?

anubane commented 3 months ago

I still do not have clarity on multiple aspects:

- The data available on DisGeNet seems different from the `all_gene_disease_associations.tsv` file; how was that file created?
- How are the input features for the NN created, and in which code file?

anubane commented 3 months ago

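| Columns I extracted --> from disgenet_2020.db | `all_gene_disease_associations.tsv` | `disease_associations.tsv` | `gene_associations.tsv` |
| --- | --- | --- | --- |
| geneId | | geneId | geneID |
| geneSymbol | | geneSymbol | |
| | | | geneName |
| | | | geneDescription |
| DSI | | DSI | DSI |
| DPI | | DPI | DPI |
| diseaseId | diseaseId | | diseaseId |
| diseaseName | diseaseName | | diseaseName |
| diseaseType | diseaseType | | diseaseType |
| diseaseClass | diseaseClass | | diseaseClass |
| diseaseSemanticType | | | |
| | | | diseaseClassName |
| | | | association |
| | | | associationType |
| score | | | score |
| EI | | | EI |
| | | | EL |
| | | | year |
| YearInitial | | | |
| YearFinal | | | |
| | | | pmid |
| NofPmids | NofPmids | NofPmids | |
| NofSnps | | | |
| source | | | source |
| | NofGenes | | |
| | diseaseSemanticType | | |
| | | PLI | |
| | | protein_class_name | |
| | | protein_class | |
| | | NofDiseases | |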
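
A column comparison of this kind can be regenerated from the header rows of the files. The sketch below assumes each file is tab-separated with its column names on the first line; `column_presence` is a hypothetical helper, and the two in-memory "files" are made-up stand-ins rather than the real DisGeNET headers.

```python
import csv
import io

def column_presence(sources):
    """Map each column name to the list of sources whose header contains it.

    sources: dict of label -> open text stream of a tab-separated file.
    """
    presence = {}
    for label, stream in sources.items():
        header = next(csv.reader(stream, delimiter="\t"))
        for column in header:
            presence.setdefault(column, []).append(label)
    return presence

# Hypothetical headers; replace with open("...tsv") handles for real files.
sources = {
    "file_a.tsv": io.StringIO("geneId\tdiseaseId\tscore\n"),
    "file_b.tsv": io.StringIO("geneId\tDSI\tDPI\n"),
}
presence = column_presence(sources)
for column, labels in presence.items():
    print(column, "->", ", ".join(labels))
```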

AndMastro commented 3 months ago

Thank you for reaching out!

Regarding the data gathering, GDAs for the ten diseases were gathered from the NIAPU repo: https://github.com/AndMastro/NIAPU/.

> I still do not have clarity on multiple aspects:
>
> The data available on DisGeNet seems different from the all_gene_disease_associations.tsv file; how was it created?

The `all_gene_disease_associations.tsv` file was downloaded from the DisGeNET website back when it offered different file formats. The data it currently provides differ from what we used.

> The features for the input data to the NN, how are they created; in which code file?

The features were also gathered from the NIAPU repository.

Such features can be generated for additional diseases using the code in that repo, following the instructions it provides.

> - [ ] Data collection and preprocessing
>   - [ ] Collect GDA from DisGeNet
>   - [ ] Perform pre-processing to select 10 diseases, retaining those with a high count of seed (human) genes that are also in BioGRID

Most of the data we used were gathered from the NIAPU repository. The seed gene files for the 10 diseases were obtained by preprocessing the file `curated_gene_disease_associations.tsv` (found in the Datasets folder). Like `all_gene_disease_associations.tsv`, this file no longer seems to be available from DisGeNET. The BioGRID PPI is preprocessed with the script `CreateGraph.py`. We are currently working on making this step simpler and more straightforward.
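
To illustrate the kind of preprocessing described here, the sketch below pulls the genes associated with one disease out of a DisGeNET-style TSV and keeps only those present in a PPI node set. It does not reproduce the logic of `CreateGraph.py` or of the NIAPU repository: the helper `seed_genes_for_disease`, the sample rows, and the disease identifier are hypothetical, and the column names `geneSymbol` and `diseaseId` are taken from the column listing earlier in this thread.

```python
import csv
import io

def seed_genes_for_disease(tsv_stream, disease_id, ppi_genes):
    """Return gene symbols linked to disease_id that also appear in the PPI."""
    reader = csv.DictReader(tsv_stream, delimiter="\t")
    seeds = {row["geneSymbol"] for row in reader if row["diseaseId"] == disease_id}
    return sorted(seeds & ppi_genes)

# Toy stand-in for curated_gene_disease_associations.tsv (made-up rows).
sample = io.StringIO(
    "geneSymbol\tdiseaseId\n"
    "GENE_A\tD001\n"
    "GENE_B\tD001\n"
    "GENE_C\tD002\n"
)
ppi_genes = {"GENE_A", "GENE_C"}  # hypothetical BioGRID node set
seeds = seed_genes_for_disease(sample, "D001", ppi_genes)
print(seeds)  # ['GENE_A']
```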

> - [ ] NIAPU pseudo label assignment
>   - [ ] construct gene similarity matrix
>   - [ ] remove edges with weak connections
>   - [ ] assign initial probabilities
>   - [ ] Markovian diffusion
>   - [ ] stationary distribution of remaining pseudo labels

These steps can be performed using the code provided in the NIAPU repository. If you want to replicate our experiments, it is enough to use the provided feature files. Otherwise, you can use NIAPU to generate features for your diseases.
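
For readers unfamiliar with the diffusion step, here is a generic sketch of propagating initial probabilities over a thresholded similarity graph until they stop changing. It only illustrates the general idea of Markovian diffusion and is not the NIAPU implementation: the function name `diffuse`, the threshold value, and the toy matrix are assumptions, and convergence relies on the toy graph being connected with self-loops.

```python
def diffuse(similarity, initial, threshold=0.5, tol=1e-10, max_iter=1000):
    """Iterate p <- p @ T until the distribution is (numerically) stationary."""
    n = len(similarity)
    transition = []
    for i in range(n):
        # Drop weak edges, then row-normalise into transition probabilities.
        row = [s if s >= threshold else 0.0 for s in similarity[i]]
        total = sum(row)
        if total > 0:
            transition.append([v / total for v in row])
        else:
            # A node with no surviving edges keeps its own probability mass.
            transition.append([1.0 if j == i else 0.0 for j in range(n)])
    p = list(initial)
    for _ in range(max_iter):
        nxt = [sum(p[i] * transition[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p

# Toy symmetric similarity matrix (made-up values); all mass starts on node 0.
similarity = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.6],
    [0.1, 0.6, 1.0],
]
stationary = diffuse(similarity, initial=[1.0, 0.0, 0.0])
print([round(x, 4) for x in stationary])
```

Because the thresholded weights in this toy example are symmetric, the stationary probability of each node is proportional to the sum of its remaining edge weights, which gives a quick sanity check on the output.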

For the rest of the code/workflow, we are currently working on simplifying the codebase. Stay tuned 😎

GiDeCarlo / XGDAG

[Feature Request]: Request for addition of implementation documentation #1