cthoyt / obo-foundry-graph

Demonstrate combining all OBO Foundry ontologies via Bioregistry, Bioontologies, and ROBOT
MIT License

TransE embedding visualizations #2

Open LucaCappelletti94 opened 2 years ago

LucaCappelletti94 commented 2 years ago

While I cannot share the node embedding (for now) since I am gzipping it to upload it on internet archive, in the meantime I can share some visualizations on the TransE embedding made with the GraPE library.

TSNE decomposition and properties distribution of the OBO Foundry graph using the TransE node embedding:

- (a) Node degrees heatmap.
- (b) Detected node ontologies: 'NCBI Taxonomy' in blue, 'Unknown' in orange, 'CHEBI' in red, 'Foundational Model of Anatomy Ontology (subset)' in cyan, 'NCBI Gene' in green, 'PRotein Ontology (PRO)' in yellow, 'Gene Ontology' in purple, and the other 51 ontologies in pink. The node ontologies do not appear to form recognizable clusters (balanced accuracy: 53.39% ± 4.33%).
- (c) Connected components: 'Main component' in blue, 'Minor components' in orange, and 'Tuples' in red. The components do not appear to form recognizable clusters (balanced accuracy: 49.61% ± 7.01%).
- (d) Existent and non-existent edges: 'Existent' in blue and 'Non-existent' in orange. The existent and non-existent edges form some clusters (balanced accuracy: 79.03% ± 0.16%).
- (e) Euclidean distance heatmap. This metric is an outstanding edge prediction feature (balanced accuracy: 97.58% ± 0.08%).
- (f) Cosine similarity heatmap. This metric is an outstanding edge prediction feature (balanced accuracy: 97.59% ± 0.08%). Note that the cosine similarity has been shifted from the range [-1, 1] to the range [0, 2] so it can be visualized in a logarithmic heatmap.
- (g) Adamic-Adar heatmap. This metric may be considered an edge prediction feature (balanced accuracy: 56.40% ± 0.08%).
- (h) Jaccard Coefficient heatmap. This metric may be considered an edge prediction feature (balanced accuracy: 56.31% ± 0.07%).
- (i) Edge types: 'RDFS subClassOf' in blue, 'http://www.obofoundry.org/ro/ro.owl#has proper part' in orange, 'Ro 0002160' in red, 'PR#has gene template' in cyan, 'Ro 0000087' in green, 'Bfo 0000050' in yellow, 'Bfo 0000051' in purple, and the other 843 edge types in pink. The edge types do not appear to form recognizable clusters (balanced accuracy: 18.14% ± 2.30%).
- (j) Preferential Attachment heatmap. This metric may be considered an edge prediction feature (balanced accuracy: 55.04% ± 0.22%).
- (k) Resource Allocation Index heatmap. This metric may be considered an edge prediction feature (balanced accuracy: 56.42% ± 0.10%).
- (l) Euclidean distance distribution. Euclidean distance values are on the horizontal axis and edge counts are on the vertical axis, on a logarithmic scale.
- (m) Cosine similarity distribution. Cosine similarity values are on the horizontal axis and edge counts are on the vertical axis, on a logarithmic scale.
- (n) Adamic-Adar distribution. Adamic-Adar values are on the horizontal axis and edge counts are on the vertical axis, on a logarithmic scale.
- (o) Jaccard Coefficient distribution. Jaccard Coefficient values are on the horizontal axis and edge counts are on the vertical axis, on a logarithmic scale.
- (p) Preferential Attachment distribution. Preferential Attachment values are on the horizontal axis and edge counts are on the vertical axis, on a logarithmic scale.
- (q) Resource Allocation Index distribution. Resource Allocation Index values are on the horizontal axis and edge counts are on the vertical axis, on a logarithmic scale.

In the heatmaps (a, e, f, g, h, j, and k), low and high values appear in red and blue hues, respectively; intermediate values appear in yellow or cyan hues. The values are on a logarithmic scale. The separability considerations for panels (b) through (k) derive from evaluating a Decision Tree trained on five Monte Carlo holdouts, with a 70/30 split between training and test sets. We sampled 20,000 existing and 20,000 non-existing edges, and we sampled the non-existent edges' source and destination nodes while avoiding any disconnected nodes present in the graph, to avoid biases.
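The caption mentions shifting the cosine similarity from [-1, 1] to [0, 2] so that all values are non-negative and can be binned on a logarithmic color scale. A minimal pure-Python sketch of that shift (illustrative only, not the GraPE implementation):

```python
import math

def shifted_cosine_similarity(u, v):
    """Cosine similarity shifted from [-1, 1] to [0, 2] so the values
    are non-negative and can be shown on a logarithmic heatmap."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) + 1.0

# Opposite vectors map to 0, identical vectors map to 2.
print(shifted_cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # → 0.0
print(shifted_cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # → 2.0
```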

LucaCappelletti94 commented 2 years ago

Here is the TransE embedding.

cthoyt commented 2 years ago

Hi @LucaCappelletti94, this is quite pretty! Thanks for posting the embedding itself. Do you have the code that was used to generate this, and would you like to PR it in here?

Two big caveats:

LucaCappelletti94 commented 2 years ago

I absolutely understand that this graph has several issues and that TransE may not be the best embedding model for this task, but it still provides a decent lower bound. If we saw just a Gaussian ball instead of a sensible TSNE decomposition, I would be a bit more worried.

Which model would you consider best for this task? I can benchmark it against TransE quite easily and see how much it improves.
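For context on what TransE is actually optimizing: it models a triple (h, r, t) as a translation, h + r ≈ t, and scores triples by the distance between h + r and t. A minimal pure-Python sketch of the L2 scoring function, with hypothetical toy embeddings (not GraPE's implementation):

```python
import math

def transe_score(head, relation, tail):
    """TransE plausibility score: the L2 distance ||h + r - t||.
    Lower scores mean the triple is considered more plausible."""
    return math.sqrt(sum((h + r - t) ** 2
                         for h, r, t in zip(head, relation, tail)))

# Toy 3-dimensional embeddings (illustrative values only).
h = [0.1, 0.2, 0.3]
r = [0.4, 0.1, -0.1]
t = [0.5, 0.3, 0.2]
print(transe_score(h, r, t))  # close to 0, since h + r lands on t
```

Because relations are modeled as simple translations, TransE struggles with one-to-many and symmetric relations, which is one reason richer models can outperform it on ontology graphs.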

I'm heading out to get groceries; I'll write up the code here by this evening. Can I consider this graph URL stable, so that I can add it to ensmallen's automatic retrieval?

LucaCappelletti94 commented 2 years ago

Hopefully, with improved versions of the graph, we will also see improved embeddings; this is just a lower bound.

cthoyt commented 2 years ago

I wouldn't consider this stable. I'll set up Zenodo dumping soon.

LucaCappelletti94 commented 2 years ago

One note: I just noticed that I inverted the labels of the distances. I will re-run the embedding and post the fixed image in a bit.

LucaCappelletti94 commented 2 years ago

Hello @cthoyt, sorry for the long delay. I got swamped by PhD-related things. I finally got some time to provide the code and the correct images (as soon as they finish rendering).

So, first, install the latest version of 🍇 with pip install grape. Then you can load this repository's graph with:

from grape import Graph

# Load the (source, edge type, destination) TSV as an undirected graph.
graph = Graph.from_csv(
    edge_path="graph.tsv",
    sources_column_number=0,
    edge_list_edge_types_column_number=1,
    destinations_column_number=2,
    directed=False,
    name="OBO Foundry"
)

Since this graph is relatively small and does not raise any memory concerns, you can enable the time-memory trade-off speedups for faster execution. This primarily means that, instead of the default Elias-Fano data structure, it switches to either a COO or a CSR representation, depending on what a given task needs. By default, calling enable() switches to a CSR.

graph.enable()
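To illustrate the two layouts mentioned above (this is a generic sketch of the standard data structures, not GraPE's internals): COO stores the graph as a flat list of (source, destination) pairs, while CSR groups destinations by source node behind a row-pointer array, which makes neighbor lookups O(1) to locate.

```python
def coo_to_csr(num_nodes, edges):
    """Convert a COO edge list [(src, dst), ...] into CSR form:
    a row-pointer array plus a flat array of destinations."""
    # Count the out-degree of each node, then prefix-sum into row pointers.
    row_ptr = [0] * (num_nodes + 1)
    for src, _ in edges:
        row_ptr[src + 1] += 1
    for i in range(num_nodes):
        row_ptr[i + 1] += row_ptr[i]
    # Place destinations grouped by source node.
    col = [0] * len(edges)
    offsets = row_ptr[:-1]
    for src, dst in sorted(edges):
        col[offsets[src]] = dst
        offsets[src] += 1
    return row_ptr, col

# Toy graph with 3 nodes and COO edges 0→1, 0→2, 2→1.
row_ptr, col = coo_to_csr(3, [(0, 1), (0, 2), (2, 1)])
print(row_ptr)  # → [0, 2, 2, 3]
print(col)      # → [1, 2, 1]
# Node 0's neighbors are col[row_ptr[0]:row_ptr[1]], i.e. [1, 2].
```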

To get the graph report, display the graph in a Jupyter cell or just run print(graph). Note that the report is in HTML, so it is best viewed in a Jupyter notebook.

For the visualization part, you can run:

from grape import GraphVisualizer

GraphVisualizer(graph).fit_and_plot_all("SkipGram")

By default, this will run my implementation of SkipGram. We support a few other embedding algorithms and, most interestingly for scientific reproducibility, we support wrapping arbitrary third-party libraries. For instance, we have support for PyKEEN, and you can run (or, in this case, visualize) quite a few PyKEEN models by executing:

GraphVisualizer(graph).fit_and_plot_all(
    embedding_model="BoxE",
    library_name="PyKeen"
)

If you'd like to learn more 🍇-related things, we have created a Telegram group and a [Discord server](https://discord.gg/Nda2cqYvTN).