bio-ontology-research-group / mowl

mOWL: Machine Learning library with Ontologies
BSD 3-Clause "New" or "Revised" License

It seems that some terms were lost while implementing KGE on Ontologies. #62

Closed. CNwangbin closed this issue 10 months ago.

CNwangbin commented 1 year ago

Describe the bug

When I ran the example TransE code on a custom go.owl file, some terms appear to be missing from the resulting embeddings.

How to reproduce

The mOWL-wrapped code is as follows.

```python
import mowl
mowl.init_jvm("20g")  # start the JVM before importing other mowl modules

import torch as th
from pykeen.models import TransE
from mowl.datasets.base import PathDataset
from mowl.models import GraphPlusPyKEENModel
from mowl.projection import DL2VecProjector

dataset = PathDataset("go_cafa3.owl")

model = GraphPlusPyKEENModel(dataset)
model.set_projector(DL2VecProjector())
model.set_kge_method(TransE, random_seed=42)
model.optimizer = th.optim.Adam
model.lr = 0.001
model.batch_size = 32
model.train(epochs=1)

class_embs = model.class_embeddings
role_embs = model.object_property_embeddings
ind_embs = model.individual_embeddings

# Keep only GO classes, keyed by the last IRI segment (e.g. "GO_0005926").
terms = []
vectors = []
for word in class_embs:
    vector = class_embs[word]
    items = word.split('/')
    if len(items) > 1:
        word = items[-1]
    if word.startswith('GO') and not word.endswith('>'):
        terms.append(word)
        vectors.append(vector)

# The IRI segment uses an underscore, so the colon form 'GO:0005926' could never match.
'GO_0005926' in terms
```

False

But GO_0005926 is present in the OWL file, in a class declaration that ends with `true</owl:deprecated> </owl:Class> ...`.
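To check which classes carry the `owl:deprecated` flag, something along these lines may help. It is a minimal sketch using only the standard library and assumes the ontology is serialized as RDF/XML; the `SAMPLE` document and the `deprecated_classes` helper are made up for illustration and are not part of mOWL.

```python
import xml.etree.ElementTree as ET

OWL = "http://www.w3.org/2002/07/owl#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Tiny made-up RDF/XML document standing in for go.owl (illustration only).
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:about="http://purl.obolibrary.org/obo/GO_0000001"/>
  <owl:Class rdf:about="http://purl.obolibrary.org/obo/GO_0005926">
    <owl:deprecated rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">true</owl:deprecated>
  </owl:Class>
</rdf:RDF>
"""


def deprecated_classes(root):
    """Return the IRIs of owl:Class elements whose owl:deprecated value is 'true'."""
    found = set()
    for cls in root.iter(f"{{{OWL}}}Class"):
        iri = cls.get(f"{{{RDF}}}about")
        flag = cls.find(f"{{{OWL}}}deprecated")
        if iri and flag is not None and (flag.text or "").strip() == "true":
            found.add(iri)
    return found


root = ET.fromstring(SAMPLE)
deprecated = deprecated_classes(root)
print(sorted(iri.rsplit("/", 1)[-1] for iri in deprecated))
```

For a real file, `ET.parse("go.owl").getroot()` would replace `ET.fromstring(SAMPLE)`; note that this loads the whole ontology into memory.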

The same thing occurred with the PyKEEN version of the code:

```python
import mowl
mowl.init_jvm("20g")  # start the JVM before importing other mowl modules

from pykeen.models import TransE
from mowl.datasets.base import PathDataset
from mowl.kge import KGEModel
from mowl.projection import TaxonomyProjector
from mowl.projection.edge import Edge

dataset = PathDataset("go.owl")

proj = TaxonomyProjector(True)
edges = proj.project(dataset.ontology)

# The toy edge list from the documentation would overwrite the projected edges,
# so it is commented out here:
# edges = [Edge("node1", "rel1", "node3"), Edge("node5", "rel2", "node1"), Edge("node2", "rel1", "node1")]

triples_factory = Edge.as_pykeen(edges, create_inverse_triples=True)

pk_model = TransE(triples_factory=triples_factory, embedding_dim=50, random_seed=42)

model = KGEModel(triples_factory, pk_model, epochs=1, batch_size=32)
model.train()
ent_embs = model.class_embeddings_dict
rel_embs = model.object_property_embeddings_dict

# Keep only GO classes, keyed by the last IRI segment (e.g. "GO_0005926").
terms = []
vectors = []
for word in ent_embs:
    vector = ent_embs[word]
    items = word.split('/')
    if len(items) > 1:
        word = items[-1]
    if word.startswith('GO') and not word.endswith('>'):
        terms.append(word)
        vectors.append(vector)

'GO_0005926' in terms
```

False

It can also be observed that when running the code

```python
proj = TaxonomyProjector(True)
edges = proj.project(dataset.ontology)

# Toy edge list from the documentation, commented out so it does not overwrite the projection:
# edges = [Edge("node1", "rel1", "node3"), Edge("node5", "rel2", "node1"), Edge("node2", "rel1", "node1")]

triples_factory = Edge.as_pykeen(edges, create_inverse_triples=True)
```

the log shows "INFO: Number of ontology classes: 50119", but the final len(terms) is only 42819. Is it because the outdated terms were discarded?
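The gap between the two counts can be inspected directly by comparing the ontology's class names against the embedded terms. The snippet below is only a sketch: `missing_terms` and `local_name` are hypothetical helpers, and the two small lists are placeholders for the real class IRIs and the `terms` list built above.

```python
def local_name(iri):
    """Return the last path segment of an IRI, e.g. 'GO_0005926'."""
    return iri.rsplit("/", 1)[-1]


def missing_terms(ontology_classes, embedded_terms):
    """List class names that are declared in the ontology but have no embedding."""
    embedded = set(embedded_terms)
    return sorted(
        local_name(c) for c in ontology_classes if local_name(c) not in embedded
    )


# Size of the gap reported in this issue.
gap = 50119 - 42819
print(gap)  # 7300

# Placeholder data (illustration only).
ontology_classes = [
    "http://purl.obolibrary.org/obo/GO_0000001",
    "http://purl.obolibrary.org/obo/GO_0000002",
    "http://purl.obolibrary.org/obo/GO_0005926",
]
embedded_terms = ["GO_0000001", "GO_0000002"]

missing = missing_terms(ontology_classes, embedded_terms)
print(missing)  # ['GO_0005926']
```

Cross-checking the resulting list against the classes flagged `owl:deprecated` would show whether deprecation accounts for all of the missing terms or only some of them.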

Environment

OS information

NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

Python version

Python=3.8.13

mOWL version

mowl-borg==0.2.0

JDK version

openjdk 17.0.3-internal 2022-04-19

Additional information

If I need to use embeddings for outdated terms, how should I proceed?

ferzcam commented 1 year ago

Hi. Thanks for reporting this issue. I think the outdated terms are not being considered when the graph projection of the ontology is generated. In that case, I can suggest two solutions: (1) modify the ontology beforehand and add the outdated terms yourself, or (2) modify the source code, for which you should look at this line, where the axioms are retrieved. Regarding (1), Section 5.5 here mentions that the boolean value true in the owl:deprecated annotation indicates deprecation, so changing it to false may help. I hope this helps; if you have additional questions, let me know.
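Option (1) does not have to be done by hand. As a rough sketch, assuming the ontology is stored as RDF/XML, the `owl:deprecated` flags could be flipped with the standard library; the `undeprecate` helper and the `SAMPLE` document are hypothetical and only illustrate the idea. This changes the annotation only and does not add any axioms for the affected classes.

```python
import xml.etree.ElementTree as ET

OWL = "http://www.w3.org/2002/07/owl#"

# Tiny made-up RDF/XML document standing in for go.owl (illustration only).
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:about="http://purl.obolibrary.org/obo/GO_0005926">
    <owl:deprecated rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">true</owl:deprecated>
  </owl:Class>
</rdf:RDF>
"""


def undeprecate(root):
    """Set every owl:deprecated value of 'true' to 'false'; return how many changed."""
    changed = 0
    for flag in root.iter(f"{{{OWL}}}deprecated"):
        if (flag.text or "").strip() == "true":
            flag.text = "false"
            changed += 1
    return changed


root = ET.fromstring(SAMPLE)
changed = undeprecate(root)
values = [flag.text for flag in root.iter(f"{{{OWL}}}deprecated")]
print(changed, values)
```

When writing a real file back with `ElementTree.write`, calling `ET.register_namespace` for the `owl` and `rdf` prefixes first keeps the original prefixes; XML comments in the source file are dropped by the default parser.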

CNwangbin commented 1 year ago

Because manually modifying the ontology file is very cumbersome, I think method (2) is more elegant. I changed True to False on line 28, reinstalled the package, and ran my code, but the change does not seem to have taken effect in either the wrapped or the PyKEEN version.

leechuck commented 11 months ago

I am not sure that deprecated classes can meaningfully be integrated. Once deprecated in GO, they will be removed from all axioms, therefore no edges will be created that connect to them. They would be disconnected nodes. Are you sure you want to apply any kind of embedding or learning process to these? There is nothing that can be learned from these classes if they are not used in axioms.
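The point about disconnected nodes can be illustrated with a toy edge list. This sketch assumes the entity vocabulary is derived from the triples, so a class that appears in no edge never receives an embedding; the class names other than GO_0005926 and the `entities_from_edges` helper are made up.

```python
def entities_from_edges(edges):
    """Collect every node that occurs as head or tail of at least one edge."""
    nodes = set()
    for head, _relation, tail in edges:
        nodes.add(head)
        nodes.add(tail)
    return nodes


# Declared classes: three connected ones plus one deprecated class (toy data).
all_classes = {"GO_A", "GO_B", "GO_C", "GO_0005926"}

# Edges projected from axioms; the deprecated class is used in none of them.
edges = [
    ("GO_B", "subclassof", "GO_A"),
    ("GO_C", "subclassof", "GO_A"),
]

entities = entities_from_edges(edges)
isolated = all_classes - entities
print(sorted(isolated))  # ['GO_0005926']
```

Even if such a class were added to the vocabulary by hand, no training triple would ever update its vector, so it would stay at its random initialization.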

CNwangbin commented 10 months ago

> I am not sure that deprecated classes can meaningfully be integrated. Once deprecated in GO, they will be removed from all axioms, therefore no edges will be created that connect to them. They would be disconnected nodes. Are you sure you want to apply any kind of embedding or learning process to these? There is nothing that can be learned from these classes if they are not used in axioms.

Thanks.