bio-ontology-research-group / mowl

mOWL: Machine Learning library with Ontologies
BSD 3-Clause "New" or "Revised" License

Is there any way to generate embeddings from my own corpus, not the built-in dataset? #27

Closed CNwangbin closed 2 years ago

CNwangbin commented 2 years ago

I want to use my own data as the corpus.

ferzcam commented 2 years ago

Hi, you can bring your own .owl files and turn them into mOWL datasets using:

from mowl.datasets.base import PathDataset
ds = PathDataset("training_ontology.owl", "validation_ontology.owl", "testing_ontology.owl")

The validation and testing owl files are optional. For more details on how to add information to an ontology please refer to this example.

CNwangbin commented 2 years ago

Yes, thanks. I found that.

CNwangbin commented 2 years ago

I have a new question. After training, I get an embeddings generator in which each vector corresponds to a certain class. How do I get the order of the classes that matches the order of the embeddings generator?

ferzcam commented 2 years ago

The embeddings are usually stored as a dictionary of the form class name -> embedding vector. This applies to methods such as Word2Vec. Are you using that one or a different one?

CNwangbin commented 2 years ago

Yes, I found that in the tutorial examples. The new question is how to find the order of the classes that corresponds to the order of the embedding generator.

ferzcam commented 2 years ago

You can access the classes in the dataset with:

# `dataset` is assumed to be any mOWL dataset, e.g. the PathDataset created above
classes = dataset.classes.as_str

That is a list of classes from which you can generate a dictionary:

class_to_id = {v:k for k,v in enumerate(classes)}
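
As a minimal sketch of that mapping, the snippet below uses a short hard-coded list as a stand-in for `dataset.classes.as_str`; the three IRIs and the `id_to_class` helper are illustrative assumptions, not part of mOWL.

```python
# Stand-in for `dataset.classes.as_str`; these IRIs are illustrative only.
classes = [
    "http://purl.obolibrary.org/obo/GO_0000001",
    "http://purl.obolibrary.org/obo/GO_0000002",
    "http://purl.obolibrary.org/obo/GO_0000003",
]

# Map each class name to its position in the list.
class_to_id = {name: idx for idx, name in enumerate(classes)}

# Inverse lookup (hypothetical helper): position -> class name.
id_to_class = dict(enumerate(classes))

print(class_to_id["http://purl.obolibrary.org/obo/GO_0000002"])  # 1
print(id_to_class[0])
```

These indices follow the order of the class list. As the rest of this thread shows, that order need not match the rows of a trained embedding model.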

Please let me know if this helps.

CNwangbin commented 2 years ago

Yes, thanks. But one question remains. The number of classes should equal the number of vectors. Is there a mistake somewhere? (image)

CNwangbin commented 2 years ago

This is my code:

import mowl
mowl.init_jvm("4g")
from mowl.datasets.base import PathDataset

ds = PathDataset("go.owl")

from mowl.projection.dl2vec.model import DL2VecProjector
projector = DL2VecProjector(bidirectional_taxonomy=True)
edges = projector.project(ds.ontology)

from mowl.walking.factory import walker_factory
walker = walker_factory("deepwalk", alpha=0.1, walk_length=10, num_walks=10, outfile="data/walks/walk.txt")
walker.walk(edges)

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

corpus = LineSentence(walker.outfile)

w2v_model = Word2Vec(corpus, sg=1, min_count=1, vector_size=10, window=10, epochs=10, workers=16)

ferzcam commented 2 years ago

There is no mistake in your code, but some classes in the ontology are not captured by the projection method (DL2VecProjector) because they are obsolete, deprecated, or part of axioms that the projection method cannot process.

As an example, check the class http://purl.obolibrary.org/obo/GO_1901916, which is part of ds.classes.as_str but not part of w2v_model.wv (I tried with the latest version of GO at this time) and appears as OBSOLETE in the go.owl file.

Including obsolete classes in dataset.classes.as_str is considered a bug and will be fixed in future versions.
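
One way to see which classes were dropped is to compare the two collections directly. The sketch below uses small hard-coded stand-ins. In the code from this thread, `dataset_classes` would be `ds.classes.as_str` and `embedded_keys` would be `w2v_model.wv.key_to_index`. The variable names and the first two IRIs are assumptions for illustration.

```python
# Stand-ins: in the thread these would come from `ds.classes.as_str`
# and `w2v_model.wv.key_to_index` respectively.
dataset_classes = [
    "http://purl.obolibrary.org/obo/GO_0000001",
    "http://purl.obolibrary.org/obo/GO_0000002",
    "http://purl.obolibrary.org/obo/GO_1901916",  # reported as obsolete in this thread
]
embedded_keys = {
    "http://purl.obolibrary.org/obo/GO_0000001": 0,
    "http://purl.obolibrary.org/obo/GO_0000002": 1,
}

# Classes present in the dataset but absent from the embedding vocabulary.
missing = [c for c in dataset_classes if c not in embedded_keys]

print(len(dataset_classes), len(embedded_keys))
print(missing)
```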

If you need further help, please let us know. Thanks.

CNwangbin commented 2 years ago

Thanks, that's clear to me. But I want the order of the classes in w2v_model.wv, not only the numerical vectors. Is there a way to get it now?

ferzcam commented 2 years ago

Since you are working with Gensim's Word2Vec model, would w2v_model.wv.key_to_index work?
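
For reference, in Gensim 4 `key_to_index` maps each key to its row index, and `index_to_key` lists the keys in row order. The sketch below shows how the two relate. To stay runnable without Gensim or a trained model, it uses `TinyVectors`, a hypothetical stand-in that mimics only those two attributes and `wv[key]` lookup. `class_vector_pairs` and the keys `GO_B` and `GO_A` are also made up for illustration.

```python
class TinyVectors:
    """Hypothetical stand-in exposing the KeyedVectors attributes used below."""

    def __init__(self, keys, vectors):
        self.index_to_key = list(keys)
        self.key_to_index = {k: i for i, k in enumerate(self.index_to_key)}
        self.vectors = [list(v) for v in vectors]

    def __getitem__(self, key):
        # Look up the row that belongs to this key.
        return self.vectors[self.key_to_index[key]]


def class_vector_pairs(wv):
    """Return (class, vector) pairs in the model's own row order."""
    return [(key, wv[key]) for key in wv.index_to_key]


wv = TinyVectors(["GO_B", "GO_A"], [[0.1, 0.2], [0.3, 0.4]])
pairs = class_vector_pairs(wv)

# Sorting the keys by their index recovers the row order.
ordered = sorted(wv.key_to_index, key=wv.key_to_index.get)

print(ordered)
print(pairs[0])
```

With a real model, passing `w2v_model.wv` to `class_vector_pairs` should work the same way, assuming Gensim 4 attribute names.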

CNwangbin commented 2 years ago

Thank you, ferzcam. It works well. Is this the right way to get (class, vector) pairs?

image

ferzcam commented 2 years ago

I think that is right. There is also this way:

# `class` is a reserved word in Python, so the loop variable is named `cls`
for cls in w2v_model.wv.index_to_key:
    vector = w2v_model.wv[cls]
    print(cls)
    print(vector)

CNwangbin commented 2 years ago

OK, that is good.