angelosalatino / cso-classifier

Python library that classifies content from scientific papers with the topics of the Computer Science Ontology (CSO).
https://cso.kmi.open.ac.uk
Apache License 2.0

How to apply to a different ontology/domain? #7

Open innerop opened 5 years ago

innerop commented 5 years ago

Very useful and great work.

How do I use a different ontology from a different domain? I can replicate the format used in the current CS ontology, but what about the cached model? Is that a generalized model, or is it specific to the ontology? If the latter, how do I go about constructing one for a different ontology?

Many thanks and happy to share back the results of my work

EDIT:

Learning about word2vec... but would love to hear from you anyway, if you have any tips or instructions.

Thank you.

angelosalatino commented 5 years ago

Hi, these are very good questions. I will soon write an article/tutorial/guide on my blog on how to apply the classifier to other domains of science. Stay tuned!

innerop commented 5 years ago

@angelosalatino

That would help greatly in adopting and adapting this work.

For now, however, could you please provide the script that generates the token-to-cso-combined file?

The README is clear on what is involved, but looking at the CSO I have no clue what constitutes a "topic". The "words" (1-, 2-, and 3-gram entities) show up in so many places. I also have no idea how to query the CSO properly. Do I use SPARQL? Is this RDF? RDFS? I'm completely new to the format.
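
To make the question concrete: is something like the following the intended way to pull the topic list out of CSO? This is just my guess using rdflib and SPARQL; the file name is a placeholder for whichever dump I download, and the assumption that topics carry an rdfs:label may well be wrong.

```python
# My guess at querying a CSO dump with rdflib + SPARQL.
# "CSO.ttl" is a placeholder file name, and the rdfs:label
# assumption is unverified -- please correct me.
from rdflib import Graph

g = Graph()
g.parse("CSO.ttl", format="turtle")

query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?topic ?label
WHERE { ?topic rdfs:label ?label }
LIMIT 20
"""
for topic, label in g.query(query):
    print(topic, label)
```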

Referring to this passage in README.MD:

To generate this file, we collected all the set of words available within the vocabulary of the model. Then iterating on each word, we retrieved its top 10 similar words from the model, and we computed their Levenshtein similarity against all CSO topics. If the similarity was above 0.7, we created a record which stored all CSO topics triggered by the initial word.
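
If I'm reading that passage right, the script would look roughly like the sketch below. This is my own attempt, not your actual script: it assumes a gensim (4.x API) word2vec model and the python-Levenshtein package, and the model path, topic list, and output file name are placeholders.

```python
# My rough reading of the README description (not the authors' script).
# Assumes gensim 4's KeyedVectors API and the python-Levenshtein package;
# "model.bin", the topic list, and the output name are placeholders.
import json

import Levenshtein
from gensim.models import Word2Vec

model = Word2Vec.load("model.bin")                 # trained word2vec model
cso_topics = ["machine learning", "semantic web"]  # really: all CSO topic labels

token_to_cso = {}
for word in model.wv.index_to_key:                 # every word in the vocabulary
    for similar_word, _cosine in model.wv.most_similar(word, topn=10):
        for topic in cso_topics:
            # Levenshtein.ratio gives a normalised similarity in [0, 1]
            if Levenshtein.ratio(similar_word, topic) > 0.7:
                token_to_cso.setdefault(word, set()).add(topic)

with open("token-to-cso-combined.json", "w") as f:
    json.dump({word: sorted(topics) for word, topics in token_to_cso.items()}, f)
```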

angelosalatino commented 5 years ago

Hi, we wrote an article explaining how you can adopt the CSO Classifier in other fields: https://infernusweb.altervista.org/wp/how-to-use-the-cso-classifier-in-other-domains/

Please do let us know if you need further information.

innerop commented 5 years ago

Thank you. I'll keep you in the loop on how I'm using it, any improvements I can think of, and any further questions.

I managed to find an older version from before you added the cache, and I could see how you're doing the matching against the ontology with the embeddings, so that was very educational. One note, however: the older version only works on Python 3.6, not 3.7 or later. It throws a StopIteration exception from NLTK's util module. That's an issue with Python and NLTK, not with your codebase.
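
For reference, the root cause seems to be PEP 479: since Python 3.7, a StopIteration that escapes a generator is re-raised as a RuntimeError, which is what older NLTK code trips over. A tiny illustration (nothing to do with your code):

```python
# Tiny illustration of the PEP 479 change (unrelated to the CSO code):
# a StopIteration leaking out of a generator ends it silently on 3.6,
# but is re-raised as RuntimeError on Python 3.7+.
def first_elements(iterables):
    for iterable in iterables:
        yield next(iter(iterable))  # raises StopIteration on an empty iterable

try:
    print(list(first_elements([[1, 2], []])))  # Python 3.6 prints [1]
except RuntimeError as exc:
    print("Python 3.7+:", exc)                 # "generator raised StopIteration"
```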

Thank you 🙏 .

innerop commented 5 years ago

@angelosalatino

I looked at the code you shared in the article for generating the file.

I'd like to point out a divergence I see between that code and the description given in the article.

The description says:

"To generate this dictionary/file, we collected all the different words available within the vocabulary of the model. Then iterating on each word, we retrieved its top 10 similar words from the model, and we computed their Levenshtein similarity against all CSO topics. If the similarity was above 0.7, we created a record which stored all CSO topics triggered by the initial word."

But I believe the code does this instead:

"To generate this dictionary/file, we collected all the different words available within the vocabulary of the model. Then iterating on each word, we retrieved its top 10 similar words from the model and put them in a list, which we iterated over. If the cosine similarity for a word in the list was equal to or greater than 0.7, and we computed its Levenshtein similarity against all CSO topics and where that was equal to or above 0.94 we added the topic to a record (or created it if it didn't exist) which stored all CSO topics triggered by the initial word from our model."

angelosalatino commented 5 years ago

Hi, yes. Your explanation is very detailed. We left some details out for the sake of the narrative and referred the reader to the code for further details. But definitely: your description fits 100% with the actual process.

Thanks