RasaHQ / rasa

💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants
https://rasa.com/docs/rasa/
Apache License 2.0

Add Semantic Network Support #513

Closed: vlordier closed this issue 4 years ago

vlordier commented 7 years ago

Looking at https://concept.research.microsoft.com/Home/Introduction, I see how we could benefit greatly from using this dataset to better understand short text within its context.

Any pointers on how I could help integrate it?

wrathagom commented 7 years ago

Can you help me find licensing information for it?

r-wheeler commented 7 years ago

@vlordier @wrathagom

Any thoughts on how best to integrate this data given the current methods of intent classification? It looks like the current SVC classifier considers the average of the word vectors. Would the most logical approach be to add the concepts to the bag of words, on the assumption that this would aid classification by altering the input vector space?

Or is there an alternative approach?

I have been interested in incorporating ontologies into supervised classification a la rasa.

As an example, according to the data:

Facebook -> isa -> social medium
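
As a rough sketch of how those "isa" concepts could be folded into the bag of words (the ISA table and augment_tokens helper below are hypothetical, not anything that exists in rasa_nlu):

    # Hypothetical "isa" lookup, standing in for a real knowledge graph query.
    ISA = {"facebook": "social medium", "ipad": "product"}

    def augment_tokens(tokens):
        """Append the knowledge-graph concept of each known token to the bag of words."""
        return tokens + [ISA[t] for t in tokens if t in ISA]

    print(augment_tokens(["facebook", "is", "down"]))
    # -> ['facebook', 'is', 'down', 'social medium']

The hope would be that the extra concept tokens nudge short texts toward the right region of the input space even when the surface words are sparse.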

One of the Probase papers, for some interesting background and alternative approaches: https://pdfs.semanticscholar.org/dab3/cb9c7ddca956c55bc14bd8052faee16fad6a.pdf

vlordier commented 7 years ago

All I see is this:

References

Please cite following papers if you use our data:

  1. Zhongyuan Wang, Haixun Wang, Ji-Rong Wen, and Yanghua Xiao, An Inference Approach to Basic Level of Categorization, http://research.microsoft.com/apps/pubs/default.aspx?id=255396 in ACM International Conference on Information and Knowledge Management (CIKM), ACM – Association for Computing Machinery, October 2015.
  2. Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Zhu, Probase: A Probabilistic Taxonomy for Text Understanding, http://research.microsoft.com/apps/pubs/default.aspx?id=158737 in ACM International Conference on Management of Data (SIGMOD), May 2012.

Please cite following papers if you use our conceptualization service:

  1. Zhongyuan Wang and Haixun Wang, Understanding Short Texts, http://research.microsoft.com/apps/pubs/default.aspx?id=264862 in the Association for Computational Linguistics (ACL) (Tutorial), August 2016.
  2. Zhongyuan Wang, Haixun Wang, Ji-Rong Wen, and Yanghua Xiao, An Inference Approach to Basic Level of Categorization, http://research.microsoft.com/apps/pubs/default.aspx?id=255397 in ACM International Conference on Information and Knowledge Management (CIKM), ACM – Association for Computing Machinery, October 2015.
  3. Zhongyuan Wang, Kejun Zhao, Haixun Wang, Xiaofeng Meng, and Ji-Rong Wen, Query Understanding through Knowledge-Based Conceptualization, http://research.microsoft.com/apps/pubs/default.aspx?id=245007 in IJCAI, July 2015.
  4. Wen Hua, Zhongyuan Wang, Haixun Wang, Kai Zheng, and Xiaofang Zhou, Short Text Understanding Through Lexical-Semantic Analysis, http://research.microsoft.com/apps/pubs/default.aspx?id=231107 in International Conference on Data Engineering (ICDE), April 2015. (Best Paper Award)
  5. Zhongyuan Wang, Haixun Wang, and Zhirui Hu, Head, Modifier, and Constraint Detection in Short Texts, http://research.microsoft.com/apps/pubs/default.aspx?id=203584 in International Conference on Data Engineering (ICDE), 2014.
  6. Yangqiu Song, Haixun Wang, Zhongyuan Wang, Hongsong Li, and Weizhu Chen, Short Text Conceptualization using a Probabilistic Knowledgebase, http://research.microsoft.com/apps/pubs/default.aspx?id=151341 in IJCAI, 2011.


vlordier commented 7 years ago

I was thinking of something along these lines: https://github.com/mangate/ConvNetSent / https://medium.com/towards-data-science/how-to-do-text-classification-using-tensorflow-word-embeddings-and-cnn-edae13b3e575 Thoughts?
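
For context, a minimal Keras sketch of the CNN-over-word-embeddings approach those links describe (all sizes are placeholders; nothing here is wired into rasa_nlu):

    import numpy as np
    from tensorflow.keras import layers, models

    VOCAB_SIZE, MAX_LEN, EMBED_DIM, NUM_INTENTS = 5000, 20, 100, 10  # placeholder sizes

    model = models.Sequential([
        layers.Embedding(VOCAB_SIZE, EMBED_DIM),
        layers.Conv1D(128, 5, activation="relu"),   # n-gram style filters over word embeddings
        layers.GlobalMaxPooling1D(),
        layers.Dense(NUM_INTENTS, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # Dummy data just to show the expected shapes.
    X = np.random.randint(VOCAB_SIZE, size=(32, MAX_LEN))
    y = np.random.randint(NUM_INTENTS, size=32)
    model.fit(X, y, epochs=1, verbose=0)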

r-wheeler commented 7 years ago

@vlordier

Not the best person to answer as I have just been lurking on issues, but there has been some discussion (in previous threads) about integrating Keras / TF, and it is certainly straightforward enough to hook in a new classifier. The bottleneck tends to be the amount of training data required to train on: 1000+ instances per class.

I was specifically wondering about the best way to incorporate an external knowledge graph with (any) classifier.

tmbo commented 7 years ago

@r-wheeler I think an integration of a knowledge graph might be quite neat. What's the application you had in mind?

r-wheeler commented 7 years ago

Given that the Rasa sklearn pipeline currently uses the average of the word vectors, a first pass might be to add the vectors for the concepts from the knowledge graph to the mix. I.e.:

Given a sentence that contains iPad and Apple, use the knowledge graph to find what the entities "are" in an ontological sense and add those vectors to the mix:

    kb  = a function that returns the most relevant knowledge graph result
    vec = a function to compute the word vector
    kb('ipad')  == 'product'
    kb('apple') == 'company'

    mean(vec(ipad) + vec(kb(ipad)) + vec(apple) + vec(kb(apple)))

Here you would just be hoping this aids classification, and it would be easy to compare models using standard machine learning metrics.
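
A minimal numpy sketch of that first pass, with toy vectors standing in for the pretrained word vectors and a hypothetical kb dict standing in for the knowledge graph lookup:

    import numpy as np

    # Toy stand-ins for pretrained word vectors (e.g. the spaCy vectors the
    # sklearn pipeline already uses).
    word_vec = {
        "ipad":    np.array([0.1, 0.3, 0.5]),
        "apple":   np.array([0.2, 0.1, 0.4]),
        "product": np.array([0.0, 0.4, 0.6]),
        "company": np.array([0.3, 0.2, 0.1]),
    }

    # Hypothetical knowledge-graph lookup: token -> most relevant concept.
    kb = {"ipad": "product", "apple": "company"}

    def featurize(tokens):
        """Average the word vectors together with the vectors of their concepts."""
        vecs = []
        for tok in tokens:
            vecs.append(word_vec[tok])
            if tok in kb:
                vecs.append(word_vec[kb[tok]])
        return np.mean(vecs, axis=0)

    print(featurize(["ipad", "apple"]))  # the feature vector handed to the SVC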

More complex would be to change up the feature engineering / model to not just use word vectors. One idea might be to use something like an LSTM where each timestep is the word vector plus a vector of the most relevant knowledge graph entry, acting as a "context" vector for what the word "is".

Trying to aid with: what helps the classifier when the text is short? What helps the classifier when there are a huge number of intents?
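
A sketch of that more complex variant, assuming the word vector and the knowledge-graph "context" vector are simply concatenated at each timestep (Keras; all dimensions are placeholders):

    import numpy as np
    from tensorflow.keras import layers, models

    MAX_LEN, WORD_DIM, CONCEPT_DIM, NUM_INTENTS = 20, 100, 100, 10  # placeholder sizes

    inputs = layers.Input(shape=(MAX_LEN, WORD_DIM + CONCEPT_DIM))
    x = layers.Masking()(inputs)                 # skip zero-padded timesteps
    x = layers.LSTM(64)(x)
    outputs = layers.Dense(NUM_INTENTS, activation="softmax")(x)

    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # Dummy batch: each timestep = [word vector ; concept vector].
    X = np.random.rand(32, MAX_LEN, WORD_DIM + CONCEPT_DIM)
    y = np.random.randint(NUM_INTENTS, size=32)
    model.fit(X, y, epochs=1, verbose=0)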

vlordier commented 6 years ago

We might need to use an alternative to the MS Concept Graph though, as its license is "research only": ConceptNet might help instead?
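
ConceptNet 5 does expose a public HTTP API (api.conceptnet.io), so a first experiment could pull IsA edges along these lines (a rough sketch; the conceptnet_isa helper is hypothetical, not an existing utility):

    import requests

    def conceptnet_isa(term, lang="en", limit=5):
        """Return IsA concepts for a term from the public ConceptNet API."""
        resp = requests.get(
            "http://api.conceptnet.io/query",
            params={"start": f"/c/{lang}/{term}", "rel": "/r/IsA", "limit": limit},
        )
        return [edge["end"]["label"] for edge in resp.json().get("edges", [])]

    print(conceptnet_isa("ipad"))  # concepts the graph lists iPad as an instance of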


stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.