brmson / yodaqa

A Question Answering system built on top of the Apache UIMA framework.
http://ailao.eu/yodaqa

Distributional Semantics #27

Open k0105 opened 8 years ago

k0105 commented 8 years ago

Background

I've suggested integrating distributional semantics into the YodaQA pipeline using JoBim Text (JBT), a framework developed by TU Darmstadt (in Germany) and IBM that is also used for domain adaptation in Watson. It provides a way to acquire domain knowledge in an unsupervised way. For instance, UMLS, a huge ontology in medicine and frankly one of the most extensive ontologies I've worked with, covers most concepts in medicine but still misses quite a few relations. With distributional semantics such an ontology can be "completed", the need for knowledge engineering is heavily reduced, and it even scales better than conventional triple stores.

Work already done

I've looked into JBT and can now generate models for my own corpora, which lets me compute similar terms, contexts and even labelled sense clusters based on Hearst patterns found by applying UIMA Ruta. Furthermore, Dr. Riedl, one of the main developers, has kindly agreed to provide their Wikipedia Stanford model, which saved us a lot of computation time. Additionally, we could use a web service they offer, which also features a Wikipedia trigram model.

Example

Let's, for instance, look up the word "exceptionally": the framework recognizes that "exceptionally#RB" is similar to terms like "extremely#RB", "extraordinarily#RB", "incredibly#RB", "exceedingly#RB", "remarkably#RB", etc. It can provide accurate counts for these terms, and it can provide context to, e.g., distinguish "cold", the disease, from "cold", the sensation. And finally we can group these interpretations: the trigram output actually distinguishes the sense as in "[extremely, unusually, incredibly, extraordinarily, ..." from the sense as in "[attractive, intelligent, elegant, ...", which is quite clever imho.
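For illustration, here is a minimal Java sketch of what such a similar-terms lookup against a JBT-style REST endpoint could look like. The host, path, query parameters and response shape are assumptions made up for this example, not the actual JBT web service API:

```java
// Minimal sketch (not the real JBT API): query a hypothetical similar-terms
// endpoint for "exceptionally#RB" and print the raw response.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class JbtSimilarTermsDemo {
    public static void main(String[] args) throws Exception {
        String term = URLEncoder.encode("exceptionally#RB", StandardCharsets.UTF_8.name());
        // Hypothetical endpoint; the real service URL and parameters may differ.
        URL url = new URL("http://localhost:8080/jobim/similar?term=" + term + "&limit=10");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
        // Illustrative output shape only:
        // {"term": "exceptionally#RB",
        //  "similar": ["extremely#RB", "extraordinarily#RB", "incredibly#RB", ...]}
    }
}
```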

What now?

So, there are a couple of things JBT can be used for. The most prominent example is TyCor: Expand the concept to infer type constraints and match those to the LAT. That's why I already asked whether it makes sense to add the functionality to cz.brmlab.yodaqa.analysis.tycor.LATMatchTyCor.java first.

But even more important than any particular use case might be making JBT generally available to the pipeline. When we discussed scaling Yoda here: https://github.com/brmson/yodaqa/issues/21#issuecomment-170403540, Petr mentioned that he strives to encapsulate computationally intensive tasks behind REST interfaces (essentially microservices, I guess). Watson uses distributional semantics all over its pipeline, and some benefits might only become visible once the pipeline is extended with domain knowledge. Hence, I suggest making JBT available as another data backend, just like Freebase, DBpedia and enwiki, before using it in any particular stage of the pipeline. We can then try it in various places and see where we obtain better results. I would also write a detailed README so people can get up to speed quickly.

I started this thread a) to track progress and b) to ask for comments. Does anyone have additional ideas where or how to use JBT in YodaQA? Do my ideas make sense or can you think of a better approach?

Best wishes, Joe

pasky commented 8 years ago

Awesome, thanks a lot for starting this as a GitHub issue now. We certainly have a lot to talk about here; let me try to sort that out a little:

JBT Provider

We need to implement a JBT RESTful microservice that does the heavy lifting, keeps stuff loaded in memory, etc. I guess this should at least partially already exist within JBT, as they offer a web interface; if we can just reuse that, that's the best option. In cz.brmlab.yodaqa.provider, we'll probably need a subpackage .jbt or something, with classes that expose this to the rest of YodaQA, possibly with some caching later on.
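To make that concrete, here is a minimal sketch of what such a provider class could look like. The package follows the .jbt suggestion above, but the class, interface and method names are assumptions, not existing YodaQA code; the REST call itself is hidden behind a tiny interface and results are memoized per term:

```java
package cz.brmlab.yodaqa.provider.jbt;

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of a provider that exposes JBT similar-terms lookups to the pipeline. */
public class JBTSimilarityProvider {
    /** Abstracts the actual HTTP call to the JBT microservice. */
    public interface JBTClient {
        List<String> similarTerms(String term, int limit);
    }

    private final JBTClient client;
    private final Map<String, List<String>> cache = new ConcurrentHashMap<>();
    private final int limit;

    public JBTSimilarityProvider(JBTClient client, int limit) {
        this.client = client;
        this.limit = limit;
    }

    /** Returns distributionally similar terms, caching responses per term. */
    public List<String> getSimilarTerms(String term) {
        return cache.computeIfAbsent(term, t -> client.similarTerms(t, limit));
    }
}
```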

Overall, this should be reasonably trivial? It's fine by me to just implement whatever we need for initial usage within the pipeline; we don't have to cover all the features at once (that might not even be 100% desirable from a dead-code perspective).

JBT for LATs

Equally importantly, we want to scout for ways to use JBT in YodaQA. I completely agree with you that using JBT for smarter type coercion is the best first application! It should be easy to do: it's a pretty well-defined task and could have a nice impact.

I'll elaborate in a followup comment.

JBT - other usages

Other ideas for using JBT, sorted roughly by difficulty I guess:

There are surely some much more sophisticated uses for JBT; these are just initial ideas, without having much experience with it yet.

pasky commented 8 years ago

JBT for LATs

So, to recapitulate how LAT tycor works:

Now, the most obvious+easy way to add JBT to the mix is to create and employ an analog of tycor.LATByWordnet that would add the generalized LATs based on JBT. This should be quite straightforward, I guess.

However, of course there may be a lot of generalized LATs generated by JBT, quite a lot more than from Wordnet. (It'd be interesting to see that: how many do we have for, say, nouns like "novelist" or "microplanet"? What if we take the top 5?) In that case, we would instead want to create an analog of LATMatchTyCor that just looks at the LATs and internally crosschecks them without saving the full lists.

But on the other hand, I don't really see big harm in storing even 100 LATs in the CandidateAnswerCAS; if it stirs up trouble, we can improve on that later. So maybe we could just prototype by implementing LATByJBT as an analog of LATByWordnet, adding it to the pipeline in the same places, and we should immediately see some action?
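To illustrate the LATByJBT idea, here is a rough, non-UIMA sketch of the core logic only; all class and field names are assumptions (a real implementation would operate on the CAS the way LATByWordnet does), and taking the top N similar terms with a small specificity penalty is just one plausible choice:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

public class LATByJBTSketch {
    /** Minimal stand-in for a generated LAT: surface text plus a specificity score. */
    public static class GeneralizedLAT {
        public final String text;
        public final double specificity;
        public GeneralizedLAT(String text, double specificity) {
            this.text = text;
            this.specificity = specificity;
        }
    }

    /** Fetches (term, limit) -> similar terms, e.g. backed by the JBT REST provider. */
    private final BiFunction<String, Integer, List<String>> similarTerms;
    private final int topN;

    public LATByJBTSketch(BiFunction<String, Integer, List<String>> similarTerms, int topN) {
        this.similarTerms = similarTerms;
        this.topN = topN;
    }

    /** Generalize one LAT word, e.g. "novelist#NN", into up to topN JBT-based LATs. */
    public List<GeneralizedLAT> generalize(String latWord, double baseSpecificity) {
        List<GeneralizedLAT> out = new ArrayList<>();
        List<String> similar = similarTerms.apply(latWord, topN);
        for (int i = 0; i < similar.size() && i < topN; i++) {
            // Strip a possible POS suffix like "#NN" and lower the specificity a bit,
            // so JBT-derived LATs weigh less than the original LAT during tycor.
            String text = similar.get(i).replaceAll("#\\w+$", "");
            out.add(new GeneralizedLAT(text, baseSpecificity - 1.0 - 0.1 * i));
        }
        return out;
    }
}
```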

k0105 commented 8 years ago

Thank you very much for your replies. As a result of our conversation: I will build a JBT REST interface that is also able to use TU Darmstadt's web service and document it in such a way that people can easily apply it to custom corpora. I will also have to finish a classifier and a new rule-based system for my project, but you can expect results in one to two weeks.

Hopefully right after that I will pick up your idea of LATByJBT (which I like a lot) and try an example. I will have to see how much time I have left - right now it seems like I can work on this for 2 weeks in total. [Worst case: I have to take a break to finish my thesis and continue 6 weeks later. But the backend will definitely be done before that, so you could play with JBT in the meantime if you want.]

I'll report back as soon as the backend is done and let you know about my schedule for the remaining work.

pasky commented 8 years ago

Awesome, I'm looking forward to that, thanks!

k0105 commented 8 years ago

Just for the record: I just sent you a prototype of the REST interface.

I will now focus on writing a paper (primarily - feel free to contact me any time), which should take approximately 4 weeks (with some related work) and after that I will be back on scaling and distributional semantics.

vineetk1 commented 8 years ago

JBT looks promising, and I would start using it when it is generally available for YodaQA. What will the data backend consist of? Will it have the Stanford Wikipedia model? Will it also have the trigram model?

k0105 commented 8 years ago

It already supports both. We are just discussing a minor detail about the return values, but the backend should be available very soon.

Update: The functionality is done, but we agreed on providing JSON return values as well. Since I just killed the development system, I'll have to set up my databases again (a good way to verify the instructions), add this, and then you'll find a dedicated brmson repository for the JoBim Text backend, probably by the end of this week. I'll post another reply then, so you'll get notified when it's done.

Update 2: I've just sent Petr the updated version of the REST service [18.1.'16, 20:20].
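As a purely illustrative aside on the JSON return values: the exact format is whatever the service ends up defining, but on the Java side a response could be consumed roughly like this (field names are assumptions, and Gson is just one possible JSON library):

```java
import com.google.gson.Gson;
import java.util.List;

public class JbtJsonExample {
    static class SimilarTerm {
        String term;
        double score;
    }
    static class SimilarTermsResponse {
        String term;
        List<SimilarTerm> similar;
    }

    public static void main(String[] args) {
        // Hypothetical response body; the real field names may differ.
        String json = "{\"term\":\"exceptionally#RB\","
                + "\"similar\":[{\"term\":\"extremely#RB\",\"score\":0.9}]}";
        SimilarTermsResponse resp = new Gson().fromJson(json, SimilarTermsResponse.class);
        System.out.println(resp.term + " -> " + resp.similar.get(0).term);
    }
}
```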

k0105 commented 8 years ago

Note for later: https://github.com/brmson/yodaqa/issues/30 should have synergies.

pasky commented 8 years ago

I haven't reviewed the code in detail or set up the endpoint yet (maybe I'll have to swap MySQL/MariaDB for SQLite in the process, maybe not), but in order to keep the momentum, I've already pushed this out as https://github.com/brmson/jobimservice ! Thanks a lot for contributing this.

k0105 commented 8 years ago

I don't think replacing MySQL with SQLite is feasible for several reasons:

PS: Dr. Martin Riedl has pointed out that it is fairly easy to switch the API to any database that can be used via JDBC by simply adapting the SQL commands in the configuration. I can confirm this - I had to change some commands to switch from MySQL to MariaDB and that was, as expected, trivial. From a technical perspective it's easy to do; one just has to be sure that this is really the right solution. If anyone goes over these points and spots no problem for their scenario, switching to SQLite should be straightforward.
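For what it's worth, here is a minimal JDBC sketch of why the switch is mostly a configuration matter; the connection URL, credentials, and the table and column names in the SQL are made up for illustration and would come from the service configuration in practice:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SimilarTermsQuery {
    public static void main(String[] args) throws Exception {
        // Swap this URL (and, if needed, the SQL dialect) to move between
        // MySQL, MariaDB or SQLite, e.g. "jdbc:sqlite:/data/jbt.db".
        String jdbcUrl = "jdbc:mariadb://localhost:3306/jobimtext";
        String sql = "SELECT term2, score FROM similar_terms"
                + " WHERE term1 = ? ORDER BY score DESC LIMIT 10";
        try (Connection conn = DriverManager.getConnection(jdbcUrl, "jbt", "secret");
             PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setString(1, "exceptionally#RB");
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("term2") + "\t" + rs.getDouble("score"));
                }
            }
        }
    }
}
```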

pasky commented 8 years ago

For me, the motivation is that I already have a bunch of things in my existing MySQL instance and would prefer all the YodaQA-related databases running standalone and on an SSD-backed store. But it's no big deal, and all your points make sense too! So I'll just import it into my MySQL/MariaDB instance; let's see how that goes.

vineetk1 commented 8 years ago

@jbauer180266 Thanks for your help in installing JoBim. I have written a wiki page on How to install JoBimText.

k0105 commented 8 years ago

I should point out that my ensemble for distributional semantics now supports GloVe and word2vec besides JoBim Text. I won't have time to play with integrating it into Yoda for the next 2.5 months, but the functionality is there. It should make a nice paper, so I will likely do it one day, but please feel free to steal it from me. If someone starts on this before me, please just let me know so we don't duplicate efforts.