Suite of Yoda Enhancement Backends

k0105 commented 8 years ago

Hi Petr, Hi Yoda community,

since we need to get some of my additions into Yoda soon I hereby propose the following REST backends:

[X] Distributional Semantics (done): Already contributed (https://github.com/brmson/jobimservice), but I can add word embedding options
[x] Ensemble: Here I can collect additional interfaces to existing or highly specialized QA pipes
[x] Keyphrase Extraction: Keyphrase Extraction via IBM, RAKE and possibly more approaches

I also plan on following Yoda's approach to offer a CLI (run) besides REST (web) in the future, since this should produce less overhead than REST for thousands / gigabytes of papers and books.

Is this useful to you or are there any objections to this / things to improve?

Best wishes, Joe

vineetk1 commented 8 years ago

What is included in "Ensemble" ?

pasky commented 8 years ago

Hi! Please see http://log.or.cz/?p=403 for details of the ensemble.

k0105 commented 8 years ago

What I'm kinda worrying about - REST interfaces are nice and clean, but I'm not sure about performance. The Lucida devs have picked Apache Thrift ( https://thrift.apache.org/ ) instead, another solution I've been thinking about is ZeroMQ. For the ensemble I don't care, because I give it a question, it does its thing, however complex, and returns a comparably small JSON string, but document conversion can yield gigabytes of output for even bigger inputs. So far much preprocessing for Yoda is just Python and I've used Python for a good part of my keyphrase extraction and classification as well, but as much as I like Python we need to be able to share data between components in different programming languages.

For the core stuff Docker + REST is rock-solid, but the more we get into remote territory the more seductive a high-speed alternative becomes. Note that Thrift and ZeroMQ are rather different beasts: The former is high-level and easy to use, whereas the latter is low-level and hence involves more implementation effort, but also enhances chances for getting top-notch performance. Any hints or opinions are highly appreciated - I haven't been able to make up my mind so far. but to be honest: I'm considering ZeroMQ, because it should be more efficient and efficiency is our main concern. My future contributions will be more complex and could be called much more frequently, so I feel there are quite a few options for optimization if we pick 0MQ. Otoh, our complexity is in data structures, not network topology or message patterns, which is why Thrift certainly has its benefits.

But even then: I've never written backends that support both REST and Thrift/ZeroMQ. Are there any design patterns for this or does just using a CLI parameter suffice?

pasky commented 8 years ago

@k0105, thanks a lot for picking this up!!! I've edited the original post in this issue slightly, hope that's ok.

I think this is a really great plan. If we're doing something in YodaQA proper that can be identified as self-contained and well-defined (and often resource-hungry), we are trying to nowadays put it behind a REST API as a standalone service, because (i) it improves YodaQA prototyping by saving resources and load times, (ii) can be reusable in other structures than YodaQA, e.g. we use label-lookup for multiple purposes already. So this is a really great initiative. And the combination with Docker makes this all so much more awesome! We plan to migrate to your Docker setup during the summer at the latest.

Are you planning to also contribute some tools that use the APIs, or do you want to focus just on the API servers? We had a plan for JoBimLAT, do you still aim for that? Re Ensemble, you probably use this as PalmQA - not sure if you plan to open source PalmQA itself, but it would be a great fit for the https://github.com/brmson/hub YodaQA frontend if not (and we'd probably want to merge a lot of PalmQA otherwise).

Re Document Conversion, we'll probably have to put some thought into exactly how the APIs should be structured; maybe a separate issue for ironing out the details of that... (whether the API should be itself submitting things to solr and saving them to some files, or have several APIs, or exactly what the workflow would be; I guess my underlying feeling is that command line tool for the document conversion would be more appropriate, but maybe I've misunderstood the precise idea here)

pasky commented 8 years ago

Re word embeddings - we already have word embedding implementation in YodaQA itself (the GloveDictionary class). So far, we use it for KB property-LAT alignment scoring, though we have tried to prototype a lot more on top of this in different branches. In the longer run, we could outsource this to an RIGHT API, but we already have one - the sts-scoring.py of https://github.com/brmson/dataset-sts - it can wrap complex deep learning models for processing sentences, but also a basic model "avg" which is the equivalent of what we use GloVe dictionary for in YodaQA (so we could in theory completely outsource this to sts-scoring right now; there is just a certain vocabulary-related practical hurdle involved).

pasky commented 8 years ago

Re REST vs. Thrift / ZeroMQ - my immediate opinion would be to primarily build REST APIs, and if we find a performance bottleneck, build an alternate interface that can be used alternatively to REST.

Thoughts behind that: It seems to me that unless we are doing thousands or more requests per second, it's unlikely the REST API would become a bottleneck, and I don't see.either of these APIs a priori being used like this now (once per answer is probably the worst case for anything we consider; we could also perhaps batch the requests asynchronously in a single HTTP call in a fairly straightforward way). Also, REST is immensely practical since it can be used from basically any language, it's easy to do, comprehensible for newbies and also easy to debug (you can even use curl from command line). Plus, REST makes the technology we create here easy to use outside of YodaQA context too - even for example in Javascript apps at random parts of the internet. So even if we'd find out that REST is bottleneck somewhere and moved to prefer ZeroMQ for that purpose, I think there would be value in keeping the REST API too.

You have better practical outlook on this since you already did the work behind these REST APIs, so please feel free to correct me!

k0105 commented 8 years ago

Heh - I've been using gensim's word2vec. Seems like we have this area fully covered.

I fully agree with you on REST - it's there to stay. Only question is whether we want something faster for certain scenarios. I feel a bit stupid to send several gigabytes of papers over loopback and regarding a CLI tool the problem is that we might want to add papers dynamically and from within applications, so a clear API seems more desirable. For practical reasons I might add Thrift and play with 0MQ afterwards. As you know, I have a certain interest to make my stuff easy to use for Lucida.

Btw: The ensemble REST interface is done. And I will play with LatByJBT once writing is done.

I also used IBM's document conversion service via REST [before GA we could use it for arbitrary amounts of data for free] and while it's great for individual files and small corpora, to convert huge corpora I certainly prefer my local code.

pasky commented 8 years ago

In the document import scenario, it'd probably make more sense to me to have REST API + commandline API rather than REST API + ZeroMQ/Thrift. Because in the end, you'll want a commandline-run ZeroMQ/Thrift client anyway. But maybe I don't have all the possible usage scenarios in mind?

Of course, adding papers on the fly would be awesome; but adding a support for this to label-lookup will be a big change (not saying unfeasible! big as in relative to the small existing script :). Actually using the papers for QA will be hopefully more feasible when my deep learning work in dataset-sts comes to further fruition.

pasky commented 8 years ago

Awesome news re the ensemble REST interface! And thanks for the IBM experience snippet.

BTW, I've been exploring https://github.com/claritylabs/lucida a bit and they seem to be focused on using Apache Thrift as an interconnect. From my POV, that's a completely fine motivation for supporting that in addition to REST APIs wherever it makes sense. (But I'd still prefer REST as the primary interface.)

k0105 commented 8 years ago

Well, perfect. Seems like we have a solution then: I'll add a CLI option like Yoda (run / web, where web means REST) and a Thrift option, while 0MQ moves to my "play with it in my freetime" list for special scenarios for now.

The ensemble interface should be easy to extend with this functionality, so we only have to worry about document conversion. As you know, I have code to convert wikis and documents and generate labels, so we only have to figure out what a decent API for this should look like you said before.

IBM's API is very simple: There is essentially one call for conversion that takes "5 parameters" - username and password for authentication, some conversion flags, the file (maximum size 50MB - one of the limitations that bugs me, not that I have too many PDFs that exceed it, but they do exist), the file format (optional, only for cases where the system messes up) and the API version. Error codes are as usual: 200, 201, 400 for invalid parameters, 401 for authentication problems, 404, 500.

We can probably take a similar route, but we need:

File types wiki dump and RDF dump besides PDF and EPUB (I already support all, but used some trivial adjustments by hand)
A function to generate label files (I've implemented one for each label service)
A function for keyphrase extraction (I've implemented / ported one)
A parameter to submit the current file to a certain corpus - this needs authentication, since internet users will submit pirated content otherwise, maybe also a corpus ID, so one can have multiple corpora side-by-side?
Something about language detection (maybe done). As you know, I use language detection to only allow English documents until Yoda is multilingual. Maybe we should just do this by default: If users request a conversion, we just return plaintext without complaining, but for adding anything to a corpus we enforce English for now, because otherwise Yoda breaks down

Any objections or remarks? Otherwise, I'll hack together a document conversion prototype by the end of the month.

pasky commented 8 years ago

Hehe - we actually have a student project https://github.com/brmson/Personal-Assistant - which is something very simple, outsourcing intent to wit.ai and focusing on weather for now, and docs is still work in progress - that uses ZeroMQ, just to add even more variety to all this! I hope that in a few months we'll start converging.

What you outlined sounds good...

What authentication scheme do you have in mind? Seems like everyone loves OAuth2, but I haven't had the time to familiarize myself with it yet...
Multiple corpora and multiple languages sound great, even if YodaQA proper doesn't support them yet, it's good to be future proof since that'll come to YodaQA sooner or later (we already have a partial hackish port to Czech in the d/seznam branch).

In general, it'd be good to think about:

Whether we have some concrete expected uses of the API besides adding to YodaQA's collections; if not, that's fine, but if so, good to take them into account.
What the call flow should be like - upload to a single endpoint and synchronously return all types of replies at once / add to collection, or something else? (It sounds like a good default, I'm not sure if there are reasons to do it otherwise, just confirming.)
Have the REST API as RESTful as possible :) i.e. using the common standards, paths instead of query parameters where it makes sense, HTTP headers and methods properly, etc. But maybe this is stating the totally obvious.

Great prospects!

k0105 commented 8 years ago

So I've arrived in Michigan. We're currently extending the backends. Essentially, they will get a Thrift interface and be encapsulated in dedicated Docker containers. I'll let you know once they are done.

k0105 commented 8 years ago

I'm currently working on setting up a ton of additional QA pipelines and I've actually renamed my current ensemble to webqa in order to write a more involved ensemble. Progress is good, I will report back once it's done. And I have some ideas about how to improve the other services as well.

k0105 commented 8 years ago

Almost done - we now have ensembles for distributional semantics, QA and keyphrase extraction. Just need to wrap the last backends in Docker containers and merge everything into one generic ensemble. Soon.

k0105 commented 8 years ago

OK, my initial backends are done. I have sent you pull requests for Yoda-related Dockerfiles and for code to dynamically change backends in Yoda. You can already find multiple ensemble parts in my repo. I'll add more soon and also invite you to some private repos. For instance, I'll keep my interface to other QA systems private in order not to rub anyone the wrong way and just add you so you can take a look anyways.

pasky commented 8 years ago

Neat! Sorry PR #48 took so long. :( I'll try to be better about that. Since my students were finishing their theses and I was taking a little psychological break from question answering, there wasn't so much movement overall anyway.

So, your ensemble wrapper seems to be at https://github.com/k0105/ensemble ...

k0105 commented 8 years ago

All 3 proposed backends are done, the ongoing integration can now be discussed in their respective threads, which already exist.

brmson / yodaqa

Suite of Yoda Enhancement Backends #43