brmson / yodaqa

A Question Answering system built on top of the Apache UIMA framework.
http://ailao.eu/yodaqa

Domain Adaptations #17

Open astrung opened 9 years ago

astrung commented 9 years ago

I have read all the README.md files and all the papers, but I cannot find any instructions or tutorials for building a simple application (even though I know some of the general steps: question analysis, answer producers, ...). If I have created some data (unstructured and structured) and some natural language processing models (NER, POS, ...), how do I use them with YodaQA? Can you explain this to me or write a tutorial covering it all?

pasky commented 9 years ago

Hi, unfortunately there is currently no established process for this - just a theoretical possibility that it could be done relatively painlessly. Some pioneers need to explore this further personally. :-)

If you want to add a new corpus of knowledge to YodaQA, here are some steps I recently explained in an email to someone:


First, I recommend skimming over doc/UIMA and doc/HIGHLEVEL. If you've read some paper about YodaQA, doc/HIGHLEVEL will just repeat that in a different way, but its terminology should be more consistent with what's used in the code itself.

The main pipeline is built in cz.brmlab.yodaqa.pipeline.YodaQA. The pipeline is quite long, but the essential part (from your point of view) is the "answer production" stage.

First, it'd be good to know more about what kind of corpus you are adding. Is it a text corpus? Is it document-subject-oriented, or something like textbooks? Do you want to replace enwiki or add another corpus alongside it?

The easiest thing to do is to just import it into Solr (see data/enwiki/README for some instructions) and modify the endpoint URL variable near the top of cz.brmlab.yodaqa.pipeline.YodaQA to use this instead of enwiki.
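To illustrate the kind of change (this is a hypothetical sketch, not the actual variable name or URL in cz.brmlab.yodaqa.pipeline.YodaQA - adapt it to what you find there):

```java
// Hypothetical sketch only - the real endpoint variable sits near the top of
// cz.brmlab.yodaqa.pipeline.YodaQA and is named differently; this just shows
// the kind of one-line swap involved.
public final class SolrEndpointSketch {
    // Original enwiki-style endpoint (placeholder URL):
    // public static final String SOLR_URL = "http://enwiki-host:8983/solr";
    // Swapped to your own imported corpus:
    public static final String SOLR_URL = "http://localhost:8983/solr/mycorpus";

    public static void main(String[] args) {
        System.out.println("Answer production will search: " + SOLR_URL);
    }
}
```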

However, with enwiki we rely heavily on the fact that there is one document per subject. There are two packages for enwiki search, cz.brmlab.yodaqa.pipeline.solrfull and .solrdoc (more on these below).


If you have some linguistic models trained for your purposes, I think you should be able to specify them as resources to the DKPro annotators (or just replace the annotators with custom ones).
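For instance, with uimaFIT that might look roughly like the following - the annotator class is just an example and the model path is a placeholder; substitute whichever DKPro annotators the pipeline actually instantiates:

```java
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.fit.factory.AnalysisEngineFactory;

import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTagger;

public class CustomModelExample {
    /** Builds a POS tagger description that uses a domain-specific model.
     *  "/models/my-domain-pos.bin" is a hypothetical path. */
    public static AnalysisEngineDescription customPosTagger() throws Exception {
        return AnalysisEngineFactory.createEngineDescription(
                OpenNlpPosTagger.class,
                OpenNlpPosTagger.PARAM_LANGUAGE, "en",
                OpenNlpPosTagger.PARAM_MODEL_LOCATION, "/models/my-domain-pos.bin");
    }

    public static void main(String[] args) throws Exception {
        AnalysisEngineDescription desc = customPosTagger();
        System.out.println("Created: " + desc.getAnalysisEngineMetaData().getName());
    }
}
```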

I realize this is not a detailed technical guide - it's currently not something for which we'd have a walkthrough, but with some cooperation with others, we could turn it into such a thing.

astrung commented 8 years ago

I have document-subject-oriented data and a relational, subject-oriented database (which I created myself, about a specific domain). Now I want to replace Freebase and the wiki (all databases, if possible). So do we only change the databases, or do we also need to change the analysis engines or the CAS structure?

pasky commented 8 years ago

In that case, it should be enough to just change the databases for starters. Let us know how it went!

astrung commented 8 years ago

But will question analysis extract things correctly for a specific domain?

pasky commented 8 years ago

That's a good question. In principle, it will be doing something - but of course, it may not be perfect at recognizing named entities. Later, you can improve this using your own NER model. Also, entity linking will work better if you load your list of entities into a lookup service.

But I think it's best to do this gradually and at the beginning just swap the knowledge bases for answer production.

Another thing that will help later is typing candidate answers. Here is an example of recognizing biomedical terms (like protein names) using GeneOntology: https://github.com/brmson/yodaqa/commit/7a1389c06b75e463797dc6fc91a336647caad21a

k0105 commented 8 years ago

Let me pick up this thread. I was the one Petr sent the explanation to. So far I have primarily been working on Watson (which turned out to be a good choice, considering they will shut down the current example corpora for medicine and travel, as well as the corresponding QA API, in a month - so I got to play around with them) and my frontend, but I have also built a local mini-cluster with an octo-core and two quad-core boxes with 24, 20 and 8GB of RAM respectively. They run YodaQA completely offline, and I replaced the enwiki Solr DB with an example corpus [plaintext papers, not TODs (Title Oriented Documents)], which doesn't work at all :grin:

So over the next weeks, and especially early next year, I will try to make it useful. If someone has additional ideas about what to do next, I'm all ears. @astrung I'm also interested in your progress - have you been able to make any on this over the last month? Is there anything you'd want to team up for? It seems like we might have some common goals... @pasky I'll send you a report real soon. Ideally, you'll have it when you wake up in about 3 hours. [Update: Done.]

Oh yeah, one more thing: creating the Freebase node has now been running for 29 days on an i7 quad-core with 24GB RAM. It still seems to progress steadily, though (1.25 billion DB entries so far). Should I worry? Everything else took more or less exactly the time specified in the READMEs, but this one is just through the roof.

Best wishes, Joe

pasky commented 8 years ago

Hi! Are you importing Freebase to Fuseki? Is it stuck on CPU or IO? Is the rate constant or slowing down? It's surprising that it's taking so long. Are you in the initial import phase or indexing phase? We may want to start another issue to discuss that, though.

k0105 commented 8 years ago

Update: the Freebase setup is solved, cf. https://github.com/brmson/yodaqa/issues/26

k0105 commented 8 years ago

While I haven't worked on this for a while, I now have almost exactly 5000 papers collected (and converted) for "my" domain. With the REST backend functional [did you already have time to look at it?], my new knowledge of how JBT works, and 4 weeks of unassigned time until my deadline, I keep asking myself how far I could get with the domain adaptation. I'm playing with a few ideas from previous discussions:

a) Term extraction to replace the label matching. [I currently only have a glossary I wrote myself with 300 terms.] Could even do this with a UIMA pipeline.
b) Disabling the headline strategy in cz.brmlab.yodaqa.pipeline.solrfull, because papers aren't TODs.
c) LATByJBT as an analog to LATByWordnet.

Do you think this would make sense (in this order)? Is this the correct bottom line and prioritization from our other discussions?

PS: Can be skipped - just some other stuff I did since Monday. [[I also did some little fun tests in the meantime - I wrote a random forest classifier that classifies questions into domain-specific vs. general (93% with 100-fold cross-validation; my supervisor was quite dissatisfied, however, because I only have 500 questions in my dataset so far, and made me scrap it). Furthermore, I wrote a web interface, which is in a pre-alpha state (only a search field and button, an answer list with confidences, and an animated side panel that shows details; it has nothing to do with the YodaQA integration we threw together in 5 minutes, which I sent you, btw).]]

pasky commented 8 years ago

> While I haven't worked on this for a while, I now have almost exactly 5000 papers collected (and converted) for "my" domain. With the REST backend functional [did you already have time to look at it?], my new knowledge of how JBT works, and 4 weeks of unassigned time until my deadline, I keep asking myself how far I could get with the domain adaptation. I'm playing with a few ideas from previous discussions:

> a) Term extraction to replace the label matching. [I currently only have a glossary I wrote myself with 300 terms.] Could even do this with a UIMA pipeline.

I'm not quite sure what you mean here, sorry.

> b) Disabling the headline strategy in cz.brmlab.yodaqa.pipeline.solrfull, because papers aren't TODs.

So, for TODs, we have the first-sentence strategy in .solrfull plus the .solrdoc strategy, and we need to disable these two for a non-TOD scenario, that's right.
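Just to illustrate the shape of the change - the producer classes below are dummies, not the real YodaQA answer producers - conditionally assembling the aggregate with uimaFIT would look roughly like this:

```java
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
import org.apache.uima.fit.factory.AggregateBuilder;
import org.apache.uima.fit.factory.AnalysisEngineFactory;
import org.apache.uima.jcas.JCas;

public class NonTodPipelineSketch {
    /* Stand-ins for the real .solrfull first-sentence and .solrdoc producers. */
    public static class HeadlineProducer extends JCasAnnotator_ImplBase {
        @Override public void process(JCas jcas) { /* TOD-only strategy */ }
    }
    public static class PassageProducer extends JCasAnnotator_ImplBase {
        @Override public void process(JCas jcas) { /* works for any corpus */ }
    }

    public static AnalysisEngineDescription build(boolean corpusIsTod) throws Exception {
        AggregateBuilder builder = new AggregateBuilder();
        builder.add(AnalysisEngineFactory.createEngineDescription(PassageProducer.class));
        if (corpusIsTod) {
            // Only TOD corpora get the headline / whole-document strategies.
            builder.add(AnalysisEngineFactory.createEngineDescription(HeadlineProducer.class));
        }
        return builder.createAggregateDescription();
    }
}
```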

> c) LATByJBT as an analog to LATByWordnet.

Totally, I'm very curious about how this would work in practice.

I'm not sure what the best way to organize things is. Right now, for various "domains" of YodaQA we have separate branches in the d/ namespace (d/movies and d/live are actively maintained). So the easiest way would probably be to make a d/custom branch that would have:

Does that make sense?

OTOH we want (c) in master as well.

(In the long run, cross-merging all the branches is kind of a pain, especially if you develop against one of the d/ branches, as has been very common for me on d/movies lately - I test on top of d/movies, then cherry-pick back to master, eww. We could also start tracking multiple scoring models and have some config files, but that'd be more effort.)

k0105 commented 8 years ago

I'll need some time to think about everything else you said. But about (a): I thought that having some keyphrases should be helpful for QA, since for a new domain we don't know which words or multiwords describe the relevant concepts. I then tried to write down a list of key concepts myself, got about 300 concepts, and saw that this isn't enough. Hence, in order to acquire this vocabulary automatically, I wrote a keyphrase extraction pipeline: I convert my documents into decent plaintext strings, make sure they are English by applying some language detection, then extract keyphrases with one of several backends (currently AlchemyAPI or a local implementation), and then do some cleanup and majority voting to find out which keyphrases are relevant for the entire corpus. This way I get multiple megabytes' worth of keyphrases out of my 5000 documents. For instance, for e-learning papers I would get concepts like "distance learning", "double loop learning", "LaaN theory", "educational data mining" etc. That should be helpful for domain adaptation, right?
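To make the corpus-level voting step concrete, here is a rough sketch of what I mean (per-document extraction is assumed to have happened already; the class name and the 0.5 threshold are purely illustrative):

```java
import java.util.*;
import java.util.stream.Collectors;

public class KeyphraseVoting {
    /**
     * Keeps only keyphrases that occur in at least minDocFraction of all
     * documents - a crude "majority vote" over the per-document extractions.
     */
    public static List<String> vote(List<Set<String>> keyphrasesPerDoc,
                                    double minDocFraction) {
        Map<String, Long> docFreq = keyphrasesPerDoc.stream()
                .flatMap(Set::stream)
                .map(k -> k.toLowerCase(Locale.ROOT).trim())
                .collect(Collectors.groupingBy(k -> k, Collectors.counting()));
        long threshold = Math.round(minDocFraction * keyphrasesPerDoc.size());
        return docFreq.entrySet().stream()
                .filter(e -> e.getValue() >= threshold)
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Set<String>> docs = Arrays.asList(
                new HashSet<>(Arrays.asList("distance learning", "LaaN theory")),
                new HashSet<>(Arrays.asList("distance learning", "educational data mining")),
                new HashSet<>(Arrays.asList("distance learning", "double loop learning")));
        System.out.println(vote(docs, 0.5));  // -> [distance learning]
    }
}
```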

My first idea was to put these concepts into a label service like the fuzzy label lookup service you use for DBpedia. But this might be naive, because I haven't really looked into what it does in detail yet.

pasky commented 8 years ago

Oh, having the code to do this would be awesome, this sounds pretty neat!

One step we perform in YodaQA right now that I may have omitted in the discussion above, and which is related to this, is concept linking; the way it's done and used is a bit tied to TODs as well:

So, when we don't have a TOD corpus, we should disable label-lookup by default!

But if we get your mechanism, we wouldn't have the link, but would still benefit from it thanks to better clues. How important that is is hard to say - it's probably not revolutionary, but it should certainly have some impact. I think the best way would be to reuse the label-lookup infrastructure (the fuzzy part; though if you have a way to detect synonymous references to the same concept, that'd belong to the crosswikis part), but provide a null link.

Does that make sense? I don't think it's necessarily the top priority, though. But if you need a reason to get the code for what you described out there, I'd totally be for it. :-)

k0105 commented 8 years ago

> Oh, having the code to do this would be awesome, this sounds pretty neat!

Sure. Once I'm done with my paper in early April we'll have to sift through all my stuff and see what you want and how to integrate it. Also: Thanks for the extensive answers, they are very helpful and I highly appreciate it.

k0105 commented 8 years ago

I just got a dump from a wiki about my domain and I'm considering an attempt to put it into YodaQA. Transforming the MediaWiki dump XML format into Solr's XML format is fairly trivial (esp. since I've done that before with my papers), and similarly, creating the static labels should be straightforward by generating sorted_list.dat myself. Let's pick an arbitrary example from the current file: "!Ay, caramba! %C2%A1Ay,_caramba! 1880521 0" and "!Ay, Carmela! (film) %C2%A1Ay_Carmela! 10558683 0" [%C2%A1 is the URL-encoded Unicode for '¡', the numbers are Wikipedia's curid, and the last number apparently says whether the title belongs to an article directly (1) or is a redirect (0)].

But: I just read the paper about the SQLite labels ("A Cross-Lingual Dictionary for English Wikipedia Concepts") and this seems way more involved. The authors even explicitly state in their summary that "[t]he dictionary [...] would be difficult to reconstruct in a university setting".

Do you think it's worth a shot despite all this? This is neither essential for my paper nor do I really have any time left for it, but I would really like to have the capability of adding arbitrary wiki dumps to YodaQA. I mean, we finally have the distributional semantics, the keyphrase extraction, the converted corpora and the TODs - I would really like to see it all come together now. Any hints or remarks on how to tackle this? What should I do with the extracted keyphrases (which I could also expand with my distributional semantics model to e.g. get similar terms) - can I just throw them into the label service, or do anything clever with them without modifying the UIMA annotators [which I don't have time for until my deadline]?

Thanks in advance.

Best wishes, Joe

pasky commented 8 years ago

Hi! I think you may be setting your goals too high for your initial work on this. :) The fuzzy labels dataset is just there to make matching more robust to different spellings, nicknames etc., and by default it's extracted from the Wikipedia corpus - but if you don't need that, it's fine to just include a mapping from each concept name to its canonical article by id, without anything more involved.

I think maybe the most elegant solution would be writing a script that takes the Solr-import XML and generates the labels dataset to load into labels-lookup based on that. This should be pretty trivial and a universal solution for whatever corpora anyone can massage into the XML dump format. Does that sound sane?
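Roughly something like this (an untested sketch in Java; I'm assuming the import file uses the usual Solr <add><doc><field name="..."> layout and that each doc carries "id" and "title" fields - adjust the field names to whatever your schema actually uses):

```java
import java.io.File;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Rough sketch: turn a Solr <add><doc>... import file into a labels dataset
 *  (label, canonical name, id, flag) for labels-lookup. The "id" and "title"
 *  field names are assumptions - adjust them to your actual Solr schema. */
public class SolrXmlToLabels {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: SolrXmlToLabels <solr-import.xml> <labels.dat>");
            return;
        }
        Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File(args[0]));
        NodeList docs = dom.getElementsByTagName("doc");
        try (PrintWriter out = new PrintWriter(args[1], StandardCharsets.UTF_8.name())) {
            for (int i = 0; i < docs.getLength(); i++) {
                Element doc = (Element) docs.item(i);
                String id = null, title = null;
                NodeList fields = doc.getElementsByTagName("field");
                for (int j = 0; j < fields.getLength(); j++) {
                    Element f = (Element) fields.item(j);
                    if ("id".equals(f.getAttribute("name"))) id = f.getTextContent();
                    if ("title".equals(f.getAttribute("name"))) title = f.getTextContent();
                }
                if (id == null || title == null) continue;
                // label <TAB> canonical name <TAB> page id <TAB> 1 (direct article);
                // mirrors the sorted_list.dat layout, minus Wikipedia's URL escaping.
                out.println(title + "\t" + title.replace(' ', '_') + "\t" + id + "\t1");
            }
        }
    }
}
```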

k0105 commented 8 years ago

Yes, this is exactly the minimal solution I've been thinking about. So I wrote the wiki parser - it needs only one pass over a MediaWiki XML dump, filters out all the off-topic pages, and simultaneously adds the remaining TODs both to a label file and to a Solr XML input file, in plaintext with any MediaWiki or HTML markup removed (well, in theory, but it is reasonably clean). So far, so good. Solr and the label service work; I took my offline version of Yoda 1.4 that I still had around (the standard 1.4 release with online backends replaced by localhost), but I currently get:

*** http://localhost:5000 or http://localhost:5001 label lookup query (temporarily?) failed, retrying in a moment...
java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at java.net.Socket.connect(Socket.java:538)
        at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
        at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
        at sun.net.www.http.HttpClient.New(HttpClient.java:308)
        at sun.net.www.http.HttpClient.New(HttpClient.java:326)
        at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
        at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
        at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
        at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1513)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1441)
        at cz.brmlab.yodaqa.provider.rdf.DBpediaTitles.queryCrossWikiLookup(DBpediaTitles.java:287)
        at cz.brmlab.yodaqa.provider.rdf.DBpediaTitles.query(DBpediaTitles.java:100)
        at cz.brmlab.yodaqa.analysis.question.CluesToConcepts.process(CluesToConcepts.java:97)
        at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
        at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:385)
        at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:309)
        at cz.brmlab.yodaqa.flow.asb.MultiprocessingAnalysisEngine_MultiplierOk.processAndOutputNewCASes(MultiprocessingAnalysisEngine_MultiplierOk.java:218)
        at cz.brmlab.yodaqa.flow.asb.MultiThreadASB$AggregateCasIterator$1.call(MultiThreadASB.java:772)
        at cz.brmlab.yodaqa.flow.asb.MultiThreadASB$AggregateCasIterator$1.call(MultiThreadASB.java:754)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

which is strange, because the label service reports actual requests, for instance:

::ffff:127.0.0.1 - - [05/Feb/2016 09:50:02] "GET /search/Cognitivism?ver=1 HTTP/1.1" 200 -
searching Cognitivism
found:
[{'canonLabel': 'Cognitivism', 'dist': 0, 'name': 'Cognitivism', 'matchedLabel': 'Cognitivism', 'prob': '0'}, {'canonLabel': 'Connectivism', 'dist': 3, 'name': 'Connectivism', 'matchedLabel': 'Connectivism', 'prob': '0'}]

I don't run any backends besides Solr and one label service - is it trying to reach the SQLite backend on 5001 (which is not there) or what is happening?

Update: OK, so I modified the code to put Freebase and DBpedia on two different ports, started them accordingly, and then just ran everything (2 label services, Freebase, Solr, DBpedia, and Yoda with Solr and one label service using the data from my wiki dump). It now gets further (which means that Yoda has a problem when there is only one label service - it then blocks with java.net.ConnectException: Connection refused), but it still throws errors:

5970d004988310949685ff957/de.tudarmstadt.ukp.dkpro.core.api.lexmorph-asl-1.7.0.jar!/de/tudarmstadt/ukp/dkpro/core/api/lexmorph/tagset/en-ptb-pos.map
INFO ResourceObjectProviderBase - Producing resource took 0ms
Feb 05, 2016 10:29:17 AM org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl callAnalysisComponentProcess(417)
SEVERE: Exception occurred
org.apache.uima.analysis_engine.AnalysisEngineProcessException: Annotator processing failed.    
        at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:401)
        at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:309)
        at cz.brmlab.yodaqa.flow.asb.MultiprocessingAnalysisEngine_MultiplierOk.processAndOutputNewCASes(MultiprocessingAnalysisEngine_MultiplierOk.java:218)
        at cz.brmlab.yodaqa.flow.asb.MultiThreadASB$AggregateCasIterator$1.call(MultiThreadASB.java:772)
        at cz.brmlab.yodaqa.flow.asb.MultiThreadASB$AggregateCasIterator$1.call(MultiThreadASB.java:754)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
        at java.util.concurrent.ConcurrentHashMap.putVal(ConcurrentHashMap.java:1011)
        at java.util.concurrent.ConcurrentHashMap.put(ConcurrentHashMap.java:1006)
        at cz.brmlab.yodaqa.analysis.tycor.LATNormalize.process(LATNormalize.java:145)
        at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
        at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:385)
        ... 8 more

Feb 05, 2016 10:29:17 AM org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl processAndOutputNewCASes(274)
SEVERE: Exception occurred
org.apache.uima.analysis_engine.AnalysisEngineProcessException
        at cz.brmlab.yodaqa.flow.asb.MultiThreadASB$AggregateCasIterator.processUntilNextOutputCas(MultiThreadASB.java:1074)
        at cz.brmlab.yodaqa.flow.asb.MultiThreadASB$AggregateCasIterator.<init>(MultiThreadASB.java:496)
        at cz.brmlab.yodaqa.flow.asb.MultiThreadASB.process(MultiThreadASB.java:416)
        at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.processAndOutputNewCASes(AggregateAnalysisEngine_impl.java:266)
        at cz.brmlab.yodaqa.flow.asb.MultiThreadASB$AggregateCasIterator$1.call(MultiThreadASB.java:772)
        at cz.brmlab.yodaqa.flow.asb.MultiThreadASB$AggregateCasIterator$1.call(MultiThreadASB.java:754)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.ExecutionException: org.apache.uima.analysis_engine.AnalysisEngineProcessException: Annotator processing failed.    
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.util.concurrent.FutureTask.get(FutureTask.java:192)
        at cz.brmlab.yodaqa.flow.asb.MultiThreadASB$AggregateCasIterator.collectCasInFlow(MultiThreadASB.java:814)
        at cz.brmlab.yodaqa.flow.asb.MultiThreadASB$AggregateCasIterator.casInFlowFromFuture(MultiThreadASB.java:603)
        at cz.brmlab.yodaqa.flow.asb.MultiThreadASB$AggregateCasIterator.nextCasToProcess(MultiThreadASB.java:716)
        at cz.brmlab.yodaqa.flow.asb.MultiThreadASB$AggregateCasIterator.processUntilNextOutputCas(MultiThreadASB.java:1039)
        ... 9 more
Caused by: org.apache.uima.analysis_engine.AnalysisEngineProcessException: Annotator processing failed.    
        at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:401)
        at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:309)
        at cz.brmlab.yodaqa.flow.asb.MultiprocessingAnalysisEngine_MultiplierOk.processAndOutputNewCASes(MultiprocessingAnalysisEngine_MultiplierOk.java:218)
        ... 6 more
Caused by: java.lang.NullPointerException
        at java.util.concurrent.ConcurrentHashMap.putVal(ConcurrentHashMap.java:1011)
        at java.util.concurrent.ConcurrentHashMap.put(ConcurrentHashMap.java:1006)
        at cz.brmlab.yodaqa.analysis.tycor.LATNormalize.process(LATNormalize.java:145)
        at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
        at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:385)
        ... 8 more

Oh, one more thing: of course, I made sure that everything else is correct: when I replace the Solr file and the label file with the Wikipedia ones, YodaQA runs perfectly and fully locally.

pasky commented 8 years ago

That's right, it's trying to reach the SQLite backend. Either disable it in the YodaQA code, or run one with an empty database.

k0105 commented 8 years ago

Just for the record: While running with an empty DB doesn't work for me, deactivating the second label service in the YodaQA code (just return an empty list instead of performing the actual lookup) does work. Hence, I might have the first (admittedly rather naive) domain adaptation: the system can now use the Edutech-Wiki content. Definitions work fairly well (What is e-learning, mobile learning, blended learning etc.), and when one is willing to consider the top 5, even questions like "Which learning theory did John Dewey [or Piaget etc.] contribute to?" are often answered correctly. While it is certainly not ready for productive use in research and the recommended articles still link to Wikipedia instead of the other wiki, it seems like a decent stepping stone for more involved domain adaptation. I guess the next step is to get my hands dirty and introduce my JBT backend to the mix. Ain't gonna be pretty, but someone needs to do it.
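In code, the change boils down to stubbing out the crosswiki lookup so it never touches port 5001 - roughly like this (type and method names are simplified stand-ins for the real DBpediaTitles code, so treat it as a sketch of the idea rather than a patch):

```java
import java.util.Collections;
import java.util.List;

public class CrossWikiLookupStub {
    /** Placeholder for whatever DBpediaTitles returns per matched label. */
    public static class LabelMatch {
        public final String canonLabel;
        public LabelMatch(String canonLabel) { this.canonLabel = canonLabel; }
    }

    /**
     * Sketch of the "disable" variant: instead of opening an HTTP connection
     * to the crosswiki label service on :5001, just report no matches.
     */
    public List<LabelMatch> queryCrossWikiLookup(String label) {
        return Collections.emptyList();
    }
}
```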

PS: One more thing - I've benchmarked my local Yoda instance. SSDs make it about 2.8x as fast, but the quality of the SSD didn't matter - a consumer SSD with about 550MB/s led to results just as fast as a professional SSD over PCIe with effective 1500MB/s. Of course, you'll find a detailed report in my paper.

pasky commented 8 years ago

Awesome work! When you get a chance, it would be great if you could publish your code or a step-by-step guide for importing the Edutech-Wiki and using it with YodaQA.

k0105 commented 8 years ago

Sure. I'm busy for the next 9 days, but afterwards I'll just finish my paper and send you the whole thing including detailed reports on everything we talked about.

tanmayb123 commented 8 years ago

@k0105 Any update yet? Thanks.

k0105 commented 8 years ago

We've played with the topic classifiers. Again, simplicity triumphed. While random forests with tf/idf, word lemmatization and Snowball stemming work well, I always wanted to try SVMs on this, and indeed: while the former solution yields around 92%, the latter gives 94-95%. You can find the code here: https://github.com/yunshengb/QueryClassifier/blob/master/sklearn/train.py

I'm currently reimplementing the ensemble so it's independent of JavaFX (almost done) and then I'll move over to the wiki parser and keyphrase extraction. I'll most likely just upload it to Github. Slowly but surely I'm getting through the material.

k0105 commented 8 years ago

@nazeer1 So, have you been able to hook up your own triple store? That sounds pretty interesting...

You can switch DBpedia and Freebase to another Fuseki backend by setting system properties - try -Dcz.brmlab.yodaqa.dbpediaurl="http://yoursource:1234/yoursource/query". If you get something decent, please post a quick summary of your findings. I'd love to read it.
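The switch works because the endpoint is read from a system property with a built-in default - conceptually something like this (the class name and default URL here are only illustrative, not the actual YodaQA code):

```java
public class EndpointSwitchSketch {
    // Illustration only: read the Fuseki endpoint from a system property,
    // falling back to a default when none is given on the command line.
    public static String dbpediaEndpoint() {
        return System.getProperty("cz.brmlab.yodaqa.dbpediaurl",
                "http://localhost:3030/dbpedia/query");  // assumed default
    }

    public static void main(String[] args) {
        System.out.println("Using DBpedia/Fuseki endpoint: " + dbpediaEndpoint());
    }
}
```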

ghost commented 8 years ago

@k0105 Thanks for your comment. I don't know in which path I should run your command. I uploaded my RDFs into Jena Fuseki; the backend URL is "http://localhost:3030/dbpedia/query". I replaced the DBpedia and Freebase URLs with it in all classes in the "provider.rdf" package. I also updated "fuzzyLookupUrl" and "crosswikiLookupUrl" with my URL, and then I had to comment out everything in the "DBpediaTypesTest.groovy" class. Now when I run the web view and search for a query, it keeps searching and I get this error message in the terminal: *** http://localhost:3030/dbpedia/query or http://localhost:3030/dbpedia/query label lookup query (temporarily?) failed, retrying in a moment... It comes from the "DBpediaTitles" class.

Can you please give me a hint, or tell me if I am doing something wrong? I want YodaQA to search only in my domain, which is RDF, and I want all other domains to be deactivated. I still don't know how that is possible.

pasky commented 8 years ago

If you simply substitute the DBpedia URL with your RDF endpoint, that will affect not just the knowledge base used to generate answers but also further answer-scoring components, which you might not want to do - and if you do want to, you'll have to rewrite classes like DBpediaTitles.

So, the easiest route is to keep the DBpedia answer-scoring components at first and change just the answer generator component, which is DBpediaOntology (or DBpediaProperties; they are almost identical): clone the DBpediaLookup class to a MyKBLookup class with an appropriately modified URL, and clone DBpediaProperties to MyKBProperties with a SPARQL query appropriate to your knowledge base. Finally, clone the DBpediaProperty* classes in the pipeline.structured package to MyKBProperty* classes, using your RDF provider instead, and put them in the main YodaQA pipeline code in place of the previous knowledge bases.
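To give a flavour of what the cloned lookup class ends up doing, here is a standalone Jena sketch (not the actual YodaQA rdf provider code - the endpoint, the entity URI and the query are placeholders for whatever your knowledge base uses):

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

public class MyKBPropertiesSketch {
    private static final String ENDPOINT = "http://localhost:3030/mykb/query"; // placeholder

    /** Fetch property/value pairs of one entity - the shape of query that
     *  DBpediaProperties runs against DBpedia, adapted to a custom KB. */
    public static void dumpProperties(String entityUri) {
        String sparql =
            "SELECT ?property ?value WHERE { <" + entityUri + "> ?property ?value } LIMIT 50";
        try (QueryExecution qe = QueryExecutionFactory.sparqlService(ENDPOINT, sparql)) {
            ResultSet rs = qe.execSelect();
            while (rs.hasNext()) {
                QuerySolution row = rs.next();
                System.out.println(row.get("property") + " -> " + row.get("value"));
            }
        }
    }

    public static void main(String[] args) {
        dumpProperties("http://example.org/entity/Something"); // placeholder entity
    }
}
```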

If you want to perform more complex question answering than just a direct entity attribute, you should look at FreebaseOntology instead, but it's more complex code too.

Finally, if you want to answer questions only from knowledge bases rather than from a combination of kb and unstructured texts, base your work on the d/movies branch rather than master.

HTH

ghost commented 8 years ago

Thanks @pasky, @k0105 for your hints. I still have some problems adapting my domain to YodaQA. First, I would like to ask whether YodaQA can answer queries from a set of simple RDF files, or do I need to build an ontology first? (I have RDF files that contain some information about a set of objects.)

Second: to change the domain, I have done the following steps and am facing the following problems; I would really appreciate a hint on how to solve them. First, I followed the instructions in the DBpedia README, so I successfully uploaded DBpedia 2014 to Apache Jena Fuseki on my localhost, then I created the MyKB classes per your instructions. It works at first, but when I comment out either the "DBpediaOntologyAnswerProducer" or the "FreebaseOntologyAnswerProducer" endpoint in the YodaQA pipeline and run any query, it stays stuck in the searching process and shows this message in the terminal: INFO LATByWordnet - ?! word ..... l of POS NNS not in Wordnet.

Then, when I added some of my data to the DBpedia RDF files for testing (keeping the DBpedia structure so I wouldn't need to change the SPARQL code), it also stays in the searching process for a long time, shows none of the data I added, and prints this message in the terminal: INFO FocusGenerator - ?. No focus in:

Lastly, I would like to ask whether I should change these two URLs in the DBpediaTitles.java class: protected static final String fuzzyLookupUrl = "http://dbp-labels.ailao.eu:5000"; protected static final String crossWikiLookupUrl = "http://dbp-labels.ailao.eu:5001"; Since I am using Apache Jena, I don't know which URL would be the equivalent. My query endpoint is "http://localhost:3030/dbpedia/query".

Thanks in advance.

k0105 commented 8 years ago

There are no such entries in DBpediaTitles anymore (cf. https://github.com/brmson/yodaqa/blob/master/src/main/java/cz/brmlab/yodaqa/provider/rdf/DBpediaTitles.java), and yeah, I'd change those. As mentioned before, you should disable the second label service by returning an empty list, and parse your data to create an input file for the first label service. For the other questions, I have to refer you to Petr, since he is much more knowledgeable about Yoda than I am.

ghost commented 8 years ago

Many, many thanks @k0105 for the information. I am using d/movies; there I still have the old version of the code in "DBpediaTitles", but I replaced it with the new one from the link you mentioned. It is clearer to me now, and I will try to disable the second label service and parse my data into the first one as you said - I hope it works. I will update you with my results ASAP. Regards,

k0105 commented 8 years ago

Please do - if you really manage to connect your own triple store and get a working domain adaptation, that would be quite interesting to read about.