dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

Starring property missing in some films [2] #565

Open staticdev opened 5 years ago

staticdev commented 5 years ago

Since I can't reopen an issue, I've created a second one (following up on https://github.com/dbpedia/extraction-framework/issues/552).

I don't get the starring result when I go to https://dbpedia.org/sparql and use the query: SELECT ?p ?o WHERE {<http://dbpedia.org/resource/Forrest_Gump> ?p ?o}

The result should include the starring cast (as on Wikipedia): Tom Hanks, Robin Wright, Gary Sinise, Mykelti Williamson, Sally Field
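
For reference, a query that targets the missing property directly (dbo:starring is the mapped property name referenced later in this thread):

PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?actor WHERE { <http://dbpedia.org/resource/Forrest_Gump> dbo:starring ?actor }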

JJ-Author commented 5 years ago

What is the difference to #552? I don't see a reason to open an issue just because you don't find something in the SPARQL endpoint. This repository is for the code base (the extraction framework), which you can run on your own to extract data. I pointed out in #552 that this is fixed and how you can use this fix. Do you have problems using the working solution I suggested?

staticdev commented 5 years ago

@JJ-Author there is no working solution: as far as I've seen, it was not merged into the code. So it is still an issue; #552 is closed but the problem persists. That's why I opened this one, as I don't have permission to reopen that issue.

I also didn't see in issue #552 any reference to a commit or pull request by @chile12 that solves this problem.

JJ-Author commented 5 years ago

Using the latest template-test branch (https://github.com/JJ-Author/extraction-framework/tree/template-test) IS the working solution I suggested. See the output shown in https://github.com/dbpedia/extraction-framework/issues/552#issuecomment-401308813. If you still experience the problem for your example when using this branch, please post the output here.

staticdev commented 5 years ago

@JJ-Author is there a link to an endpoint running this branch's code that I could test against? Otherwise, how should I try this? Sorry if this is an easy question, but I am just starting to understand the inner workings of DBpedia.

JJ-Author commented 5 years ago

There is no endpoint for development versions. As I said, this repository is about the extraction framework, not about the public endpoint, which is loaded with the latest release (2016-10 at the moment) and powered by Virtuoso / OpenLink. You can run the extraction framework locally and do some ad-hoc extraction for the articles in question. You could follow this tutorial: https://extremeprinciple.blogspot.com/2017/11/setting-up-intellij-to-work-with.html, or go for the pure command line using mvn and then cd server && ../run server. Then browse to http://localhost:9999/server/extraction/en/
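
A rough sketch of that command-line route (assuming the template-test branch builds cleanly with Maven; the run script and URL are the ones mentioned above):

git clone https://github.com/dbpedia/extraction-framework.git
cd extraction-framework
git checkout template-test
mvn clean install                  # build all modules first
cd server && ../run server         # start the ad-hoc extraction server
# then browse to http://localhost:9999/server/extraction/en/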

staticdev commented 5 years ago

Ok, I will try to extract using this branch and set up my own endpoint to test it. I'll post here after that.

JJ-Author commented 5 years ago

You don't need your own endpoint. With the link to the server component you can do ad-hoc extraction for individual Wikipedia pages.
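
For example, against a locally running server an ad-hoc extraction request would look something like this (the parameter format matches the dief.tools.dbpedia.org links posted later in this thread):

http://localhost:9999/server/extraction/en/extract?title=Forrest+Gump&revid=&format=trix&extractors=mappings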

staticdev commented 5 years ago

@JJ-Author I am trying to follow the IntelliJ tutorial with OpenJDK 11 for the extractor/server, but I am getting build errors from your fork's master branch:

/extraction-framework/core/src/main/java/org/dbpedia/iri/UriToIriDecoder.java
Error:(3, 18) java: package sun.nio.cs does not exist

/extraction-framework/core/src/main/java/org/dbpedia/extraction/sources/WikipediaDumpParser.java
Error:(3, 35) java: package org.dbpedia.extraction.util does not exist
Error:(5, 41) java: package org.dbpedia.extraction.wikiparser does not exist
Error:(8, 56) java: package org.dbpedia.extraction.wikiparser.impl.wikipedia does not exist

Do you know how I can fix this?

JJ-Author commented 5 years ago

Three things come to mind:

And maybe send me 3 example Wikipedia pages that are of interest to you w.r.t. the starring property, and I can post the results with the branch here. That is maybe the fastest way.

staticdev commented 5 years ago

@JJ-Author I cloned the fork because that was the link you first provided. I have now cloned the branch from dbpedia/extraction-framework instead of JJ-Author/extraction-framework.

The problem continues; I will try Java 8. I am building from source since I am studying how DBpedia works. It will take more time, but I need to understand it in more detail to go further with my research.

Thanks.

staticdev commented 5 years ago

@JJ-Author by changing to Java 8 instead of 11, the project now builds. Now I am doing some tests.

JJ-Author commented 5 years ago

Cool. Sorry for the confusion due to the copy-and-paste error: that was my fork's template-test branch. You could also use that one, but I suggest using the template-test branch from the official repo, since it already includes some other fixes.

staticdev commented 5 years ago

My extraction seems to be taking forever. I configured it to download the 'pt' articles; according to Wikipedia there are about 1 million articles, but after 12 hours it had already extracted more than 23 million pages.

INFO : pt; extraction at 722:58.493s for 0 datasets; extracted 23252000 pages; 1,87 ms per page; 310 failed pages

It is also consuming a huge amount of RAM:

top - 02:23:15 up  8:21,  1 user,  load average: 2,00, 1,98, 1,93
Tasks: 227 total,   1 running, 225 sleeping,   0 stopped,   1 zombie
%Cpu(s): 44,7 us,  4,3 sy,  0,0 ni, 51,0 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
MiB Mem :   7868,8 total,    118,2 free,   7241,6 used,    509,0 buff/cache
MiB Swap:   8192,0 total,   6747,0 free,   1445,0 used.    307,3 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                           
 8748 root      20   0   19,2g   6,5g   3928 S  93,8  85,1   1391:00 java

JJ-Author commented 5 years ago

Hm, I never did a dump extraction, only ad-hoc extractions. I know that way more data will be written using this template-test branch, since provenance (a JSON object of around 20 records) is written for every single emitted triple. But it is possible that there are memory leaks or other bugs. The only person who could help (but is really busy) is @chile12; AFAIK he tried one extraction. I can only suggest trying a smaller chapter first and seeing whether it terminates. I also don't know your hardware spec, but 6G of memory consumption does not seem like much to me. We usually run it on a 256 GB RAM, 64-core, RAID-0 SSD setup.

Maybe @Vehnem can start some extraction of https://github.com/dbpedia/extraction-framework/tree/template-test for en, de, pt, but I cannot promise.

staticdev commented 5 years ago

INFO : pt; extraction at 1384:31.572s for 0 datasets; extracted 52434000 pages; 1,58 ms per page; 697 failed pages
INFO : none; transformation at 1384:29.819s for 0 datasets; finished extraction after 0 pages with ∞ ms per page
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 23:29 h
[INFO] Finished at: 2019-01-22T18:46:36-02:00
[INFO] ------------------------------------------------------------------------

Now I need to run a SPARQL endpoint on top of this extraction to validate the data. What is the best approach for that? Can I use Apache Jena, Virtuoso, or Marmotta?

JJ-Author commented 5 years ago

Cool, nice job! We use Virtuoso. It seems best at dealing with extraction errors and also has good performance, provided the buffer size is not too small.
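
For reference, the relevant knobs live in virtuoso.ini; a sketch for a machine with around 8 GB of RAM, following the sizing table in the stock virtuoso.ini comments (the exact values are illustrative):

[Parameters]
NumberOfBuffers = 680000   ; sized for ~8 GB RAM
MaxDirtyBuffers = 500000   ; usually about 3/4 of NumberOfBuffers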

staticdev commented 5 years ago

@JJ-Author I've installed Virtuoso Open Source on Debian, set an admin password, and logged in to Conductor. I saw there was a virtuoso-vad-dbpedia package marked as obsolete. I've also tried to get the VAD package from the link provided here: http://vos.openlinksw.com/owiki/wiki/VOS/VirtEC2AMIDBpediaInstall.

I think this is the last step now: how do I point the Virtuoso endpoint at the extracted data and test a SPARQL query?

JJ-Author commented 5 years ago

You don't need the VAD package; it is primarily used for the Linked Data view. If it causes trouble, just go without it. You can use the Virtuoso bulk loader: http://vos.openlinksw.com/owiki/wiki/VOS/VirtBulkRDFLoaderExampleDbpedia. There is also a nice Docker image for Virtuoso by tenforce (https://hub.docker.com/r/tenforce/virtuoso/) where you can put the data into a toLoad directory.
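
A minimal sketch of the tenforce route (options as documented in the tenforce/virtuoso README; the host path and password are placeholders):

docker run --name dbpedia-virtuoso \
  -p 8890:8890 -p 1111:1111 \
  -e DBA_PASSWORD=mysecret \
  -v /path/to/extracted/ttl:/data/toLoad \
  -d tenforce/virtuoso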

staticdev commented 5 years ago

I went for the tenforce/virtuoso approach with Docker. As in https://github.com/harsh9t/Dockerised-DBpedia-Virtuoso-Endpoint-Setup-Guide, I extracted all ttl.bz2 files into one folder (ignoring the rest of my extraction products: tql.bz2, redirects.obj and prov.json.bz2). I named this folder pt-ttl and mapped it as the toLoad folder as you mentioned (docker-compose.yml):

version: "3.7"
services:
  dbpedia-virtuoso:
    image: tenforce/virtuoso:1.3.2-virtuoso7.2.2
    ports:
      - 8890:8890
      - 1111:1111
    volumes:
      - /home/static/dbpedia/pt-ttl/:/data/toLoad/

After starting the container, though, when I try http://localhost:8890/DAV I don't get any response. I must be missing something...

My Docker logs are not helping me either:

Finished converting environment variables to ini file
Connected to OpenLink Virtuoso
Driver: 07.20.3215 OpenLink Virtuoso ODBC Driver
OpenLink Interactive SQL (Virtuoso), version 0.9849b.
Type HELP; for help and EXIT; to exit.
SQL> dump_nquads(0) dump_nquads(1) dump_nquads(1) dump_nquads(1) dump_nquads(1) dump_nquads(1) dump_nquads(1) dump_nquads(1) dump_nquads(1) dump_nquads(1) dump_nquads(1) dump_nquads(2) dump_nquads(2) dump_nquads(3) dump_nquads(3) dump_nquads(2) dump_nquads(2) dump_nquads(2) dump_nquads(2) dump_nquads(3) dump_nquads(3) dump_nquads(3) dump_nquads(3) dump_nquads(4) dump_nquads(4) dump_nquads(4) dump_nquads(3) dump_nquads(3) dump_nquads(3) dump_nquads(3) dump_nquads(2) dump_nquads(2) dump_nquads(1) dump_nquads(1) dump_nquads(2) dump_nquads(2) dump_nquads(2) dump_nquads(2) dump_nquads(3) dump_nquads(3) dump_nquads(3) dump_nquads(2) dump_nquads(2) dump_nquads(2) dump_nquads(1) 
Done. -- 1 msec.
SQL> SQL> Connected to OpenLink Virtuoso
Driver: 07.20.3215 OpenLink Virtuoso ODBC Driver
OpenLink Interactive SQL (Virtuoso), version 0.9849b.
Type HELP; for help and EXIT; to exit.
SQL> Start data loading from toLoad folder
ld_dir('toLoad', '*', 'http://localhost:8890/DAV');
rdf_loader_run();
exec('checkpoint');
WAIT_FOR_CHILDREN; 
Connected to OpenLink Virtuoso
Driver: 07.20.3215 OpenLink Virtuoso ODBC Driver
OpenLink Interactive SQL (Virtuoso), version 0.9849b.
Type HELP; for help and EXIT; to exit.
SQL> 
Done. -- 3 msec.
SQL> chmod: cannot access '/clean-logs.sh': No such file or directory
Start data loading from toLoad folder
ld_dir('toLoad', '*', 'http://localhost:8890/DAV');
rdf_loader_run();
exec('checkpoint');
WAIT_FOR_CHILDREN; 
ld_dir('toLoad', '*', 'http://localhost:8890/DAV');
rdf_loader_run();
exec('checkpoint');
WAIT_FOR_CHILDREN; 

Do you know what I did wrong?

JJ-Author commented 5 years ago

http://localhost:8890/sparql is where you should browse to. Loading can take several hours up to days, depending on your buffer size settings and disk speed. You can check whether loading has completed: http://vos.openlinksw.com/owiki/wiki/VOS/VirtBulkRDFLoader#Checking%20bulk%20load%20status
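
Inside the tenforce container, that status check would look roughly like this (container name and password are whatever you configured; the LOAD_LIST columns are as described in the linked docs):

docker exec -it dbpedia-virtuoso isql-v 1111 dba mysecret
SQL> select ll_file, ll_state, ll_error from DB.DBA.LOAD_LIST;
-- ll_state: 0 = queued, 1 = loading, 2 = done; a non-NULL ll_error means that file failed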

staticdev commented 5 years ago

I can't execute queries: neither http://localhost:8890/sparql nor http://localhost:8890/DAV works in the browser; I get a "Connection Reset" error. Also, localhost:8890 seems to have a problem:

https://imgur.com/Iey8ZFn (screenshot)

JJ-Author commented 5 years ago

You can execute the isql command line tool to check the status. Otherwise, create a new container and start again. To me it looks like it is still loading. But in general your problems with Virtuoso are somewhat out of the scope of the DBpedia extraction framework.

staticdev commented 5 years ago

From isql: select * from DB.DBA.LOAD_LIST; — I see just one ttl with an error:

2 2019.2.2 11:31.43 69876000 2019.2.2 11:31.44 521623000 0 NULL 37000 [Vectorized Turtle loader] SP029: TURTLE RDF loader, line 172395: syntax error

Now the endpoint (http://localhost:8890/sparql) works, but when I try a query like:

SELECT ?p ?o WHERE {<http://dbpedia.org/resource/Matrix> ?p ?o}

I just get an empty table with p and o headers as the result. I tried many movie names; something is wrong.

staticdev commented 5 years ago

@JJ-Author I tried multiple versions of tenforce and also openlink/virtuoso-opensource-7, and there is always a problem with this Turtle file:

16:22:06 PL LOG: File ./ptwiki-20190120-mappingbased-objects-uncleaned.ttl error 37000 [Vectorized Turtle loader] TURTLE RDF loader, line 172395: SP029: TURTLE RDF loader, line 172395: syntax error.

There might be a problem with the template-test branch...
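
One quick way to see what the loader is choking on, using the line number from the error message above:

sed -n '172395p' ptwiki-20190120-mappingbased-objects-uncleaned.ttl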

JJ-Author commented 5 years ago

Yeah, you probably did not run some post-processing steps (a proper release has a quite complex workflow, which I am not fully aware of either). The simplest solution seems to be to prune the files before loading them; have a look here at how one could do that using rapper: https://github.com/dbpedia/databus-maven-plugin/blob/master/dbpedia/rapandsort.sh
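
A minimal sketch of that pruning step, assuming the Raptor rapper utility is installed (the linked rapandsort.sh is the authoritative version; rapper re-serializes what it can parse and reports bad input on stderr):

rapper -i turtle -o ntriples ptwiki-20190120-mappingbased-objects-uncleaned.ttl \
  > mappingbased-objects-cleaned.nt 2> rapper-errors.log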

Also, SELECT ?p ?o WHERE {<http://dbpedia.org/resource/Matrix> ?p ?o} is not correct, since you used the Portuguese dataset; it should be SELECT ?p ?o WHERE {<http://pt.dbpedia.org/resource/Matrix> ?p ?o}

staticdev commented 5 years ago

@JJ-Author you were right about the post-processing and the query. Now it works!

Do you know when these fixes from template-test will be merged into master?

JJ-Author commented 5 years ago

I don't. @chile12, is there any progress with merging and cherry-picking?

m1ci commented 4 years ago

@JJ-Author where should the dbo:starring triples be found? Probably in https://databus.dbpedia.org/dbpedia/mappings/mappingbased-objects/, right?

I checked the mappingbased-objects files for English and there are no triples with http://dbpedia.org/resource/Forrest_Gump as the subject. It seems the Forrest Gump triples are not extracted. Weird. Forrest Gump is hiding :smile:

Anyway, it would be nice to debug this and write a test. /cc @Vehnem

JJ-Author commented 4 years ago

As I said, this is already fixed in the template-test branch; somebody needs to merge it. kurzum said a GSoC student will fix and extend the extractors. I consider this issue closed for myself.

Vehnem commented 3 years ago

http://dief.tools.dbpedia.org/server/extraction/en/extract?title=The+Matrix+Revolutions+&revid=&format=trix&extractors=mappings
http://dief.tools.dbpedia.org/server/extraction/en/extract?title=Forrest+Gump&revid=&format=trix&extractors=mappings

Both examples have dbo:starring now, but this issue still requires a minidump test.