Bulk Loading of large RDF file and matching by triple pattern

laurensdv commented 10 years ago

1. Suppose I have a large data dump in Turtle or N-Triples on my local disk. How can I load it into the store? Do I need to publish it on a local web server and use the URI of the file to index it?
2. Once I have loaded the entire file containing triples, can I use a filter operation to match ?s ?p ?o where I bind any of ?s, ?p, ?o to a resource or a literal? Or does it only support 'match', meaning that the matching is done by string?
3. If I have a SPARQL query that wasn't indexed before, can I send it to the indexed version of the data in my file? Or is SPARQL not supported as such, and only the results of a SPARQL query are indexed, so that a first-time request would need to be sent to the original endpoint and then the second time it.

iulia-pasov commented 10 years ago

Hello,

1. At this point only the RDF/XML format is accepted for URIs. You need to send, as an argument, the URI or URL where the data is stored in this format. We will add this feature soon, so if you need it you can either add it yourself or wait for the next release.
2. I don't think I understand your question very well. Once the triples are indexed, you don't need to store variable names. They only occur in the river; they are not indexed anywhere else. You can filter by predicate-object with the desired values.
3. Yes, you can add a new river with a new SPARQL query. Only the results of a query are indexed, so just use the same index name and type if you want the results in the same index.

laurensdv commented 10 years ago

Ok, thanks for number 1 and 3.

For number 2. I am looking to resolve queries like (e.g. after having indexed DBpedia)

<http://dbpedia.org/resource/Rio_De_Janeiro> ?p ?o or <http://dbpedia.org/resource/Rio_De_Janeiro> ?p <http://dbpedia.org/resource/Brazil>

comparable to what is possible with SIREn (another Lucene-based RDF indexer plugin, but one that uses Solr): https://github.com/rdelbru/SIREn/wiki/NTriple-Query-Parser

If I have indexed a SPARQL query that returns many results, can I then also run other (different) SPARQL queries on those results? How performant would that be compared to querying the SPARQL endpoint directly?

iulia-pasov commented 10 years ago

For the second question, yes, you can do that with filters (by subject, predicate, object). Take a look at the Elasticsearch Query DSL (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-queries.html). In what I have tested so far, some queries that take a couple of seconds (15) in SPARQL take just milliseconds in Elasticsearch. Filters in particular are a lot more efficient.
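As a rough illustration of what "filters by subject, predicate, object" could look like, here is a minimal Python sketch that turns a triple pattern into a request body for the `filtered` query of the Elasticsearch versions discussed in this thread. The helper name `triple_pattern_query` is hypothetical. It assumes that each indexed document keeps its subject URI in the `rdf:about` field and uses predicate URIs as field names (as the query later in this thread suggests), and that those fields are not tokenized. The object-only branch is only an approximation that runs a phrase match against `_all`.

```python
import json

# Assumption: the subject URI of each document is stored under rdf:about.
RDF_ABOUT = "http://www.w3.org/1999/02/22-rdf-syntax-ns#about"


def triple_pattern_query(subject=None, predicate=None, obj=None):
    """Build a search body for a triple pattern; None stands for a variable."""
    filters = []
    text_query = {"match_all": {}}
    if subject is not None:
        # Exact match on the subject URI.
        filters.append({"term": {RDF_ABOUT: subject}})
    if predicate is not None and obj is not None:
        # Exact match on one predicate/object pair.
        filters.append({"term": {predicate: obj}})
    elif predicate is not None:
        # Predicate bound, object free: the field only has to exist.
        filters.append({"exists": {"field": predicate}})
    elif obj is not None:
        # Object bound, predicate free: approximate with a phrase match on _all.
        text_query = {"match_phrase": {"_all": obj}}
    if not filters:
        return {"query": text_query}
    return {
        "query": {
            "filtered": {
                "query": text_query,
                "filter": {"bool": {"must": filters}},
            }
        }
    }


body = triple_pattern_query(
    subject="http://dbpedia.org/resource/Rio_De_Janeiro",
    obj="http://dbpedia.org/resource/Brazil",
)
print(json.dumps(body, indent=2))
```

The printed body could then be sent to `/rdfdata/resource/_search`. Filters do not contribute to scoring, which fits the exact-match nature of triple patterns.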

laurensdv commented 10 years ago

Thanks, could you give me some examples on these queries? When I execute a query now I typically use the text field.

{"query":{"bool":{"must":[{"text":{"resource.http://www.w3.org/1999/02/22-rdf-syntax-ns#about":"http://www.eea.europa.eu/no/themes/households/intro "}}],"must_not":[],"should":[]}},"from":0,"size":10,"sort":[],"facets":{}}

But this query also retrieves non-exact matches, so I guess it does a text-based search rather than an exact URI match. That brings in unwanted scoring, since in fact there can only be exact matches.

Documentation says I could use a filtered query to get rid of the scores, but then would I still use the text search?

{ "filtered" : {
"query" : { "text" : { "resource.http://www.w3.org/1999/02/22-rdf-syntax-ns#about" : "http://www.eea.europa.eu/no/themes/households/intro " } },
"filter" : {
// What to put here? If I put something here, is it still necessary to include stuff in the 'text' query?
}
} }

iulia-pasov commented 10 years ago

Take a look at the analyzers you have set. If you haven't set any, your text "http://www.eea.europa.eu/no/themes/households/intro" will be split by ":", "/","." and the results will be resources containing any of the divisions.
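To see why a tokenized URI produces partial matches, here is a small Python sketch that imitates the splitting described above. It is only an approximation for illustration, not the real Elasticsearch analyzer; `rough_tokens` is a hypothetical helper, and the second URI is made up for the comparison.

```python
import re


def rough_tokens(text):
    """Imitate the described behaviour: split on ':', '/' and '.', then lowercase."""
    return [part.lower() for part in re.split(r"[:/.]+", text) if part]


wanted = "http://www.eea.europa.eu/no/themes/households/intro"
# Hypothetical other resource, used only to show the token overlap.
other = "http://www.eea.europa.eu/no/themes/air/intro"

print(rough_tokens(wanted))
# Tokens shared by both URIs: any of them is enough for a text query to match.
print(sorted(set(rough_tokens(wanted)) & set(rough_tokens(other))))
```

Because the two URIs share most of their tokens, a text query for one of them also returns the other, only with a lower score.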

Take a look at http://centaurus-dev.eea.europa.eu/search/ There are a lot of predefined queries there, including filtered queries (e.g. if you set the 'undefined' value)

laurensdv commented 10 years ago

OK, thanks,

I noticed that the default analyzer is probably set. For the fields that I know contain URLs/URIs and that I don't want tokenized, what would be the recommended settings in the elasticsearch.yml file? Or what kind of query should I send to the index to achieve these settings: -XPOST {index : ... }?
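This question is not answered directly in the thread. One common approach on the Elasticsearch versions of that period is to declare the field as a `string` with `"index": "not_analyzed"` in the type mapping, before any data is indexed. The Python sketch below only builds such a mapping body. The helper name is hypothetical, the index and type names (`rdfdata`, `resource`) are the defaults mentioned in this thread, and whether field names containing dots behave as expected is an open assumption.

```python
import json

RDF_ABOUT = "http://www.w3.org/1999/02/22-rdf-syntax-ns#about"


def not_analyzed_mapping(doc_type, fields):
    """Build a mapping body that keeps the given fields as single, untokenized terms."""
    properties = {
        field: {"type": "string", "index": "not_analyzed"} for field in fields
    }
    return {doc_type: {"properties": properties}}


mapping = not_analyzed_mapping("resource", [RDF_ABOUT])
# Assumed usage: PUT this body to localhost:9200/rdfdata/resource/_mapping
# on an existing, still empty index, then start the river.
print(json.dumps(mapping, indent=2))
```

With such a mapping in place, a `term` filter on the field compares the whole URI instead of its tokens.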

pblin commented 10 years ago

I used the river to index RDF (from RDF/XML files and a SPARQL endpoint) but I could not get the query to work. The text search always comes back as follows. Any hint?

"hits": { "total": 2, "max_score": 1, "hits": [ { "_index": "_river", "_type": "rdf_river", "_id": "_meta", "_score": 1, "_source": { "type": "eeaRDF", "eeaRDF": { "endpoint": "http://poc.scholastic-labs.io:8890/sparql", "query": [ "CONSTRUCT {?s ?p ?o} WHERE {?s a http://wso2.scholastic-labs.io/schcore#Product ; ?p ?o}" ], "queryType": "construct" } } }, { "_index": "_river", "_type": "rdf_river", "_id": "_status", "_score": 1, "source": { "ok": true, "node": { "id": "XcvkFT-SPOQsPc6VJ6Z8Q", "name": "Odin", "transport_address": "inet[/10.32.242.171:9300]" } } } ] } }

iulia-pasov commented 10 years ago

Hello, is this what you get for /_search? Your data should be indexed under /rdfdata/resource, but you can always check the ES logs to see if there are any errors. Can you show me the messages from the ES log file?

pblin commented 10 years ago

I reindex using: curl -XPUT 'localhost:9200/_river/rdf_river/_meta' -d '{ "type" : "eeaRDF","eeaRDF" : { "urls" : ["http://localhost/PCD_DATA.rdf","http://localhost/PCD_Appeal_Level.rdf"]} }'

from /_search { "query": { "match_all": {} } }

the following is the result. does it look right? how do i do free text queries on properties?

{ "took": 165, "timed_out": false, "_shards": { "total": 6, "successful": 6, "failed": 0 }, "hits": { "total": 3, "max_score": 1, "hits": [ { "_index": "_river", "_type": "rdf_river", "_id": "_meta", "_score": 1, "_source": { "type": "eeaRDF", "eeaRDF": { "urls": [ "http://localhost/PCD_DATA.rdf", "http://localhost/PCD_Appeal_Level.rdf" ] } } }, { "_index": "_river", "_type": "rdf_river", "_id": "_status", "_score": 1, "_source": { "ok": true, "node": { "id": "x2vRKWx7RnS2tJvhefna5A", "name": "Box", "transport_address": "inet[/10.32.24.179:9300]" } } }, { "_index": "rdfdata", "_type": "stats", "_id": "1", "_score": 1, "_source": { "last_update": "2014-07-23T18:14:53" } } ] } }

pblin commented 10 years ago

Another question: is there any way to set the query parameters for the SPARQL endpoint in the indexing POST statement, so that queries return RDF/XML?

iulia-pasov commented 10 years ago

Your data should be in 'localhost:9200/rdfdata/resource' and you can visualize it with url: 'localhost:9200/rdfdata/resource/_search?pretty=1'. Can you show me the result? Take a look in there to see if the number of resources is the same as in the RDF file.
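A quick way to tell river bookkeeping documents apart from harvested data is to count the hits of a search response per index and type. The Python sketch below does this on a shortened copy of the `match_all` response posted earlier in this thread; `hits_per_index_type` is a hypothetical helper name.

```python
from collections import Counter


def hits_per_index_type(response):
    """Count the returned hits per (index, type) pair."""
    return Counter((hit["_index"], hit["_type"]) for hit in response["hits"]["hits"])


# Shortened version of the response shown above: two river documents, one stats document.
sample = {
    "hits": {
        "total": 3,
        "hits": [
            {"_index": "_river", "_type": "rdf_river", "_id": "_meta"},
            {"_index": "_river", "_type": "rdf_river", "_id": "_status"},
            {"_index": "rdfdata", "_type": "stats", "_id": "1"},
        ],
    }
}

counts = hits_per_index_type(sample)
print(counts[("_river", "rdf_river")])   # river configuration and status
print(counts[("rdfdata", "resource")])   # harvested resources
```

In that response none of the hits is an `rdfdata/resource` document, so nothing had been harvested yet even though the search returned results.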

You can customise your endpoint to return RDF/XML instead of HTML or other formats. If you use CONSTRUCT queries, the endpoint will most likely return a file (XML, JSON, etc.). However, since the query is run internally, you do not need to specify a file type when querying an endpoint.

If there are problems with the endpoint, you can send me the URL and index query to check it out for you.

pblin commented 10 years ago

@iulia-pasov the result: { "took" : 0, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 0, "max_score" : null, "hits" : [ ] } }

I use both RDF URLs and SPARQL. Is there something that I need to configure? The SPARQL endpoint is Virtuoso, the same as dbpedia.org. We can have it return RDF/XML by passing the "format=application/rdf+xml" parameter. Is there any way to set the query parameters?

I stopped and restarted elasticsearch and got errors in the log:

[2014-07-24 11:05:06,130][INFO ][gateway ] [Blue Diamond] recovered [2] indices into cluster_state
[2014-07-24 11:05:06,215][DEBUG][action.get ] [Blue Diamond] [_river][0]: failed to execute [[_river][rdf_river][_meta]: routing [null]]
org.elasticsearch.action.NoShardAvailableActionException: [_river][0] null
at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction$AsyncSingleAction.start(TransportShardSingleOperationAction.java:123)
at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction.doExecute(TransportShardSingleOperationAction.java:72)
at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction.doExecute(TransportShardSingleOperationAction.java:47)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:61)
at org.elasticsearch.client.node.NodeClient.execute(NodeClient.java:92)
at org.elasticsearch.client.support.AbstractClient.get(AbstractClient.java:179)
at org.elasticsearch.action.get.GetRequestBuilder.doExecute(GetRequestBuilder.java:112)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:85)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:59)
at org.elasticsearch.river.routing.RiversRouter$1.execute(RiversRouter.java:109)
at org.elasticsearch.river.cluster.RiverClusterService$1.run(RiverClusterService.java:103)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

iulia-pasov commented 10 years ago

It should be enough to set the query, query type and endpoint for the endpoint query or the list of URIs for the URI query. There is no need to set extra parameters if the endpoint is Virtuoso. I tried to query your endpoint (http://poc.scholastic-labs.io:8890/sparql) but I always end up with a timeout. Is there any way I can test your endpoint/URIs?
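For checking a Virtuoso endpoint by hand, outside the river, the `format` parameter mentioned above can simply be added to the request URL. The Python sketch below only builds such a URL and sends nothing. `sparql_get_url` is a hypothetical helper, and the assumption is that the endpoint accepts `query` and `format` as GET parameters, as described for Virtuoso in this thread. It is not a river setting.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def sparql_get_url(endpoint, query, result_format="application/rdf+xml"):
    """Build a GET URL that asks the endpoint for a specific result format."""
    return endpoint + "?" + urlencode({"query": query, "format": result_format})


url = sparql_get_url(
    "http://dbpedia.org/sparql",
    "CONSTRUCT {?s ?p ?o} WHERE {?s a <http://dbpedia.org/ontology/Band> ; ?p ?o} LIMIT 10",
)
print(url)
# Decode the query string again to check that nothing was lost in the encoding.
print(parse_qs(urlsplit(url).query)["format"])
```

Opening the printed URL from the machine that runs Elasticsearch also shows whether the endpoint is reachable from there at all.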

pblin commented 10 years ago

It is only available in our intranet.


pblin commented 10 years ago

The endpoint is only available on our intranet. I've tried the query in your examples but it did not work either. Just to verify: the elasticsearch version is 0.90.3 and the RDF river plugin is 1.4. Are they right?

iulia-pasov commented 10 years ago

We identified a bug yesterday in RDF river plugin 1.4. It has been fixed, but please check that you have the latest version. Also, if you run ES in the foreground, you should be able to see all the errors. Did you check if the plugin is installed? (bin/plugin -l | grep eea-rdf-river)

pblin commented 10 years ago

should i try 1.3 instead?

the error i got back after running one of the examples:

Starting RDF harvester: endpoint [http://semantic.eea.europa.eu/sparql], query [[CONSTRUCT {?s ?p ?o} WHERE {?s a <http://www.openlinksw.com/schemas/virtrdf#QuadMapFormat> ; ?p ?o}, CONSTRUCT {?s ?p ?o} WHERE { ?s a <http://www.eea.europa.eu/portal_types/AssessmentPart#AssessmentPart> ; ?p ?o}]],URLs [[]], index name [rdfdata], typeName resource
[2014-07-24 11:34:26,925][INFO ][cluster.metadata ] [Storm, Franklin] [_river] update_mapping rdf_river
[2014-07-24 11:34:28,874][INFO ][river.eea_rdf.support ] Could not parse [[CONSTRUCT {?s ?p ?o} WHERE {?s a <http://www.openlinksw.com/schemas/virtrdf#QuadMapFormat> ; ?p ?o}, CONSTRUCT {?s ?p ?o} WHERE { ?s a <http://www.eea.europa.eu/portal_types/AssessmentPart#AssessmentPart> ; ?p ?o}]]. Please provide a relevant query
com.hp.hpl.jena.query.QueryParseException: Encountered " "[" "[ "" at line 1, column 1. Was expecting one of: "base" ... "prefix" ... "select" ... "describe" ... "construct" ... "ask" ...
[2014-07-24 11:34:28,874][INFO ][river.eea_rdf.support ] Ended harvest for endpoint [http://semantic.eea.europa.eu/sparql], query [[CONSTRUCT {?s ?p ?o} WHERE {?s a <http://www.openlinksw.com/schemas/virtrdf#QuadMapFormat> ; ?p ?o}, CONSTRUCT {?s ?p ?o} WHERE { ?s a <http://www.eea.europa.eu/portal_types/AssessmentPart#AssessmentPart> ; ?p ?o}]],URLs [[]], index name rdfdata, type name resource

pblin commented 10 years ago

@iulia-pasov i tried the new release and only the SPARQL examples are working. The RDF files from URL and our SPARQL endpoints are not working.

For example: the indexing statement below for dbpedia does not work either: curl -XPUT 'localhost:9200/_river/rdf_river/_meta' -d ' { "type" : "eeaRDF", "eeaRDF" : { "endpoint" : "http://poc.scholastic-labs.io:8890/sparql", "query" : "SELECT ?s ?p ?o from <http://scholastic-labs.io/spd> WHERE {?s a <http://wso2.scholastic-labs.io/schcore#Product> . ?s ?p ?o}", "queryType" : "select" } }'

iulia-pasov commented 10 years ago

The error you got comes from the fact that lists of queries were not supported in v1.3. All you have to do is send only one query:

"query" : "CONSTRUCT {?s ?p ?o} WHERE {?s a <http://www.openlinksw.com/schemas/virtrdf#QuadMapFormat> ; ?p ?o}, CONSTRUCT {?s ?p ?o} WHERE { ?s a <http://www.eea.europa.eu/portal_types/AssessmentPart#AssessmentPart> ; ?p ?o}"

As for the other issue, it might be the case that your endpoint cannot be accessed from the ES server. Try this query: curl -XPUT 'localhost:9200/_river/dbp/_meta' -d ' { "type": "eeaRDF", "eeaRDF" : { "endpoint" : "http://dbpedia.org/sparql", "query": ["select ?s ?p ?o where {?s a <http://dbpedia.org/ontology/Band> . ?s ?p ?o} LIMIT 100"], "queryType" : "select" } }'

and check 'localhost:9200/rdfdata/resource/_search?pretty=1'

pblin commented 10 years ago

when i try: curl -XPUT 'localhost:9200/_river/dbp/_meta' -d ' { "type": "eeaRDF", "eeaRDF" : { "endpoint" : "http://dbpedia.org/sparql", "query": ["select ?s ?p ?o where {?s a <http://dbpedia.org/ontology/Band> . ?s ?p ?o} LIMIT 100"], "queryType" : "select" } }'

the result from 'localhost:9200/rdfdata/resource/_search?pretty=1' {"took":1,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}

pblin commented 10 years ago

this is the result from _search { "query": { "match_all": {} } }

==>

pblin commented 10 years ago

it seems that only the endpoint http://semantic.eea.europa.eu/sparql is working.

iulia-pasov commented 10 years ago

Which version of the river plugin are you using right now? The dbpedia query was written for 1.4 so maybe that is why nothing was indexed. You need to restart ES after you install a new version of the river plugin and run the query again. I tested the plugin on several (5) endpoints with no errors so I cannot understand what you are doing wrong.

Can you show me the error you get for the dbpedia query? The one in the initial message, which you later edited to remove the error message, comes from the fact that plugin version 1.3 does not accept lists of queries. Try to remove all the data from rdfdata/resource, run the query again and check /rdfdata/resource again. The query for 1.3 should be:

curl -XPUT 'localhost:9200/_river/dbp/_meta' -d '
{
"type": "eeaRDF",
"eeaRDF" : {
"endpoint" : "http://dbpedia.org/sparql",
"query": "select ?s ?p ?o where {?s a <http://dbpedia.org/ontology/Band> . ?s ?p ?o} LIMIT 100",
"queryType" : "select"
}
}'
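The difference between the two plugin versions described here (a single query string for 1.3, a list of queries for 1.4) can be captured in a small Python sketch that builds the river `_meta` document. `river_meta` and its `accepts_query_list` flag are hypothetical names; the keys `type`, `eeaRDF`, `endpoint`, `query` and `queryType` are the ones used in the curl commands in this thread.

```python
import json


def river_meta(endpoint, queries, query_type, accepts_query_list):
    """Build the _meta body; older plugins get a single string, newer ones a list."""
    if accepts_query_list:
        query_value = list(queries)
    else:
        if len(queries) != 1:
            raise ValueError("this plugin version takes exactly one query per river")
        query_value = queries[0]
    return {
        "type": "eeaRDF",
        "eeaRDF": {
            "endpoint": endpoint,
            "query": query_value,
            "queryType": query_type,
        },
    }


band_query = (
    "select ?s ?p ?o where "
    "{?s a <http://dbpedia.org/ontology/Band> . ?s ?p ?o} LIMIT 100"
)
old_style = river_meta("http://dbpedia.org/sparql", [band_query], "select", False)
new_style = river_meta("http://dbpedia.org/sparql", [band_query], "select", True)
print(json.dumps(old_style, indent=2))
print(json.dumps(new_style, indent=2))
```

Either body can be passed to `curl -XPUT 'localhost:9200/_river/dbp/_meta' -d ...` in place of the hand-written JSON.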

pblin commented 10 years ago

I reinstalled ES 0.90.3 and plugin 1.4, and the dbpedia query is working now: curl -XPUT 'localhost:9200/_river/dbp/_meta' -d ' { "type": "eeaRDF", "eeaRDF" : { "endpoint" : "http://dbpedia.org/sparql", "query": ["select ?s ?p ?o where {?s a <http://dbpedia.org/ontology/Band> . ?s ?p ?o} LIMIT 100"], "queryType" : "select" } }'

but using the RDF urls is still not working. curl -XPUT 'localhost:9200/_river/rdf_river/_meta' -d '{ "type" : "eeaRDF", "eeaRDF" : { "urls" : ["http://dd.eionet.europa.eu/vocabulary/aq/individualexceedances/rdf", "http://dd.eionet.europa.eu/vocabulary/aq/pollutant/rdf", "http://dd.eionet.europa.eu/vocabulary/aq/naturalsourcetype/rdf", "http://dd.eionet.europa.eu/vocabulary/aq/measurementmethod/rdf"] } }' any suggestions?

iulia-pasov commented 10 years ago

There was indeed an error on URI harvest, but it was displayed in the logs. I fixed it now. Reinstall the last version of the plugin and it should be ok.

pblin commented 10 years ago

two questions: 1) how do I use fuzzy query for a certain term? For example I try { "fuzzy" : { "resource.http://purl.org/dc/elements/1.1/title" : "Less" } } but got errors.

2) The URI is still not working. The log is appended as follows. Maybe I miss something?

[2014-07-28 19:47:02,553][INFO ][node ] [Wicked] version[0.90.9], pid[29984], build[a968646/2013-12-23T10:35:28Z]
[2014-07-28 19:47:02,554][INFO ][node ] [Wicked] initializing ...
[2014-07-28 19:47:02,613][INFO ][plugins ] [Wicked] loaded [eea-rdf-river], sites []
[2014-07-28 19:47:07,639][INFO ][node ] [Wicked] initialized
[2014-07-28 19:47:07,639][INFO ][node ] [Wicked] starting ...
[2014-07-28 19:47:07,809][INFO ][transport ] [Wicked] bound_address {inet[/0.0.0.0:9300]}, publish_address {inet[/10.42.64.183:9300]}
[2014-07-28 19:47:10,872][INFO ][cluster.service ] [Wicked] new_master [Wicked][3D4ERfRyTJCTq5-NlsSnpw][inet[/10.42.64.183:9300]], reason: zen-disco-join (elected_as_master)
[2014-07-28 19:47:10,932][INFO ][discovery ] [Wicked] elasticsearch/3D4ERfRyTJCTq5-NlsSnpw
[2014-07-28 19:47:10,969][INFO ][http ] [Wicked] bound_address {inet[/0.0.0.0:9200]}, publish_address {inet[/10.42.64.183:9200]}
[2014-07-28 19:47:10,970][INFO ][node ] [Wicked] started
[2014-07-28 19:47:11,045][INFO ][gateway ] [Wicked] recovered [0] indices into cluster_state
[2014-07-28 19:53:10,422][INFO ][cluster.metadata ] [Wicked] [_river] creating index, cause [auto(index api)], shards [1]/[1], mappings []
[2014-07-28 19:53:11,063][INFO ][cluster.metadata ] [Wicked] [_river] update_mapping vocab
[2014-07-28 19:53:11,089][INFO ][river.routing ] [Wicked] no river _meta document found, retrying in 1000 ms
[2014-07-28 19:53:12,131][INFO ][river.eea_rdf.support ] Starting RDF harvester: endpoint [http://semantic.eea.europa.eu/sparql], queries [null],URLs [[]], index name [rdfdata], typeName [resource]
[2014-07-28 19:53:12,135][INFO ][cluster.metadata ] [Wicked] [_river] update_mapping vocab

iulia-pasov commented 10 years ago

Regarding question 1), please check the ES Query DSL. Once the data is indexed, everything else is an ES issue. You might have problems because of the dots in the URIs, since ES might treat the parts separated by them as different elements. To be sure, you can use the normalisation options to rename properties and try again.
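The error text is not shown in the thread, so the following is only a guess at one possible cause: the search body needs a top-level `query` element, and a `fuzzy` query compares against indexed terms without analyzing the input, so a capitalised term may not line up with lowercased tokens. The Python sketch below builds a body under those assumptions. `fuzzy_search_body` is a hypothetical helper, and `title` stands for a property that was renamed through the normalisation options mentioned above.

```python
import json


def fuzzy_search_body(field, term):
    """Wrap a fuzzy query in a full search body; lowercase the term (assumption)."""
    return {"query": {"fuzzy": {field: term.lower()}}}


# "title" is a hypothetical normalised name for
# http://purl.org/dc/elements/1.1/title, chosen to avoid dots in the field name.
body = fuzzy_search_body("title", "Less")
print(json.dumps(body))
```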

Regarding question 2) I see there are no urls in the list. I found a typo in the README file. Use "uris" instead of "urls". I will commit the changes.
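Since the river logged an empty URL list without any error in this case, a small check of the `_meta` document before sending it can catch the misspelt key. This Python sketch is hypothetical (`check_river_sources` is not part of the plugin) and encodes only what this thread states: the key is `uris`, and a river needs either a list of URIs or an endpoint query.

```python
def check_river_sources(meta):
    """Return a list of problems found in a river _meta document (empty if none)."""
    config = meta.get("eeaRDF", {})
    problems = []
    if "urls" in config:
        problems.append('use "uris" instead of "urls"')
    if not config.get("uris") and not config.get("query"):
        problems.append("neither a list of URIs nor a query is set")
    return problems


# The form that was silently ignored in this thread.
bad = {"type": "eeaRDF", "eeaRDF": {"urls": ["http://localhost/PCD_DATA.rdf"]}}
good = {"type": "eeaRDF", "eeaRDF": {"uris": ["http://localhost/PCD_DATA.rdf"]}}
print(check_river_sources(bad))
print(check_river_sources(good))
```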

iulia-pasov commented 10 years ago

@pblin Is everything OK for you now?

pblin commented 10 years ago

@iulia-pasov the RDF files using the "uris" is still not working for me.

iulia-pasov commented 10 years ago

Can you post the error messages or the link to the URI?

pblin commented 10 years ago

i don't see any error in the log though. Just that the indices are not showing up.

[2014-07-31 13:25:22,666][INFO ][node ] [Arliss, Todd] version[0.90.3], pid[3887], build[5c38d60/2013-08-06T13:18:31Z]
[2014-07-31 13:25:22,696][INFO ][node ] [Arliss, Todd] initializing ...
[2014-07-31 13:25:23,512][INFO ][plugins ] [Arliss, Todd] loaded [eea-rdf-river], sites []
[2014-07-31 13:25:30,950][INFO ][node ] [Arliss, Todd] initialized
[2014-07-31 13:25:30,966][INFO ][node ] [Arliss, Todd] starting ...
[2014-07-31 13:25:31,441][INFO ][transport ] [Arliss, Todd] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/10.32.24.179:9300]}
[2014-07-31 13:25:34,596][INFO ][cluster.service ] [Arliss, Todd] new_master [Arliss, Todd][cotOLRwoTGGnTym4Nn0uCw][inet[/10.32.24.179:9300]], reason: zen-disco-join (elected_as_master)
[2014-07-31 13:25:34,897][INFO ][discovery ] [Arliss, Todd] elasticsearch/cotOLRwoTGGnTym4Nn0uCw
[2014-07-31 13:25:35,007][INFO ][http ] [Arliss, Todd] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/10.32.24.179:9200]}
[2014-07-31 13:25:35,008][INFO ][node ] [Arliss, Todd] started
[2014-07-31 13:25:35,214][INFO ][gateway ] [Arliss, Todd] recovered [0] indices into cluster_state
[2014-07-31 13:26:06,634][INFO ][cluster.metadata ] [Arliss, Todd] [_river] creating index, cause [auto(index api)], shards [1]/[1], mappings []
[2014-07-31 13:26:07,529][INFO ][cluster.metadata ] [Arliss, Todd] [_river] update_mapping rdf_river
[2014-07-31 13:26:07,741][INFO ][river.eea_rdf.support ] Starting RDF harvester: endpoint [http://semantic.eea.europa.eu/sparql], queries [null],URLs [[]], index name [rdfdata], typeName [resource]
[2014-07-31 13:26:07,749][INFO ][cluster.metadata ] [Arliss, Todd] [_river] update_mapping rdf_river

iulia-pasov commented 10 years ago

It seems that the river is not taking the URIs at all. Some questions:

Can you post the query in here? I'd like to test it myself to see if I can replicate the error.
Does the "Ending RDF harvest..." message appear in the ES logs?

pblin commented 10 years ago

I used your example.
No i did not see it.

iulia-pasov commented 10 years ago

Some URIs in the example were no longer available online and that is why the river crashes. I removed the broken URIs from the file. There will be a better error handling system in the new version.

hemed commented 9 years ago

Hi, Latest rdf river plugin (eea-rdf-river-plugin-1.4.2) is not compatible with ElasticSearch 1.3.4. How to make it compatible? I have tried to edit version number in the pom file but still no luck.

alecghica commented 9 years ago

@hemed this topic was not related to your question. Anyway, the compatibility for ElasticSearch 1.3.4 is on our short list roadmap so it will be implemented in the next weeks, just keep an eye on http://taskman.eionet.europa.eu/issues/18731

hemed commented 9 years ago

Thanks @alecghica for the information. We will be keeping an eye on that.

demarant commented 9 years ago

The original question has been answered by iulia. @hemed, a new version of the RDF river for ES 1.4.2 is now also available, and @pblin, we also have better error handling, so I am closing this question.

demarant commented 9 years ago

Regarding the question about uploading large RDF files or running large SPARQL queries, which may end in a timeout: you can split your RDF into multiple files or optimise your queries. More tips are found on this wiki page.
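Splitting is simplest for a line-based format such as N-Triples, where every line is one complete triple. The Python sketch below groups the lines of a dump into fixed-size chunks; `split_ntriples` is a hypothetical helper. It assumes that each chunk is afterwards converted to a format the river accepts (only RDF/XML at the time of this thread) and that blank nodes shared between chunks are not a concern.

```python
def split_ntriples(lines, chunk_size):
    """Yield lists of at most chunk_size triples, skipping blank lines and comments."""
    chunk = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        chunk.append(stripped)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        # Last, possibly shorter, chunk.
        yield chunk


# Made-up triples, only for demonstration.
dump = [
    "# example dump",
    "<http://example.org/a> <http://example.org/p> <http://example.org/b> .",
    "<http://example.org/b> <http://example.org/p> <http://example.org/c> .",
    "",
    "<http://example.org/c> <http://example.org/p> \"literal\" .",
]
for number, part in enumerate(split_ntriples(dump, 2), start=1):
    print(number, len(part))
```

For a real dump the `lines` argument can be an open file object, so the whole file never has to be held in memory.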

eea / eea.elasticsearch.river.rdf

Bulk Loading of large RDF file and matching by triple pattern #7