LibreCat / Catmandu-RDF

Catmandu modules for working with RDF data
https://metacpan.org/release/Catmandu-RDF

Harden the RDF importer against bad data #21

Open phochste opened 9 years ago

phochste commented 9 years ago

I'm trying to import large datasets with Catmandu::Importer::RDF, but processing stops at every syntax error in the data. With SPARQL endpoints the world isn't perfect: bad encodings and spaces in URIs can happen. Currently there is no way on the client side to fix these errors on the fly, or to ignore them.

Concretely, I make these calls in a loop to extract data from an endpoint:

   my $importer = Catmandu->importer('RDF', url => $url, sparql => $sparql, sparql_result => 'aref');
   $importer->each(sub {
       #MAGIC
   });

The internal RDF::aREF decoder in Catmandu::Importer::RDF croaks, so #MAGIC can't catch the error and decide what to do; all processing stops.
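
As a client-side stopgap, each call in the loop can be wrapped in `eval` so that a croak aborts only the current query rather than the whole run. This is a minimal sketch, not a fix in the importer: `run_queries` and the `$make_importer` factory are hypothetical names, and in real use the factory would return `Catmandu->importer('RDF', ...)`. It skips the rest of a failing response; it cannot recover the individual good results inside it.

```perl
use strict;
use warnings;

# Hypothetical helper: run one import per query and keep going when one croaks.
# $make_importer is a code ref that takes a SPARQL string and returns an object
# with an each() method (an assumption; e.g. a Catmandu RDF importer).
sub run_queries {
    my ($make_importer, $queries, $on_record) = @_;
    my @failed;
    for my $sparql (@$queries) {
        # eval returns 1 only if the import finished without dying
        my $ok = eval {
            $make_importer->($sparql)->each($on_record);
            1;
        };
        # Remember the query and the error text so it can be retried or logged
        push @failed, { sparql => $sparql, error => "$@" } unless $ok;
    }
    return \@failed;
}
```

The returned list of failed queries can be logged or retried later, for example with a narrower query that avoids the offending resource.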

nichtich commented 9 years ago

I wonder what other importers do with invalid JSON, XML, or CSV, but anyway... Errors could be ignored for single triples (option --triples) and/or for single SPARQL results (RDF::Query::VariableBindings or RDF::Trine::VariableBindings). Is this what you propose?

phochste commented 9 years ago

The same question should apply to Catmandu::OAI.

With errors in a local N3, Turtle or RDF/XML dump you have a chance to open Vi and try to fix things. In every big RDF dump I've downloaded in the last few weeks I see syntax errors, be it from Europeana, VIAF, you name it. A couple dozen mistakes stop the processing of millions of triples. For downloads via SPARQL or LDF this is much harder to deal with: the data streams into the application and gets parsed before I have a chance to fix it, and the download stops in the middle. Most of the errors are due to encoding bad data into aREF format: generating an IRI from bad data (things which include tabs, spaces, newlines). Could there be a way to ignore bad triples in the --triples case and bad responses in the SPARQL case? Or another way?
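
To make the "ignore bad triples" idea concrete, the check could look roughly like the following. This is only a sketch: `looks_like_clean_iri` and `partition_triples` are hypothetical names, not part of Catmandu::RDF or RDF::aREF, and the test only rejects whitespace, control characters and the few characters that may never appear unescaped in an IRI. It is not a full IRI validator.

```perl
use strict;
use warnings;

# Hypothetical check: reject IRIs containing whitespace, control characters,
# or any of < > " { } | \ ^ ` (not allowed unescaped in an IRI).
sub looks_like_clean_iri {
    my ($iri) = @_;
    return 0 unless defined $iri && length $iri;
    return $iri !~ /[\s\x00-\x1F\x7F<>"{}|\\^`]/ ? 1 : 0;
}

# Split triples ([subject, predicate, object] array refs) into good and bad.
# Only subject and predicate are checked, since the object may be a literal.
sub partition_triples {
    my (@triples) = @_;
    my (@good, @bad);
    for my $t (@triples) {
        my ($s, $p) = @$t;
        if (looks_like_clean_iri($s) && looks_like_clean_iri($p)) {
            push @good, $t;
        }
        else {
            push @bad, $t;
        }
    }
    return (\@good, \@bad);
}
```

The bad triples could then be logged or handed to a user-supplied callback instead of stopping the import.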

phochste commented 9 years ago

I've added a mock-http-test branch to test SPARQL responses, but it requires two test dependencies, Encoding and Test::LWP::UserAgent (both also found in RDF::Trine, btw).