dbpedia / databus

A digital factory platform for managing files online with stable IDs, high-quality metadata, powerful API and tools for building on data: find, access, make interoperable, re-use
Apache License 2.0

Create version with newline in the description and abstract fails. #156

Open white-gecko opened 6 months ago

white-gecko commented 6 months ago

Creating the following version fails:

  {
    "@context": "https://databus.coypu.org/res/context.jsonld",
    "@graph": [
      {
        "@id": "https://databus.coypu.org/narndt/coypu",
        "@type": "Group",
        "title": "CoyPu"
      },
      {
        "@id": "https://databus.coypu.org/narndt/coypu/countries",
        "@type": "Artifact",
        "title": "Countries",
        "abstract": "Counties and regions",
        "description": "Counties and regions"
      },
      {
        "@type": [ ... ],
        "@id": "https://databus.coypu.org/narndt/coypu/countries/2023-09-18T122214Z",
        "hasVersion": "2023-09-18T122214Z",
        "title": "Countries",
        "abstract": "Countries\n2023-09-18T12:22:14Z",
        "description": "Countries\n2023-09-18T12:22:14Z",
        "license": "https://dalicc.net/licenselibrary/Cc010Universal",
        "wasDerivedFrom": "https://metadata.coypu.org/dataset/wikidata-distribution\nWikidata Query Service\nhttps://query.wikidata.org/",
        "distribution": [
          {
            "@type": "Part",
            "formatExtension": "ttl",
            "compression": "none",
            "downloadURL": "https://databus.coypu.org/dav/narndt/coypu/countries/2023-09-18T122214Z/countries_freqency=static.ttl",
            "dcv:frequency": "static"
          }
        ]
      }
    ]
  }

with the output:

PROTECT Authenticated request by narndt: /api/publish?fetch-file-properties=true&log-level=debug
GET /res/context.jsonld 200 1.805 ms - 3490
GET /res/context.jsonld 200 1.805 ms - 3490
Found 1 group graphs.
Processing group <https://databus.coypu.org/narndt/coypu>
2 triples selected via construct query.
Input has been processed by the auto-completer
SHACL validation successful
Context has been resubstituted with <https://databus.coypu.org/res/context.jsonld>
Saving group <https://databus.coypu.org/narndt/coypu> to narndt:coypu/group.jsonld
Found 1 artifact graphs.
Processing artifact <https://databus.coypu.org/narndt/coypu/countries>
4 triples selected via construct query.
Input has been processed by the auto-completer
SHACL validation successful
Context has been resubstituted with <https://databus.coypu.org/res/context.jsonld>
Saving artifact <https://databus.coypu.org/narndt/coypu/countries> to narndt:coypu/countries/artifact.jsonld
Found 1 version graphs.
Processing version <https://databus.coypu.org/narndt/coypu/countries/2023-09-18T122214Z>
Detected CV-graphs

      throw new JsonLdError(
JsonLdError [jsonld.ParseError]: Error while parsing N-Quads; invalid quad.
    at _parseNQuads (/databus/server/node_modules/rdfstore/node_modules/jsonld/js/jsonld.js:6964:13)
    at /databus/server/node_modules/rdfstore/node_modules/jsonld/js/jsonld.js:4236:20
    at work (/databus/server/node_modules/rdfstore/node_modules/jsonld/js/jsonld.js:3932:14)
    at Normalize.doWork (/databus/server/node_modules/rdfstore/node_modules/jsonld/js/jsonld.js:3944:5)
    at /databus/server/node_modules/rdfstore/node_modules/jsonld/js/jsonld.js:3993:10
    at /databus/server/node_modules/rdfstore/node_modules/jsonld/js/jsonld.js:3982:9
    at work (/databus/server/node_modules/rdfstore/node_modules/jsonld/js/jsonld.js:3932:14)
    at Normalize.doWork (/databus/server/node_modules/rdfstore/node_modules/jsonld/js/jsonld.js:3944:5)
    at iterate (/databus/server/node_modules/rdfstore/node_modules/jsonld/js/jsonld.js:3981:19)
    at /databus/server/node_modules/rdfstore/node_modules/jsonld/js/jsonld.js:3935:9
    at iterate (/databus/server/node_modules/rdfstore/node_modules/jsonld/js/jsonld.js:3985:5)
    at /databus/server/node_modules/rdfstore/node_modules/jsonld/js/jsonld.js:4223:13
    at /databus/server/node_modules/rdfstore/node_modules/jsonld/js/jsonld.js:3982:9
    at work (/databus/server/node_modules/rdfstore/node_modules/jsonld/js/jsonld.js:3932:14)
    at Normalize.doWork (/databus/server/node_modules/rdfstore/node_modules/jsonld/js/jsonld.js:3944:5)
    at iterate (/databus/server/node_modules/rdfstore/node_modules/jsonld/js/jsonld.js:3981:19)
    at /databus/server/node_modules/rdfstore/node_modules/jsonld/js/jsonld.js:4223:13
    at /databus/server/node_modules/rdfstore/node_modules/jsonld/js/jsonld.js:3982:9
    at work (/databus/server/node_modules/rdfstore/node_modules/jsonld/js/jsonld.js:3932:14)
    at Normalize.doWork (/databus/server/node_modules/rdfstore/node_modules/jsonld/js/jsonld.js:3944:5)
    at iterate (/databus/server/node_modules/rdfstore/node_modules/jsonld/js/jsonld.js:3981:19)
    at /databus/server/node_modules/rdfstore/node_modules/jsonld/js/jsonld.js:4223:13 {
  details: { line: 8 }

When I remove all newlines (\n), parsing succeeds, but publishing fails at a later stage.

JJ-Author commented 6 months ago

We have to consider whether disallowing newlines in the abstract would improve quality, since the abstract is intended to be short and concise. It could be annoying for uploaders, though. For descriptions, however, newlines should definitely be supported.

manonthegithub commented 6 months ago

Hey @white-gecko, is this the input that also causes an error in gstore (the later stage) when you remove the newlines from the version above? I mean this one: "virtuoso.jdbc4.VirtuosoException: SQ074: Line 38: SP030: SPARQL compiler, line 5: syntax error at '<' before 'https:'"

white-gecko commented 6 months ago


manonthegithub commented 6 months ago

ref #158

white-gecko commented 6 months ago

No, this is not correct. The problem described here is different.

The problem described in this issue is that the multi-line literals created from either of the two fields abstract or description:

      "abstract": "Countries\n2023-09-18T12:22:14Z",
      "description": "Countries\n2023-09-18T12:22:14Z",

are not represented correctly in N-Triples, i.e. the newlines are not encoded as \n in the RDF literal but are written as actual newlines, which is not allowed in N-Triples.
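
To illustrate the serialization I would expect: an N-Triples literal has to stay on one line, so a line break inside it is written as the two-character escape \n. A minimal sketch of such escaping follows; the helper name escapeNTriplesLiteral is hypothetical and not taken from the Databus code base.

```javascript
// Hypothetical helper, not part of the Databus backend.
// Escapes the characters that N-Triples does not allow verbatim
// inside a double-quoted literal: backslash, double quote, LF and CR.
function escapeNTriplesLiteral(value) {
  return value
    .replace(/\\/g, '\\\\') // backslash first, so later escapes are not doubled
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n')
    .replace(/\r/g, '\\r');
}

// Value taken from the abstract field in this issue.
const abstractValue = 'Countries\n2023-09-18T12:22:14Z';
const literal = '"' + escapeNTriplesLiteral(abstractValue) + '"';

// Prints the literal on a single line, with the line break written as \n.
console.log(literal);
```

With this kind of escaping the literal contains no raw line break, so a line-based N-Triples/N-Quads parser no longer sees a broken statement.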

holycrab13 commented 5 months ago

I was able to reproduce the bug, but the cause is something else. abstract and description can contain newlines; the newlines in the wasDerivedFrom field seem to be the issue here.

holycrab13 commented 5 months ago

Since this is a wasDerivedFrom issue, it is linked to #158 (you were right, @manonthegithub).

The current model specifies the value to be a URI: https://dbpedia.gitbook.io/databus/model/metadata/version#wasderivedfrom

Required fixes:

white-gecko commented 5 months ago

Are you sure it works with newlines in abstract and description?

holycrab13 commented 5 months ago

Yes, it does work

holycrab13 commented 5 months ago

A newline character in any URI will crash the current cluster node. The error happens in an async call in a third-party library and apparently cannot be caught within the Databus backend.

Currently, the sequence of processing inputs is:

I hoped that SHACL with nodekind:IRI would catch the error and changed the sequence to

This did not help but could be a solution if we specify a regex for each nodekind:IRI restriction that excludes any newlines.

Alternatively, I tried shuffling some function calls around in the Construct Query module. I converted the JSON-LD input to quads before inserting it into the in-memory store. This process drops any sketchy URIs with a warning. The warning is about the URI not being absolute, though.

  event: {
    type: [ 'JsonLdEvent' ],
    code: 'relative @id reference',
    level: 'warning',
    message: 'Relative @id reference found.',
    details: {
      id: 'https://metadata.coypu.org/dataset/wikidata-distribution\n' +
        'Wikidata Query Service\n' +
      expandedId: 'https://metadata.coypu.org/dataset/wikidata-distribution\n' +
        'Wikidata Query Service\n' +
  next: [Function: next]

This is from the latest jsonld js code that also powers the JSON-LD playground.

When bypassing the construct query issue, there will finally be a correct error message returned by Jena from the Gstore.

Saving dataset to janfo:coypu/countries/2023-09-18T122214Z/dataid.jsonld
StatusCodeError: 400 - {"message":"Wrong input data. SQ074: Line 22: syntax error. Error saving data, potentially caused by: \nBad IRI: <https://metadata.coypu.org/dataset/wikidata-distribution\nWikidata Query Service\nhttps://query.wikidata.org/> Spaces are not legal in URIs/IRIs."}

Fixing this cleanly in the backend turns out to be a bit tricky. It feels bad that the input passes all server-side checks and then gets rejected by the database with the correct error.

It would be possible to add a small validator module that goes over all "@id" fields and checks the values for newlines.
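
Such a validator could look roughly like the sketch below. It assumes the input has already been expanded, so that IRI-valued properties such as wasDerivedFrom appear as objects with an "@id" key; in the compacted input the value is a plain string and this walk would not find it. The function name findInvalidIds and the property IRI used in the example are assumptions for illustration only. The check rejects any whitespace, not just newlines, since the database error above complains about spaces in IRIs as well.

```javascript
// Hypothetical validator sketch, not part of the Databus backend.
// Walks a JSON-LD structure and collects every "@id" value that
// contains whitespace (newlines included).
function findInvalidIds(node, found = []) {
  if (Array.isArray(node)) {
    for (const item of node) findInvalidIds(item, found);
  } else if (node !== null && typeof node === 'object') {
    for (const [key, value] of Object.entries(node)) {
      if (key === '@id' && typeof value === 'string' && /\s/.test(value)) {
        found.push(value);
      } else {
        findInvalidIds(value, found);
      }
    }
  }
  return found;
}

// Assumed expanded form of the failing property from this issue.
// The property IRI is an assumption, chosen only for illustration.
const expandedVersion = {
  '@id': 'https://databus.coypu.org/narndt/coypu/countries/2023-09-18T122214Z',
  'http://www.w3.org/ns/prov#wasDerivedFrom': [
    { '@id': 'https://metadata.coypu.org/dataset/wikidata-distribution\nWikidata Query Service\nhttps://query.wikidata.org/' }
  ]
};

const invalidIds = findInvalidIds(expandedVersion);
if (invalidIds.length > 0) {
  // In a real backend this would become a 400 response before anything
  // reaches the construct query or the database.
  console.log('Rejected: ' + invalidIds.length + ' "@id" value(s) contain whitespace');
}
```

Running a check like this before the construct query would let the backend reject the input with a clear message instead of crashing in the third-party library.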

holycrab13 commented 4 months ago

This will be fixed by running SHACL validation earlier in the input processing; related to #167.