Open white-gecko opened 6 months ago
We have to consider whether disallowing newlines in the abstract would improve quality, since it is intended to be short and concise. However, that could be annoying for uploaders. For descriptions, newlines should definitely be supported.
Hey @white-gecko, is this the input that also causes an error in gstore (the later stage), namely "virtuoso.jdbc4.VirtuosoException: SQ074: Line 38: SP030: SPARQL compiler, line 5: syntax error at '<' before 'https:'",
when you remove the newlines from the version above?
Yes
ref #158
No, this is not correct. The problem described here is different.
The problem described in this issue is that the multi-line literals created from either of the two fields abstract or description:
"abstract": "Countries\n2023-09-18T12:22:14Z",
"description": "Countries\n2023-09-18T12:22:14Z",
are not represented correctly in N-Triples, i.e. the newlines are not encoded as \n
in the RDF literal but appear as actual newlines, which is not allowed in N-Triples.
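For illustration, here is a minimal sketch of the escaping that N-Triples expects for such a literal. The function name `escapeNTriplesLiteral` and the subject IRI are hypothetical and not taken from the Databus code; only the `abstract` value comes from the input above.

```javascript
// Hypothetical helper, not part of the Databus backend.
// N-Triples string literals must not contain raw '"', '\', LF or CR,
// so each of them is replaced by its escape sequence.
function escapeNTriplesLiteral(value) {
  return value
    .replace(/\\/g, '\\\\') // backslash first, so later escapes are not doubled
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n')
    .replace(/\r/g, '\\r');
}

const abstract = 'Countries\n2023-09-18T12:22:14Z';

// The subject IRI below is a placeholder for illustration only.
const triple =
  '<https://example.org/dataset> <http://purl.org/dc/terms/abstract> "' +
  escapeNTriplesLiteral(abstract) +
  '" .';

console.log(triple);
// prints: <https://example.org/dataset> <http://purl.org/dc/terms/abstract> "Countries\n2023-09-18T12:22:14Z" .
```

With this escaping, the whole triple stays on a single line, which is what an N-Triples parser requires.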
I was able to reproduce the bug; however, the cause is something else. abstract/description can have newlines; the derivedFrom field containing newlines seems to be the issue here.
Since this is a derivedFrom issue, it is linked to #158 (you were right, @manonthegithub).
The current model specifies the value to be a URI: https://dbpedia.gitbook.io/databus/model/metadata/version#wasderivedfrom
Required fixes:
Are you sure it works with newlines in abstract and description?
Yes, it does work
A newline character in any URI will crash the current cluster node. The error happens in an async call in a third-party library and apparently cannot be caught within the Databus backend.
Currently, the sequence of processing inputs is:
I hoped that SHACL with nodekind:IRI would catch the error and changed the sequence to
This did not help, but it could become a solution if we specify a regex (e.g. via sh:pattern) for each nodekind:IRI restriction that excludes any newlines.
Alternatively, I tried shuffling some function calls around in the Construct Query module. I converted the JSONLD input to quads before inserting into the in-memory-store. This process drops any URIs that are sketchy with a warning. This warning is about the URI not being absolute though.
{
  event: {
    type: [ 'JsonLdEvent' ],
    code: 'relative @id reference',
    level: 'warning',
    message: 'Relative @id reference found.',
    details: {
      id: 'https://metadata.coypu.org/dataset/wikidata-distribution\n' +
        'Wikidata Query Service\n' +
        'https://query.wikidata.org/',
      expandedId: 'https://metadata.coypu.org/dataset/wikidata-distribution\n' +
        'Wikidata Query Service\n' +
        'https://query.wikidata.org/'
    }
  },
  next: [Function: next]
}
This is from the latest jsonld.js code, which also powers the JSON-LD playground.
When bypassing the construct query issue, there will finally be a correct error message returned by Jena from the Gstore.
Saving dataset to janfo:coypu/countries/2023-09-18T122214Z/dataid.jsonld
StatusCodeError: 400 - {"message":"Wrong input data. SQ074: Line 22: syntax error. Error saving data, potentially caused by: \nBad IRI: <https://metadata.coypu.org/dataset/wikidata-distribution\nWikidata Query Service\nhttps://query.wikidata.org/> Spaces are not legal in URIs/IRIs."}
Fixing this in the backend cleanly turns out to be a bit tricky. It feels bad that the input passes all server side checks and then gets rejected by the database with the correct error.
It would be possible to add a small validator module that goes over all "@id" fields and checks the values for newlines.
Will be fixed by running SHACL validation earlier in the input processing; related to #167.
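A rough sketch of what such a validator could look like follows. The function name `findInvalidIds` and the shape of the returned problem list are assumptions for illustration, not existing Databus code. It also assumes the check runs on JSON-LD where IRI values appear under "@id" keys (for example after expansion), and it rejects any whitespace, not only newlines, since none of it is legal in an IRI.

```javascript
// Hypothetical validator, not part of the Databus backend.
// Recursively walks a parsed JSON-LD document and collects every "@id"
// value that contains whitespace (newlines, spaces, tabs).
function findInvalidIds(node, path = '$', problems = []) {
  if (Array.isArray(node)) {
    node.forEach((item, i) => findInvalidIds(item, `${path}[${i}]`, problems));
  } else if (node !== null && typeof node === 'object') {
    for (const [key, value] of Object.entries(node)) {
      const childPath = `${path}.${key}`;
      if (key === '@id' && typeof value === 'string') {
        if (/\s/.test(value)) {
          problems.push({ path: childPath, value });
        }
      } else {
        findInvalidIds(value, childPath, problems);
      }
    }
  }
  return problems;
}

// Example input modelled on the failing derivedFrom value from this issue;
// the top-level "@id" is a placeholder.
const input = {
  '@id': 'https://example.org/version',
  'http://www.w3.org/ns/prov#wasDerivedFrom': [
    {
      '@id':
        'https://metadata.coypu.org/dataset/wikidata-distribution\n' +
        'Wikidata Query Service\n' +
        'https://query.wikidata.org/',
    },
  ],
};

const problems = findInvalidIds(input);
if (problems.length > 0) {
  console.log(`Rejected: ${problems.length} "@id" value(s) contain whitespace`);
}
```

Running a check like this before the construct query step would let the backend answer with a 400 and a readable message instead of relying on the database to reject the input.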
Creating the following version fails:
with the output:
When I remove all newlines (\n), the parsing is successful, but it fails at a later stage.