dbpedia / gstore

Git repo / triple store hybrid graph storage
Apache License 2.0

Storing Documents: Non JSON-LD Content dropped #38

Open JonathanJustavino opened 2 months ago

JonathanJustavino commented 2 months ago

Currently, when saving a document, content that cannot be parsed to RDF in Jena is dropped (e.g. JSON content that is not JSON-LD is dropped before the file is stored). It would be nice if the git part of the gstore stored the original document, and any RDF content converted from the document were stored in the triple store.

manonthegithub commented 2 months ago

This needs discussion. It may lead to problems in Databus: if someone posts an invalid document, it would still be saved to git, which is not cool. So we need to develop rules, or maybe add a param (like is_rdf_data=true).

holycrab13 commented 2 months ago

Not a problem for Databus, all inputs are validated on the Databus side first before saving

holycrab13 commented 2 months ago

Invalid documents should be rejected (bad JSON syntax) but all JSON is somewhat JSON-LD - even if it's just an empty graph. A long JSON document might still contain 2-3 triples as JSON-LD, we can't lose the entire rest of the document though.

Accept only if parsable of course, but Jena will usually just ignore anything that isn't LD in any JSON. Triple store then holds all triples, git has the full docs with unmodified content.

This is a hard requirement for the Gstore to serve as a database for MOSS and the OEF, not all of their JSON content is JSON-LD. Also all documents need to be saved "as is" - so after the validation, the JSON document should be saved as it came and not the JSON-LD printed out of the model.

manonthegithub commented 2 months ago

Ahhh so you want to store only jsons? I thought just any kind of file... Yes, there is also some postprocessing of the jsonlds: they are stored in minimised format and do not contain the full context, just the URI for it.

manonthegithub commented 2 months ago

So that still will be json/jsonlds... I see now... this does not seem to be a problem

manonthegithub commented 2 months ago

So this would look like that: you have mixed json/jsonld, you want jsonld part saved to triple store and all together also to git.

• if the json is invalid (and therefore the whole document is invalid) -> we reject the whole document
• if the jsonld part is invalid (wrong syntax/parser error) -> we reject the whole document (save nowhere)
• if the jsonld part is valid -> we save

Is this how you want it? @holycrab13
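
The three rules above can be sketched roughly as follows. This is only an illustration in Python, not the gstore code: `handle_save`, `toy_extractor` and `Triple` are made-up names, and `toy_extractor` is a deliberately naive stand-in for a real JSON-LD parser (Jena in gstore).

```python
import json
from typing import Callable, List, Optional, Tuple

Triple = Tuple[str, str, str]


def toy_extractor(doc: object) -> List[Triple]:
    """Hypothetical stand-in for a JSON-LD parser; NOT how Jena works."""
    if not isinstance(doc, dict) or "@id" not in doc:
        return []  # no linked data found: empty graph
    subject = doc["@id"]
    if not isinstance(subject, str):
        raise ValueError("@id must be a string")
    # Toy rule: only keys that look like IRIs become predicates.
    return [(subject, key, str(value))
            for key, value in doc.items() if key.startswith("http")]


def handle_save(
    body: str, extract_triples: Callable[[object], List[Triple]]
) -> Tuple[bool, Optional[str], List[Triple]]:
    """Return (accepted, content_for_git, triples_for_triple_store)."""
    try:
        doc = json.loads(body)
    except json.JSONDecodeError:
        # Rule 1: invalid JSON -> reject the whole document.
        return False, None, []
    try:
        triples = extract_triples(doc)
    except ValueError:
        # Rule 2: the JSON-LD part does not parse -> reject, save nowhere.
        return False, None, []
    # Rule 3: valid -> git gets the body unchanged, the triple store the triples.
    return True, body, triples
```

Note that in rule 3 the content handed to git is the request body as received, not a re-serialisation of the parsed model.
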

JJ-Author commented 1 month ago

in case you integrate this please really make it configurable/switchable and non-default. g-store is supposed to be a graph store with a simple git history of the graphs in git - not a json store^^.

@manonthegithub i think they ask you to store the json in git as is - so not stripping non-ld content and not normalizing it.

in case you accept a json file that does not contain any ld (so it leads to no triples), you need to think about the read (and delete) api calls, because this file is invisible from the sparql endpoint and the read call would just return nothing. so how do you want to read the plain json @holycrab13 @JonathanJustavino? i think you would need new api functions.

it also seems interesting that there is no api function to get all files or at least the history of a file, but I guess we never needed it so far^^.

manonthegithub commented 1 month ago

@JJ-Author to see all files you can go to the file browser which is included in gstore (it is at the /file path). The other calls are not there yet, true :)

holycrab13 commented 1 month ago

So this would look like that: you have mixed json/jsonld, you want jsonld part saved to triple store and all together also to git.

• if the json is invalid (and therefore the whole document is invalid) -> we reject the whole document
• if the jsonld part is invalid (wrong syntax/parser error) -> we reject the whole document (save nowhere)
• if the jsonld part is valid -> we save

Is this how you want it? @holycrab13

Yes, that's it.

@JJ-Author either /file, but the /g path should also just return the document imo. This would probably require handling JSON-LD differently from any other RDF syntax.

manonthegithub commented 1 month ago

Hmmmm.... I had some more thoughts on it... It seems like having this feature is a dirty hacky solution for some particular problem which does not really fit into the concept of gstore; it is just easier to implement it like this in the current case. I mean gstore is not supposed to treat json differently from the other formats. If we make this change we will have an inconsistency in behaviour, which will for sure lead to some problems in the future. I would discuss in more detail why you want this and what the possible alternative solutions are. My question is: why can't we just convert this non-ld content to ld content so that the consistency stays? (It may be some extension of DataId or some custom ontology.)

manonthegithub commented 1 month ago

Why would you in the first place mix both formats is also not clear to me? Looks like a design issue. I would red flag this kinda things and better recommend to the guys doing it to think again about the design.

manonthegithub commented 1 month ago

we could potentially implement a method which will allow to store non-RDF content in gstore too (just saving in git), but separately. so that json or whatever stuff can be separated from RDF

manonthegithub commented 1 month ago

it can actually be the same method for save and read, just checking whether it's rdf content or not: if yes, parse the rdf and put it into virtuoso; if not, just save to git, and voila

holycrab13 commented 1 month ago

We have the case where the document is 20% RDF and the rest just JSON, so a hard separation won't work in this case. I think it would be best to save to git and then on the virtuoso side create a graph for the document and throw in whatever is parsable RDF. A non RDF document will just end up with an empty graph.

holycrab13 commented 1 month ago

Why would you in the first place mix both formats is also not clear to me? Looks like a design issue. I would red flag this kinda things and better recommend to the guys doing it to think again about the design.

While it's not great, it's still somewhat valid and all JSON-LD parsers can deal with it. I think there is no real reason not to support it

manonthegithub commented 1 month ago

@holycrab13

We have the case where the document is 20% RDF and the rest just JSON, so a hard separation won't work in this case. I think it would be best to save to git and then on the virtuoso side create a graph for the document and throw in whatever is parsable RDF. A non RDF document will just end up with an empty graph.

I don't see why the separation won't work. It doesn't matter what % is which part, you don't do it manually. In the first place I would ask the guys not to mix the formats; do they have any reasoning for it? Could you ask them to send RDF? They can convert non-RDF fields to RDF.

While it's not great, it's still somewhat valid and all JSON-LD parsers can deal with it. I think there is no real reason not to support it

It is valid in a sense you get from gstore. If you claim the format to be jsonld then only json-ld is saved. If you claim format to be json then you save json, but it is not parsed as json ld. It is just a coincidence that json-ld is also a valid json.

One more thing I could recommend, if they still want to do it, is to have one field containing the whole json-ld document alongside the other fields. Then you could easily take the json-ld part and save it to gstore, and for json we can create a new method which allows saving non-rdf data.

like that:

{
  "bla": "bla",
  "bla2": "bla2",
  "jsonld": <here is the full json-ld object>
}
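
With a wrapper layout like this, separating the two parts would be mechanical. A minimal Python sketch, assuming the field is literally called "jsonld" as in the example (`split_wrapper` is a made-up helper name, not an existing gstore function):

```python
import json
from typing import Any, Dict, Tuple


def split_wrapper(body: str, ld_field: str = "jsonld") -> Tuple[Any, Dict[str, Any]]:
    """Split a wrapper document into (json-ld part, remaining plain json fields)."""
    doc = json.loads(body)
    if not isinstance(doc, dict):
        raise ValueError("wrapper document must be a JSON object")
    ld_part = doc.get(ld_field)  # None if the document carries no JSON-LD
    plain_part = {key: value for key, value in doc.items() if key != ld_field}
    return ld_part, plain_part


ld, plain = split_wrapper(
    '{"bla": "bla", "bla2": "bla2", "jsonld": {"@id": "http://example.org/a"}}'
)
# ld goes to the RDF save call, plain to a (new) non-RDF save call.
```
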

In general, I just think we can find a better solution than mixing the formats. I am quite certain if we do it now we will get some issues in the future. Better just to support some new formats...

If the above-mentioned is not possible, one possible workaround could be to make up a new custom media type for that and require clients to specify it, something like application/jsonld+json. Then we could potentially make a separate implementation for working with these documents which will not interfere with normal json-lds.

holycrab13 commented 1 month ago

" It is just a coincidence that json-ld is also a valid json." not a coincidence, this is in the definition. Mixing is done often, doing all JSON-LD is not viable for the client, since there's a LOT of json fields. The formats ARE usually mixed, no need to make something up here. It does not interfere with anything JSON-LD, therefore I do not really see a problem with implementing it.

This is a hard requirement that we need for MOSS and DLR

JJ-Author commented 1 month ago

i think persisting the json makes sense especially when you have an external json-ld context that might change over time. then you can reapply the extended context later and rebuild/reload the graphs. we actually also had this use case in lod-geoss in the beginning, but then the context became too complicated and we converted the json to rdf from code. "syncing" and versioning that external context is cool but adds another complexity (that is probably not needed, but it shows that json-ld is indeed by nature very different from all other rdf formats).

JJ-Author commented 1 month ago

https://gstore-playground.tools.dbpedia.org/file/ does show an error. how do you list all files?

manonthegithub commented 2 weeks ago

I am still against converting gstore to a document store, but here are the proposed changes:

Key Changes

  1. Document Preservation: Save documents in their original format instead of converting to JSON-LD. (new)
  2. Automatic RDF Extraction: Extract RDF data upon saving and store it in Virtuoso. (was there)
  3. Change of API [Option 1] -> not supported by swagger (will not be well present in docs), but nicer:
    • Save Document and RDF:
      • Endpoint: /document/<path>
      • Method: POST
      • Action: Saves the document and extracts RDF to Virtuoso.
    • Retrieve Document:
      • Endpoint: /document/<path>
      • Method: GET
      • Action: Retrieves the raw document.
    • Retrieve RDF Graph of a document:
      • Endpoint: /graph/<path>
      • Method: GET
      • Action: Returns the extracted RDF graph.
  4. Change of API [Option 2] -> supported by swagger -> easier to use:
    • Save Document and RDF:
      • Endpoint: /document/save
      • Param: path = "<path>"
      • Method: POST
      • Action: Saves the document and extracts RDF to Virtuoso.
    • Retrieve Document:
      • Endpoint: /document/read
      • Param: path = "<path>"
      • Method: GET
      • Action: Retrieves the raw document.
    • Retrieve RDF Graph of a document:
      • Endpoint: /graph/read
      • Param: path = "<path>"
      • Method: GET
      • Action: Returns the extracted RDF graph.
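
To make Option 2 more concrete, here is a small Python sketch that only builds the request URLs. The base URL is a placeholder, `option2_url` is a made-up helper, and passing repo and prefix as query parameters next to path is an assumption about the final API, not confirmed behaviour.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "http://localhost:8080"  # placeholder, not a real deployment


def option2_url(endpoint: str, repo: str, path: str, prefix: Optional[str] = None) -> str:
    """Build an Option 2 style URL such as /document/save?repo=...&path=..."""
    params = {"repo": repo, "path": path}
    if prefix is not None:
        params["prefix"] = prefix
    # urlencode percent-encodes reserved characters, e.g. "/" becomes %2F.
    return f"{BASE_URL}{endpoint}?{urlencode(params)}"


save_url = option2_url("/document/save", "moss", "docs/a.jsonld")   # POST
read_url = option2_url("/document/read", "moss", "docs/a.jsonld")   # GET, raw document
graph_url = option2_url("/graph/read", "moss", "docs/a.jsonld")     # GET, extracted RDF
```
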
manonthegithub commented 2 weeks ago

Option 2 for now seems better, we can switch to Option 1 if very much needed in future without much effort.

kurzum commented 2 weeks ago

Here is a slightly differently phrased version of Option 2. I think we can make the parameter ?uri, but ?path+?prefix might also be ok, maybe better than ?repo+?path+?prefix

would look like this: /graph/save?uri=https://databus.dbpedia.org/adrian1703/20news/talk.religion.misc/18828/dataid.jsonld

| PURE GRAPH MODE | PARAMETER | BEHAVIOUR |
| -- | -- | -- |
| graph/read | ?uri=$GRAPHURI | returns parsed version |
| graph/save | ?uri=$GRAPHURI | parses POST body and commits parsed version as JSON-LD |
| graph/delete | ?uri=$GRAPHURI | deletes from GIT and drops graph from triple store |

| DOC MODE | PARAMETER | BEHAVIOUR |
| -- | -- | -- |
| graph/read | ?uri=$GRAPHURI | returns parsed version |
| graph/write | | DEACTIVATED |
| graph/delete | | DEACTIVATED |
| doc/save | ?uri=$GRAPHURI | extracts graph from body, but commits body as is |
| doc/read | ?uri=$GRAPHURI | returns GIT version |
| doc/delete | ?uri=$GRAPHURI | deletes from GIT and drops graph from triple store |

manonthegithub commented 2 weeks ago

it is repo + path + prefix; the previous message lists only the changes, so repo and prefix remain

Graph mode just won't be there, only doc mode. Graph mode is the current version.

Only uri won't work, because then it is not clear which part is the prefix and which is the repo. The uri can contain arbitrarily many segments in the prefix part, so path + prefix + repo is the best option

kurzum commented 2 weeks ago

Ok, right, so the prefix, repo part is answered. Open questions:

manonthegithub commented 2 weeks ago

Would it be easy to implement a config option for pure graph mode, doc mode? Maybe we don't need it right now.

No it is different systems, different approaches to work with data, we do either one or the other, not both. So we treat data as docs or graphs here is the choice.

Does the Doc Mode require a file ending?

Yes, it must be there, no other way to understand what kinda data it is (when reading by /graph/read), or we need to store metadata for that. If we decide to use metadata, then Content-Type may be used again.

Does the content-type on post need to match the file ending?

The content-type will be ignored, only the Accept header for /graph/read will be used

kurzum commented 2 weeks ago

Would it be easy to implement a config option for pure graph mode, doc mode? Maybe we don't need it right now.

No it is different systems, different approaches to work with data, we do either one or the other, not both. So we treat data as docs or graphs here is the choice.

Hm, really? the underlying functionality is the same. GIT and Virtuoso just accept data. I would implement it with two different servlet/scalatra implementations and different web.xml and swagger. Depending on which one you start you get pure graph or doc mode. My question was how difficult it was code wise and I think, we should only do the doc mode for now, but implement it in a way that we can add a different servlet implementation later.

Does the Doc Mode require a file ending?

Yes, it must be there, no other way to understand what kinda data it is (when reading by /graph/read), or we need to store metadata for that. If we decide to use metadata, then Content-Type may be used again.

Ok, so graph/read would need this as internal input to select the parser. SELECT ?g {GRAPH ?g {?s ?p ?o} } on virtuoso is probably not implemented currently.

Does the content-type on post need to match the file ending?

The content-type will be ignored, only the Accept header for /graph/read will be used

I meant on POST and doc/save this is where content-type is/should be sent by the client. Posting "Content-type: text/turtle" to ?path=file.jsonld will throw an error then? We would need a list of file endings to content-type then.

manonthegithub commented 2 weeks ago

I meant on POST and doc/save this is where content-type is/should be sent by the client. Posting "Content-type: text/turtle" to ?path=file.jsonld will throw an error then? We would need a list of file endings to content-type then.

the answer is the same as before. It will be ignored, I understood what you meant.

Hm, really? the underlying functionality is the same. GIT and Virtuoso just accept data. I would implement it with two different servlet/scalatra implementations and different web.xml and swagger. Depending on which one you start you get pure graph or doc mode. My question was how difficult it was code wise and I think, we should only do the doc mode for now, but implement it in a way that we can add a different servlet implementation later.

This is just super weird running different services in the same container, I won't do that. If you want to keep old gstore, we need just to fork repo, or make a special branch, that is it.

Ok, so graph/read would need this as internal input to select the parser. SELECT ?g {GRAPH ?g {?s ?p ?o} } on virtuoso is probably not implemented currently.

we should have one single source of truth/data, not many, and so far it was git, not virtuoso, that is why we parse the document, and not query from virtuoso

manonthegithub commented 2 weeks ago

one problem which may occur in the future: when we get the same media type/extension, like json, but several ways in which rdf is stored in it, we won't be able to detect the right parser just by extension; this will need extra information about the parser

kurzum commented 2 weeks ago

I meant on POST and doc/save this is where content-type is/should be sent by the client. Posting "Content-type: text/turtle" to ?path=file.jsonld will throw an error then? We would need a list of file endings to content-type then.

the answer is the same as before. It will be ignored, I understood what you meant.

Please answer with enough detail. It still sounds like you will implement a connection reset/connection time out. But I am asking about the HTTP status code and what causes it, e.g. ".jsonld" in the URI will trigger the use of the JSON-LD parser, and if the body doesn't parse then "400 Bad Request", is that it?

Hm, really? the underlying functionality is the same. GIT and Virtuoso just accept data. I would implement it with two different servlet/scalatra implementations and different web.xml and swagger. Depending on which one you start you get pure graph or doc mode. My question was how difficult it was code wise and I think, we should only do the doc mode for now, but implement it in a way that we can add a different servlet implementation later.

This is just super weird running different services in the same container, I won't do that. If you want to keep old gstore, we need just to fork repo, or make a special branch, that is it.

we can make two docker containers. I totally don't care if this would be in different branches.

Ok, so graph/read would need this as internal input to select the parser. SELECT ?g {GRAPH ?g {?s ?p ?o} } on virtuoso is probably not implemented currently.

we should have one single source of truth/data, not many, and so far it was git, not virtuoso, that is why we parse the document, and not query from virtuoso

The main purpose of virtuoso is to query the graph data; it is hard for me to accept the docs being the only way we are allowed to get graph data. Also a) it should be consistent, not eventually consistent, b) editing is on the doc, so SSoT is not violated there. Even before, in pure graph mode, there were two synchronized SSoTs, which was the idea behind GSTORE.

kurzum commented 2 weeks ago

one problem which may occur in the future: when we get the same media type/extension, like json, but several ways in which rdf is stored in it, we won't be able to detect the right parser just by extension; this will need extra information about the parser

file endings are our convention anyhow, as there are no standard file endings, just media-types. so doing a list "file-ending" -> "parser" on each deployment would be enough.
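
Such a per-deployment list could be as small as the following Python sketch. The endings and parser names are illustrative assumptions, and `select_parser` is a made-up helper, not part of gstore.

```python
import os
from typing import Optional

# Hypothetical deployment config: file ending -> parser name.
PARSER_BY_ENDING = {
    ".jsonld": "JSON-LD",
    ".ttl": "Turtle",
    ".nt": "N-Triples",
    ".rdf": "RDF/XML",
}


def select_parser(path: str) -> Optional[str]:
    """Pick the parser from the file ending only; Content-Type plays no role."""
    ending = os.path.splitext(path)[1].lower()
    return PARSER_BY_ENDING.get(ending)  # None means: ending not configured


parser = select_parser("adrian1703/20news/dataid.jsonld")
```

What happens for an ending that is not in the list (reject, or store with an empty graph) is left open here, since the thread does not settle it.
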

manonthegithub commented 2 weeks ago

Please answer with enough detail. It still sounds like you will implement a connection reset/connection time out. But I am asking about the HTTP status code and what causes it, e.g. ".jsonld" in the URI will trigger the use of the JSON-LD parser, and if the body doesn't parse then "400 Bad Request", is that it?

Then formulate the question with enough detail: what exactly you want to know (e.g. status codes etc), mention everything. I am not reading your mind. Really annoying.

I don't know how is that not clear, just don't understand. I get really annoyed, as it looks like trolling.

Content-Type is ignored means it is not checked or used anywhere in the code. how is that not clear? The extension will be used for determining the parser. how is that not clear? Depending on the parser response and following process the status code will be determined. It is the same as it was before, nothing changes. What else do you need?

Please think a little bit with your own head before asking, or ask ChatGPT to explain my responses.

we can make two docker containers. I totally don't care if this would be in different branches.

It should be clear that I mean a servlet container, not docker containers. Again, please formulate the things you want precisely. You did not want to ask whether that is much work, you actually want it now. So you would just like to keep both of them. Then just mention that explicitly.

Here we can also tag last commit so far in gstore and that's it for now.

file endings are our convention anyhow, as there are no standard file endings, just media-types. so doing a list "file-ending" -> "parser" on each deployment would be enough.

every media type has a standard file ending, some of them have several. this is for the future; atm there is no problem with that. If we make it a convention then it's not a problem anymore

manonthegithub commented 2 weeks ago

Would it be easy to implement a config option for pure graph mode, doc mode? Maybe we don't need it right now.

@kurzum I thought about this option a bit more, and it actually also makes sense: if we keep two frontends but the same shared code base in the deeper logic, this also works. Now I don't know which solution is actually better, forking/tagging or keeping both together... Both are actually valid. I will do the main part first and then we may decide to have the second service as well, as an extra feature...

manonthegithub commented 1 week ago

41

During development I found out that I am using Scalatra in a somewhat unexpected way when uploading documents, so to support binary data and really large files (or big load), we will have to change the api again for /document/save. We will need to use multipart form data parameters (that is how Scalatra is supposed to work with files... I may investigate a bit more on that, but it looks like that). This won't be a major change in the code, but it will affect the api (using multipart instead of a classic post).
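
For clients, the difference is mainly in how the request body is encoded. A minimal Python sketch of a multipart/form-data body with one file part, using only the standard library; the form field name "file" is a guess, the actual name expected by /document/save is not stated here, and `encode_multipart` is a made-up helper.

```python
import uuid
from typing import Tuple


def encode_multipart(
    field_name: str,
    filename: str,
    content: bytes,
    content_type: str = "application/octet-stream",
) -> Tuple[bytes, str]:
    """Return (body, Content-Type header value) for a single-file multipart POST."""
    boundary = uuid.uuid4().hex
    parts = [
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"'.encode(),
        f"Content-Type: {content_type}".encode(),
        b"",       # empty line separates part headers from part content
        content,   # raw bytes, sent unmodified
        f"--{boundary}--".encode(),
        b"",
    ]
    return b"\r\n".join(parts), f"multipart/form-data; boundary={boundary}"


# "file" is an assumed field name.
body, header = encode_multipart("file", "dataid.jsonld", b'{"@id": "http://example.org/a"}', "application/ld+json")
```
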

manonthegithub commented 5 days ago

Should work now, can be tested. @holycrab13 @JonathanJustavino