Open xristy opened 5 years ago
Why does the user need to be able to save his edits without applying them?
Because they currently use a standalone desktop editing app that allows them to work on various cataloging tasks over a period of time while working on other tasks. They have the luxury of local storage on their machines that allows them to work over relatively extended periods. With a browser-based system we either rely on some sort of browser local storage or use the ES to manage the storage of task data which we want to capture in any event. This then provides the opportunity for users to resume their work on different machines at different times.
Browser-based local storage does not seem to be implemented consistently across the major browsers, so it doesn't seem applicable at this point.
We do want to eventually support disconnected operation, where the user is able to edit without being connected to the internet for extended periods of time, but that is much too ambitious for now.
- Is it a user request / a common use case that they have?
It's a common use case - see above.
- Isn't there a risk that the data they are applying their changes to would have changed before running their edits?
Not with the lock. Perhaps there needs to be more discussion in the doc; this will be spelled out in the API spec.
- wouldn't it be simpler for both the user and the technical team to run/apply the changes immediately?
Perhaps but that's not how the librarians currently work since some tasks take more than a short time. It's much like coding. You don't get it all done at once before committing/pushing changes and you often end up doing more work on the task in future commits. Deciding when to publish to the production public site may be after several sessions within a task.
Vocabulary: why not rename:
- "Task" to "Batch / Total Batch / Full Batch", as "task" brings an idea of a single action. Batch has the idea of multiple actions
- "Session" to "Session Batch", as "session" can be confusing as in web it also describes the connection session of the user
I'm not attached to that, it's just ideas to make sure the wording stays accurate
We decided yesterday to try for a neutral vocabulary w.r.t. rdf-delta or other possible underlying implementations.
I chose Task since when a librarian is doing editing it is in the context of some task, such as cataloging the latest works from Nepal and so on. Batch doesn't seem to capture it for me. For me, Task connotes a single temporally extended activity of one to many steps. TaskSession is plausible and in practice should correlate w/ a connection session for the user.
Locks
- is the lock created when a user starts to edit a resource?
As indicated in the use case document, EC is responsible for requesting a lock on a resource from ES when the user first attempts an editing operation. This is lazy or optimistic locking: waiting until the last moment to attempt to get the lock, which in the current system is exceedingly rarely an issue. EC could instead request a lock as soon as the user indicates that they want to view a resource, possibly for editing; this is what is done in the current system and would be pessimistic or eager locking.
when is it released?
- at the end of the session?
Definitely not! Locks must be retained for the life of a Task.
- after the batch of tasks has been run?
Yes. It is the responsibility of ES to clear all locks held for a Task once the Task is disposed of via run or via drop (which didn't get mentioned in the use case doc; users currently can check out resources and then release them).
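The lock lifecycle described above (acquired lazily on the user's first edit attempt, held for the life of the Task, cleared by ES on run or drop) could be sketched as follows. The class and method names here are hypothetical illustrations, not part of the actual ES API:

```python
class LockError(Exception):
    pass

class EditingService:
    """Minimal sketch of ES lock bookkeeping (hypothetical names)."""

    def __init__(self):
        self._locks = {}  # resource_id -> task_id holding the lock

    def acquire_lock(self, task_id, resource_id):
        # Lazy/optimistic locking: called on the user's first edit attempt.
        holder = self._locks.get(resource_id)
        if holder is not None and holder != task_id:
            raise LockError(f"{resource_id} is locked by task {holder}")
        self._locks[resource_id] = task_id

    def release_task_locks(self, task_id):
        # Called when the Task is disposed of via run or drop,
        # never at the end of a mere session.
        self._locks = {r: t for r, t in self._locks.items() if t != task_id}

es = EditingService()
es.acquire_lock("task1", "aWorkId")
try:
    es.acquire_lock("task2", "aWorkId")   # a second task is refused
except LockError as e:
    print(e)
es.release_task_locks("task1")            # run/drop clears all of task1's locks
es.acquire_lock("task2", "aWorkId")       # now succeeds
```

Note that a session ending does nothing here; only `release_task_locks` (run or drop) frees the resources.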
What does "a referred to resource" mean in "Create a resource and view a referred to resource"?
For example, the user is cataloging a work, aWorkId, and needs to refer to the author via the resource ID, aPersonId, for that person, and wants to also view the person resource; so "referred to resource" in this instance refers to the resource identified by aPersonId.
Often the librarian knows exactly which person resource they want to refer to and has no interest in viewing the resource; however, sometimes they do want to. Especially as we are supporting a quite sophisticated bibliographic model, with works that are expressions of other works and that have manifestations (in the English sense, not the French), I expect there will be many opportunities to view referred to resources.
Editing in several sessions > when ES receives the request the task is saved in a local storage area for future reference.
- does "local storage" refer to the local storage of the EC?
- if that's the case, there will be an issue if the user uses 2 different computers to do the edits
The local storage refers to storage on the system where the ES is running, not on the EC local machine. I used the adjective "local" to mean under the control of ES, as opposed to, for example, AWS cloud storage or the Fuseki db and so on.
The phrase could be reworded as:
when ES receives the request, then ES saves the task data in a storage area local to the ES, for future reference.
Running the task:
- it is mentioned that it should use POST in one part, and PUT in another. Which one is right?
POST means to create a new resource in a collection. The URL http://purl.bdrc.io/tasks can be thought of as a reference to the collection of tasks that are managed by ES, and POST http://purl.bdrc.io/tasks/taskID?save means to create the Task identified by taskID.
Whereas PUT means to update an item in a collection, so when EC saves a resumed Task it uses PUT http://purl.bdrc.io/tasks/taskID?save
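A minimal sketch of the convention just described: POST when the Task is first created, PUT when a resumed Task is saved again. The helper name is hypothetical; only the method/URL pairing comes from the discussion above:

```python
BASE = "http://purl.bdrc.io/tasks"

def save_task_request(task_id, already_exists):
    """Return the (method, url) pair EC would use to save a task.

    POST creates the Task in the /tasks collection the first time;
    PUT updates the already-existing Task on subsequent saves.
    """
    method = "PUT" if already_exists else "POST"
    return method, f"{BASE}/{task_id}?save"

print(save_task_request("task42", already_exists=False))  # first save -> POST
print(save_task_request("task42", already_exists=True))   # resumed save -> PUT
```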
- Why not use HTTP PATCH method instead?
I don't know. It may be that the PATCH type HTTP request is appropriate in some part of the editing service API.
- How will the server know in which order to apply multiple sessions? Are they all recorded in the same task file?
Essentially a Task is a sequence of Sessions, so ES will run a Task by starting at the beginning and processing step-by-step.
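Assuming each Session is recorded inside the task data as an ordered list of steps, running a Task could look like the following sketch (all names and the data shape are hypothetical):

```python
def run_task(task):
    """Apply a Task by replaying its Sessions in recorded order.

    `task` is assumed to be {"sessions": [[step, ...], ...]} where each
    step is a callable edit operation; ES just walks it front to back,
    so ordering is implicit in the stored sequence.
    """
    for session in task["sessions"]:   # sessions in the order they were saved
        for step in session:           # steps in the order they were recorded
            step()

log = []
task = {"sessions": [
    [lambda: log.append("A s1 p1 o1 g1"), lambda: log.append("D s2 p2 o2 g1")],
    [lambda: log.append("A s3 p3 o3 g2")],
]}
run_task(task)
print(log)  # steps replayed in the exact order recorded
```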
Re: the vocabulary (Task vs. Batch): that totally makes sense. I'm ok with whatever you think makes more sense.
Ok, great. Rewording it would indeed help, since "Local Storage" is also a concept in the front-end world, with a specific JavaScript API of that name; cf. the Web Storage API.
The use cases doc says:
EC then makes a GET http://purl.bdrc.io/graph/theID request for a json-ld serialization of the resource.
However,
1) There is no such endpoint available on ldspdi
2) Since the beginning of the project, we have had a single endpoint, http://purl.bdrc.io/resource/theID, delivering serializations for ALL the resources that make up our dataset. A single endpoint could be used because all resources were prefixed by bdr: (i.e. http://purl.bdrc.io/resource/)
3) With the refactoring of our model (so it conforms to rfc011), we have now several "categories" of resources, differentiated by different prefixes. So now, delivering serialization of all resources on the basis of their various URI means that we must have an endpoint corresponding to each resource prefix.
Ldspdi must obviously serve resources according to their URI, so I guess there's no other solution than implementing new endpoints. Do we have other resource prefixes than bda: and bdr:? (I hope I'm gonna be able to use some wildcards in the endpoint, as I do with the ontology service)
Yes, there's @prefix bdg: <http://purl.bdrc.io/graph/> . as discussed in rfc011
Is a named graph considered as being a rdf resource ? (or, can a named graph be the subject of a rdf triple? Let me know if my question is plain stupid...)
We do treat a named graph as a resource via adm:adminForGraph, with range adm:Graph (rdfs:subClassOf rdfs:Resource). This property appears in the root adm:AdminData and serves to record the graphURI from within the graph. In some contexts one might not have a handle on the graphURI, and we don't want to rely on the idea of replacing the namespace (or equivalently the prefix, bdr: or bda:) by the graph namespace, since that identifier convention may be changed in the future and we prefer a semantic approach to recording the graphURI.
We know the sparql construct to retrieve the graph given its name:

```sparql
construct { ?s ?p ?o . }
where {
  graph <someGraphUri> { ?s ?p ?o . }
}
```
and via fuConn.fetch(dsg, someGraphUri)
So there's a well-defined meaning for http://purl.bdrc.io/graph/someGraphID; of course, how you mechanize the graph endpoint is perhaps not as simple.
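The CONSTRUCT above amounts to selecting every quad whose graph component matches the requested graph URI. A pure-Python sketch of that semantics (no Jena/Fuseki involved; the data and names are made up for illustration):

```python
def fetch_graph(quads, graph_uri):
    """Return the triples of one named graph from a collection of quads,
    mirroring: construct {?s ?p ?o} where { graph <g> {?s ?p ?o} }"""
    return [(s, p, o) for (s, p, o, g) in quads if g == graph_uri]

quads = [
    ("bdr:W1", "rdf:type", "bdo:Work",   "bdg:W1"),
    ("bdr:W1", "bdo:creator", "bdr:P1",  "bdg:W1"),
    ("bdr:P1", "rdf:type", "bdo:Person", "bdg:P1"),
]
print(fetch_graph(quads, "bdg:W1"))   # only the triples of graph bdg:W1
```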
Thanks! We already use that query in https://github.com/buda-base/lds-queries/blob/master/public/Resgraph.arq (remember our "describe vs. complete graph" discussion) Meanwhile, I found a way to have a generic and single endpoint for serving these new prefixed resources.
Yes, I remember, but I don't know who else might be interested who wasn't in our discussion.
In the Use case doc, there is in the edit part: "ES receives the task and then ES saves the task data in a storage area local to ES, and then ES runs the task - which consists of updating the dataset on Fuseki and the appropriate local git repo and perhaps pushes to the public repo."
In the Use case doc, there is in the create part: "The user requests EC to setup up the editing UI for creating a new resource of a specified type. EC must create a new resource ID, theID - the current idea is that EC will generate a UUID or other hash and prefix it with the usual W, P, etc depending on the type."
In the Use case doc, there is in the edit part: "the user makes edits involving adding, deleting or updating existing information, and the EC records each action as a sequence of patch editing A(add) and D(delete) steps. Each A or D consists of a quad of "subject property object graph". In this case the graph is that for the resource being edited, http://purl.bdrc.io/graph/theID."
=> where can I find a definition of the usual prefixes?
Hmmm. They aren't collected anywhere. What is needed is a property on the ontology like:

```
:idPrefix : owl:Class ==> xsd:string
```

for example:

```
:Work :idPrefix "W" .
:Event :idPrefix "EV" .
```

and so on.
Why do we store the dataset in a git local / public repo ? Is it just for backup or is it for other reasons?
Backup, plus git provides a way to keep track of sessions: every http://edit.bdrc.io/tasks { save, run, drop } marks the end of a session and is saved as a version in git. This repo will retain every task unless there is some administrative reason to delete a task from the repo.
after generating the ID, the EC should probably check that this ID is not already used in the database
The idea of the UUID is that that won't be necessary. ES will check the top-level ids mentioned in the graph and create headers. There are lots of ids in a single graph.
However, it is likely best for each id to be minted on ES as @MarcAgate suggested at one point. Maybe ES can hand out blocks of UUIDs so there doesn't need to be a round-trip per id.
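The ID scheme described in the use case doc (a UUID prefixed with the usual W, P, etc. according to the resource type), together with the block-of-UUIDs idea, could be sketched like this. The type-to-prefix mapping and the function names are hypothetical, standing in for the proposed :idPrefix property:

```python
import uuid

# Hypothetical type -> prefix mapping (cf. the proposed :idPrefix property)
ID_PREFIX = {"Work": "W", "Person": "P", "Event": "EV"}

def mint_id(resource_type):
    """Mint a new resource ID: the usual type prefix plus a UUID,
    so collision checks against the database aren't needed."""
    return ID_PREFIX[resource_type] + uuid.uuid4().hex.upper()

def mint_block(resource_type, n):
    """Hand out a block of n IDs at once, so EC doesn't need a
    round-trip to ES for every single new resource."""
    return [mint_id(resource_type) for _ in range(n)]

print(mint_id("Work"))
print(mint_block("Person", 3))
```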
Why should we be using a quad, specifying the graph, knowing that the graph will be the same for all quads? Is it to stay aligned with RDF Delta?
It will be typical for a single task to involve several graphs, for example a Work, an Item, and a Person. So each A and D needs to indicate which graph is involved.
I think a good exercise would be to document all the functions we need in a swagger document. I've started one in https://github.com/buda-base/editserv/blob/master/buda-edit-api.yml; you can just copy-paste it in to see the endpoints and make sure you have the right syntax. I only defined one endpoint, which can list the tasks and create new ones. It's fairly easy and I think it would help the implementation.
The doc says:
Task - is a collection of edits, by a user, to one or more Resources, some of which may already exist on the public Fuseki, others of which are to be created
Let's say that a given task (T) creates two new resources (R1 and R2) and updates a third one (R3), the first two being for instance a Person and a Place, the third one a Work. In such a case, the server needs to know all the various resource types and their exact correspondence (i.e. a mapping R1/resType1, R2/resType2, etc...). The resource type is needed for getting or creating the GitRepo of each Resource in the task. What could be the best way to get that info? Should we integrate it into the patch itself through custom headers, or should we have a separate json node in the request body? I would favor a new custom header, in this vein:
`H mapping R1-typeA,R2-typeF, etc....`
Also, as a remark: it seems that the git repo update will require parsing the patch in order to extract all the commands (A or D) pertaining to a given resource (or graph).
To avoid such parsing, and in the case of an update, I thought of loading the (old) trig file from the git repo into a Model/graph, applying the patch to it, and then serializing the result as a trig file; but it seems like all the triples related to other resources in the patch would be added anyway... I am still investigating that part but any suggestion will be appreciated.
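A tiny sketch of parsing such a mapping header into a resourceId -> resourceType dict; the header syntax here is only the proposal above, not a settled format:

```python
def parse_mapping_header(header):
    """Parse 'H mapping R1-typeA,R2-typeF' into {'R1': 'typeA', ...}."""
    prefix = "H mapping "
    assert header.startswith(prefix)
    pairs = (item.strip().split("-", 1)
             for item in header[len(prefix):].split(",") if item.strip())
    return {res_id: res_type for res_id, res_type in pairs}

print(parse_mapping_header("H mapping R1-typeA,R2-typeF"))
# {'R1': 'typeA', 'R2': 'typeF'}
```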
For the final part of your question: it should be quite easy to split the patch into different parts, each related to one specific graph, then apply these parts to the various .trig files. Or is there a difficulty there?
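Splitting could look like the following sketch, which groups the A/D quad lines of a simplified patch by their graph component. The line format is only an approximation of an RDF-patch body, for illustration:

```python
def split_patch_by_graph(patch_lines):
    """Group A/D quad lines by their graph (the 4th term of each quad)."""
    by_graph = {}
    for line in patch_lines:
        op, s, p, o, g = line.split()      # "A s p o g" or "D s p o g"
        by_graph.setdefault(g, []).append((op, s, p, o))
    return by_graph

patch = [
    "A bdr:W1 bdo:title 'foo' bdg:W1",
    "D bdr:P1 bdo:name 'bar' bdg:P1",
    "A bdr:W1 bdo:creator bdr:P1 bdg:W1",
]
# each per-graph slice can then be applied to that graph's .trig file
for graph, ops in split_patch_by_graph(patch).items():
    print(graph, ops)
```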
The header approach seems reasonable.
As Élie suggests, it seems that loading any graphs to be updated from the git repo into separate named graphs in a DSG would be reasonable; then simply executing the A's and D's against the various named graphs, serializing each graph, and committing it into the repo makes sense to me also.
What Chris describes is quite different from Élie's proposal, which suggests splitting the patch. Thinking of it, Chris's solution is actually the equivalent of what we already do when applying the patch to the fuseki dataset:
1) we create in memory a dataset made of all the graphs being patched by the task
2) we apply the patch of this task to this dataset
3) we successively get each graph from the patched dataset and push them back individually to fuseki
As I understand it, Chris's solution is the same thing applied to the git repo: we would get each graph from the patched dataset (initially built from the trig files of the git repo) and then serialize them as individual trig files, committed back to the git repo. It seems to me a good solution, as it doesn't require any constraint on the patch layout nor imply any parsing of the patch itself, besides headers analysis (which is already built into the RDF Delta API).
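The three steps above can be sketched with a toy in-memory dataset (a dict of graph -> set of triples); real code would use Jena's DatasetGraph and the RDF Delta patch machinery instead of this hand-rolled stand-in:

```python
def apply_patch(dataset, patch):
    """Steps 1-3 in miniature: apply A/D quads to an in-memory dataset;
    each graph can then be serialized back individually (to fuseki or
    to a trig file in the git repo)."""
    for op, s, p, o, g in patch:
        graph = dataset.setdefault(g, set())
        if op == "A":
            graph.add((s, p, o))
        elif op == "D":
            graph.discard((s, p, o))
    return dataset

# step 1: dataset built from the trig files of the graphs being patched
dataset = {"bdg:W1": {("bdr:W1", "bdo:title", "old")}}
# step 2: apply the task's patch to the dataset
patch = [("D", "bdr:W1", "bdo:title", "old", "bdg:W1"),
         ("A", "bdr:W1", "bdo:title", "new", "bdg:W1"),
         ("A", "bdr:P1", "bdo:name", "bob", "bdg:P1")]
apply_patch(dataset, patch)
# step 3: each named graph is now ready to be written back individually
for g, triples in dataset.items():
    print(g, sorted(triples))
```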
This is in fact equivalent to an implementation of the run() method of the PatchService (https://github.com/buda-base/editserv/blob/master/src/main/java/io/bdrc/edit/service/PatchService.java#L118) using a git repo instead of Fuseki as a datasource.
This being said, the question of passing a map of ResId/resourceType is still open despite a possible solution using a mapping header. I'll go with that until we find a better solution, if needed.
This was originally posted by @codam at the end of issue #1. I moved to a new issue for clarity. This issue refers to Editing_Service_Use_Cases.
Trying to catch up with you folks, here are a few questions and remarks. Let me know if this issue topic is a good place for that!
Why does the user need to be able to save his edits without applying them?
Vocabulary: why not rename:
Locks
What does "a referred to resource" mean in "Create a resource and view a referred to resource"?
Editing in several sessions > when ES receives the request the task is saved in a local storage area for future reference.
Running the task:
Originally posted by @codam in https://github.com/buda-base/editserv/issues/1#issuecomment-494647531