eclipse-pass / main

Catch-all repository against which issues on general, cross-cutting topics are logged.
Apache License 2.0

Add dates to PASS model #150

Open birkland opened 3 years ago

birkland commented 3 years ago

Dates (created, last modified) are present in Fedora, but are not universally part of the PASS model, and therefore cannot be queried through Elasticsearch. Figure out how to make created and last modified dates queryable, so that it's possible to ask for "Funders added in the last month", for example.
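To make the "Funders added in the last month" example concrete, here is a minimal sketch of what such a query could look like once a date is in the index. It uses Elasticsearch's standard `bool`/`term`/`range` query clauses and date math (`now-1M/d`). The field names `@type` and `created` are assumptions for illustration; nothing in this issue fixes how the index would actually name them.

```python
import json


def recent_by_type_query(pass_type, window="1M", date_field="created"):
    """Build an Elasticsearch query body for entities of one type created recently.

    `date_field` and the `@type` field are hypothetical index field names.
    `window` uses Elasticsearch date-math units (e.g. "1M" = one month).
    """
    return {
        "query": {
            "bool": {
                "filter": [
                    # Restrict to one kind of PASS entity, e.g. "Funder".
                    {"term": {"@type": pass_type}},
                    # Keep documents whose date is within the window,
                    # rounded down to the start of the day.
                    {"range": {date_field: {"gte": f"now-{window}/d"}}},
                ]
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(recent_by_type_query("Funder"), indent=2))
```

The body would be sent to the index's `_search` endpoint; which index and URL that is depends on the deployment.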

Figure out if it makes sense to merely add dates to the index, or if the PASS model itself needs to be updated (making dates first-class citizens in the model).

Some considerations and nuances:

markpatton commented 3 years ago

Some background:

Options:

birkland commented 3 years ago

@markpatton Would you be able to weigh the pros and cons of the various approaches, and provide a recommendation?

markpatton commented 3 years ago

@birkland Will do.

emetsger commented 3 years ago

> Figure out if it makes sense to merely add dates to the index, or if the PASS model itself needs to be updated (making dates first-class citizens in the model).

I will add that, to me, it doesn't make sense not to add dates to the formal PASS model.

If I query ES for, say, the 10 most recent submissions, and their dates are not exposed in the model, I can't sort on, store, or range over the dates locally using the PASS model. The caller can iterate the results in document order (i.e. the order returned by ES), but can't manipulate that order or reason about the temporal range the results represent.
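As an illustration of the local operations described above, here is a small sketch of what becomes possible once a date is carried on the model object itself. The `Submission` class and its `created` attribute are hypothetical stand-ins, not the real PASS model classes.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass
class Submission:
    """Hypothetical stand-in for a PASS model object that exposes a date."""
    id: str
    created: Optional[datetime] = None


def most_recent(items: List[Submission], n: int) -> List[Submission]:
    """Re-sort results locally, newest first, ignoring undated objects."""
    dated = [s for s in items if s.created is not None]
    return sorted(dated, key=lambda s: s.created, reverse=True)[:n]


def time_span(items: List[Submission]) -> Optional[Tuple[datetime, datetime]]:
    """Return the (earliest, latest) dates covered by the results, if any."""
    dates = [s.created for s in items if s.created is not None]
    return (min(dates), max(dates)) if dates else None
```

Without the `created` attribute on the object, neither function can be written against the model; the caller is limited to whatever order the search returned.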

Ideally dates are not required to be set by the caller when creating or updating objects (i.e. they're auto-magic, populated on the server side), but if they are set by the caller, they're honored (not overwritten server-side). Obviously this falls on one end of the continuum of options, but if we're going to do it, let's do it.
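A rough sketch of the "populate server-side, but honor caller-supplied values" behavior described above. The function name, the field names `created` and `lastModified`, and the dict-based representation are all assumptions made for illustration; this is not how PASS or Fedora implements it.

```python
from datetime import datetime, timezone
from typing import Optional


def apply_system_dates(incoming: dict,
                       existing: Optional[dict] = None,
                       now: Optional[datetime] = None) -> dict:
    """Fill in date fields the caller left unset; keep any the caller set.

    incoming: the object as sent by the caller (create or update).
    existing: the stored object on update, or None on create.
    now:      injectable clock, mainly so the behavior can be tested.
    """
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    result = dict(incoming)  # do not mutate the caller's payload
    # On update, carry the original creation date forward by default.
    original_created = (existing or {}).get("created", stamp)
    result.setdefault("created", original_created)
    result.setdefault("lastModified", stamp)
    return result
```

One wrinkle this sketch does not resolve: a client that fetches an object and sends it back unchanged would resend the old `lastModified`, which would then be honored rather than refreshed.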

markpatton commented 3 years ago

Review of options with recommendation: https://docs.google.com/document/d/1lliGYBk4Luz2syNZHH1ScESCZHKtEmSjOcteayppiaQ/edit?usp=sharing

birkland commented 3 years ago

One of the complications here is that PASS doesn't really have a well-defined API. When the PASS data model proves insufficient, we can peer into the implementation details and look at Fedora's raw RDF. We can use Fedora's "inbound relationships" feature to effectively perform some kinds of queries that are difficult or impossible in Elasticsearch, etc.

With that context, the proposed solution of indexing Fedora's date fields effectively states that there probably ought to be read-only "system" date attributes in PASS's API that adopt the same semantics as Fedora's created and modified dates, and are entirely managed by the system. By not formally updating the model, indexing these fields ends up being a de facto partial implementation of read-only "system" dates as far as PASS's "API" is concerned.
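For contrast with caller-settable dates, here is a minimal sketch of what "read-only, entirely managed by the system" could mean at an API boundary: writes that try to set a system field are rejected with an explicit error. The field names and the error type are hypothetical.

```python
# Hypothetical names for the system-managed date attributes.
SYSTEM_FIELDS = frozenset({"created", "lastModified"})


class ReadOnlyFieldError(ValueError):
    """Raised when a client payload tries to set a system-managed field."""


def check_client_payload(payload: dict) -> dict:
    """Return the payload unchanged, or raise if it sets a system field."""
    offending = sorted(SYSTEM_FIELDS.intersection(payload))
    if offending:
        raise ReadOnlyFieldError(
            "system-managed fields cannot be set by clients: "
            + ", ".join(offending)
        )
    return payload
```

An alternative under the same assumptions would be to silently drop the offending fields instead of raising; the error makes the read-only contract visible to clients.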

If we think of the door being open to completing this work some day as part of forming a well-defined PASS API, then we can imagine eventually deciding to implement option (1) and adding these date properties to the PASS model for real. In that light, we probably ought to view (2) as having the potential to lead to (1) down the road, and design the field/representation/doc accordingly. Since the proposed date fields are read-only "system" fields, I think we'd need some way to set them apart.

@markpatton Do you agree with this perspective, and the need to somehow distinguish these "system" fields from the others? If so, can you propose (maybe update the doc) a specific way of doing that? Off the top of my head, I can think of

The challenge: While indexed docs aren't formally JSON-LD, the PASS model representation is. So whatever we do would need to make sense as JSON-LD.

markpatton commented 3 years ago

@birkland I'm not sure they really need to be distinguished in the JSON-LD representation. Documentation and appropriate errors from APIs may be enough. Having a separate system object is a significant increase in complexity for clients, but certainly could be done. If we want to distinguish system properties in the JSON-LD, another option is to have a different namespace for them. Then, depending on how we define the context, we could have our canonical representation be system:createDate or createDate, etc.
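The namespace option can be sketched with two alternative JSON-LD contexts: one that only declares a `system` prefix (so documents write `system:createDate`), and one that additionally aliases the bare term `createDate` to the same IRI. The namespace URL below is a placeholder, and `expand_term` is a deliberately tiny illustration of prefix and term expansion, not a JSON-LD processor.

```python
# Placeholder namespace; the real IRI would be chosen by the PASS project.
SYSTEM_NS = "http://example.org/pass/system/"

# Option A: documents use the compact IRI "system:createDate".
prefixed_context = {"system": SYSTEM_NS}

# Option B: documents use the bare term "createDate", mapped to the same IRI.
aliased_context = {
    "system": SYSTEM_NS,
    "createDate": {"@id": "system:createDate"},
}


def expand_term(term: str, context: dict) -> str:
    """Expand a term or compact IRI against a simple context (sketch only)."""
    definition = context.get(term)
    if isinstance(definition, dict):
        term = definition["@id"]      # term definition with an explicit @id
    elif isinstance(definition, str):
        term = definition             # term mapped directly to an IRI
    prefix, sep, suffix = term.partition(":")
    if sep and isinstance(context.get(prefix), str):
        return context[prefix] + suffix  # compact IRI: prefix + suffix
    return term


if __name__ == "__main__":
    print(expand_term("system:createDate", prefixed_context))
    print(expand_term("createDate", aliased_context))
```

Under these assumptions both spellings expand to the same property IRI, so the choice between them affects only the canonical compacted form that clients see.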