Closed callum-oakley closed 1 year ago
Thanks @callum-oakley this looks good.
Is it worth my trying to test this myself? It looks like you've already done this well with the new cassettes.
This isn't inconceivable but I think it's a reasonable limit for the time-being. We've done something similar in other places.
resources/etl/observation-construct.sparql
is this an abuse of pmdcat?
pmdcat:graph
connects a catalog entry with the contents graph(s) which isn't what we're doing here. We need to connect a resource with it's graph.
I know it's only self-contained to ook but I think it's confusing to use this here. It'll be problematic if we ever want to serve RDF from ook (e.g. for an observations API).
I don't think something exists for this. RDF Reification doesn't appear to provide something for this purpose. sd:graph is also inappropriate. Indeed I can't see anything obvious on linked open vocabularies.
I think we might want to coin ook:graph
. I've created an issue for this #116.
I can see the attraction of DRYing this up but I feel like this makes it harder to call functions at the REPL.
I can type the URL from memory but the connection needs a function call that depends on another namespace i.e. (esr/connect "http://localhost:9200" {:content-type :json})
.
This is probably just a question of personal preference! I'm happy to leave this as is but I'd be curious to learn how you worked with it...
I've created #118 for this.
We need to revert :drafter/endpoint-url
in resources/idp-beta.edn
.
Perhaps we should rename the :ook.index/data
ig-key now that it's defined in ook.etl
.
ook.etl.observation-pipeline
shouldn't hardcode gss-data.org.uk
, I'd pop this in the integrant config in e.g. ook/resources/idp-beta.edn
and ook/resources/cogs-staging.edn
(i.e. it's an attribute of the data source).
Is this robust to not having modified times? What happens if the target drafter doesn't support it? Does drafter have a data migration or similar to bootstrap modified graph times for graphs created before the feature was live?
What do we need to do to coordinate deployment with drafter?
Sorry it's taken so long to get back to this @Robsteranium
Is it worth my trying to test this myself?
I'd like to test this (and re-generate cassettes) once more on the latest data after idp-beta-drafter
is updated to a drafter with modified times, since the data will probably have drifted somewhat, but otherwise I'm fairly confident with the tests as they are, so your call. I just rebased on main and I think I've resolved everything else you mentioned, except:
This is probably just a question of personal preference! I'm happy to leave this as is but I'd be curious to learn how you worked with it...
At the repl I was using dev/with-system
, but I'd be happy to revert the connection change if you'd rather keep it as it is, I don't want to break your existing workflow! My thought process was less about repeating ourselves in the code and more about repeating work that we do on every request by moving it to happen once at init... but as it happens connect
doesn't actually do much (you'd be forgiven for thinking it connects to ES... but it doesn't), so I don't mind much either way.
ook.etl.observation-pipeline shouldn't hardcode gss-data.org.uk
This is hardcoded in all the JSON-LD frames, and in a couple of other places in the source. Wouldn't be a big job to pull it out in to config, but maybe one for a quick follow up PR?
Is this robust to not having modified times? What happens if the target drafter doesn't support it? Does drafter have a data migration or similar to bootstrap modified graph times for graphs created before the feature was live?
No. No observations will be loaded, because we only load observations for graphs with new or changed modified times. Yes.
What do we need to do to coordinate deployment with drafter?
No coordination really, except that drafter needs to be updated first. Once drafter is on a new enough version, and the migration run (as part of the deployment) to backfill modified times, then we can deploy new OOK and run the ETL. We don't need any kind of migration on ES since the ETL will create the new indices the first time it's run. I think Ric is hoping to have new drafter running on idp-beta-drafter pretty soon 🤞
That's great, thanks @callum-oakley.
Don't worry about the delay, it sounds like we need to get the drafter change rolled-out first anyway.
Let's just stick with dev/with-system
- it doesn't seem worth faffing about this!
You're right about the hard-coding in the jsonld and I'd not thought about that. Let's leave that to another PR. The good thing is that these references will be easy to find!
In terms of getting this deployed @RicSwirrl, I suggest:
If you could ping me when 1 is ready I'll take it from there.
We might want to resolve #120 as part of this PR.
I've confirmed that this works with additions but haven't be able to confirm with a deletion yet (pending examples from ONS). I suggest we also aim to resolve #120 before merging.
I can confirm that deletes work. I'm going to merge this now as it's been open for 6 months! We can resolve #120 separately.
Changes the ETL pipeline so that only observations for datasets that have changed are re-indexed. Achieved by:
Other things that had to change to make this happen:
:ook.concerns.elastic/conn
instead of:ook.concerns.elastic/endpoint
because all we ever did with the endpoint was wrap it in a connectionidp-beta.edn
is pointing athttps://drafter-bump-drafter.publishmydata.com
, which is running a version of drafter with the modified times graph. Some of the tests have changed because the data is slightly different, but I updated the tests againstmain
, so I'm confident they are correct (bugs inmain
notwithstanding)Open questions/gotchas:
graph->modified
reads the entire graph index, but assumes it contains at most 10000 documents (since that's the size limit on the query). I think this is a reasonable assumption for the time-being? But if not we'll have to do some paging or store thegraph->modified
map elsewhere (just writing it to disk as a blob of JSON or EDN would probably be sufficient)closes #17