Swirrl / ook

Structural search engine
https://search-prototype.gss-data.org.uk/
Eclipse Public License 1.0
6 stars 0 forks source link

ETL use graph modified times to only re-index datasets that have changed #115

Closed callum-oakley closed 1 year ago

callum-oakley commented 2 years ago

Changes the ETL pipeline so that only observations for datasets that have changed are re-indexed. Achieved by:

Other things that had to change to make this happen:

Open questions/gotchas:

closes #17

Robsteranium commented 2 years ago

Thanks @callum-oakley this looks good.

Is it worth my trying to test this myself? It looks like you've already done this well with the new cassettes.

Having more than 10k graphs.

This isn't inconceivable but I think it's a reasonable limit for the time-being. We've done something similar in other places.

pmdcat:graph

resources/etl/observation-construct.sparql

is this an abuse of pmdcat?

pmdcat:graph connects a catalog entry with the contents graph(s) which isn't what we're doing here. We need to connect a resource with it's graph.

I know it's only self-contained to ook but I think it's confusing to use this here. It'll be problematic if we ever want to serve RDF from ook (e.g. for an observations API).

I don't think something exists for this. RDF Reification doesn't appear to provide something for this purpose. sd:graph is also inappropriate. Indeed I can't see anything obvious on linked open vocabularies.

I think we might want to coin ook:graph. I've created an issue for this #116.

Replacing URL with connection

I can see the attraction of DRYing this up but I feel like this makes it harder to call functions at the REPL.

I can type the URL from memory but the connection needs a function call that depends on another namespace i.e. (esr/connect "http://localhost:9200" {:content-type :json}).

This is probably just a question of personal preference! I'm happy to leave this as is but I'd be curious to learn how you worked with it...

Drafter versions

I've created #118 for this.

Clean-up

We need to revert :drafter/endpoint-url in resources/idp-beta.edn.

Perhaps we should rename the :ook.index/data ig-key now that it's defined in ook.etl.

ook.etl.observation-pipeline shouldn't hardcode gss-data.org.uk, I'd pop this in the integrant config in e.g. ook/resources/idp-beta.edn and ook/resources/cogs-staging.edn (i.e. it's an attribute of the data source).

Deployment

Is this robust to not having modified times? What happens if the target drafter doesn't support it? Does drafter have a data migration or similar to bootstrap modified graph times for graphs created before the feature was live?

What do we need to do to coordinate deployment with drafter?

callum-oakley commented 2 years ago

Sorry it's taken so long to get back to this @Robsteranium

Is it worth my trying to test this myself?

I'd like to test this (and re-generate cassettes) once more on the latest data after idp-beta-drafter is updated to a drafter with modified times, since the data will probably have drifted somewhat, but otherwise I'm fairly confident with the tests as they are, so your call. I just rebased on main and I think I've resolved everything else you mentioned, except:

This is probably just a question of personal preference! I'm happy to leave this as is but I'd be curious to learn how you worked with it...

At the repl I was using dev/with-system, but I'd be happy to revert the connection change if you'd rather keep it as it is, I don't want to break your existing workflow! My thought process was less about repeating ourselves in the code and more about repeating work that we do on every request by moving it to happen once at init... but as it happens connect doesn't actually do much (you'd be forgiven for thinking it connects to ES... but it doesn't), so I don't mind much either way.

ook.etl.observation-pipeline shouldn't hardcode gss-data.org.uk

This is hardcoded in all the JSON-LD frames, and in a couple of other places in the source. Wouldn't be a big job to pull it out in to config, but maybe one for a quick follow up PR?

Is this robust to not having modified times? What happens if the target drafter doesn't support it? Does drafter have a data migration or similar to bootstrap modified graph times for graphs created before the feature was live?

No. No observations will be loaded, because we only load observations for graphs with new or changed modified times. Yes.

What do we need to do to coordinate deployment with drafter?

No coordination really, except that drafter needs to be updated first. Once drafter is on a new enough version, and the migration run (as part of the deployment) to backfill modified times, then we can deploy new OOK and run the ETL. We don't need any kind of migration on ES since the ETL will create the new indices the first time it's run. I think Ric is hoping to have new drafter running on idp-beta-drafter pretty soon 🤞

Robsteranium commented 2 years ago

That's great, thanks @callum-oakley.

Don't worry about the delay, it sounds like we need to get the drafter change rolled-out first anyway.

Let's just stick with dev/with-system - it doesn't seem worth faffing about this!

You're right about the hard-coding in the jsonld and I'd not thought about that. Let's leave that to another PR. The good thing is that these references will be easy to find!


In terms of getting this deployed @RicSwirrl, I suggest:

  1. we wait until we've got the new drafter running somewhere
  2. we provision a new ook from this PR and trigger the ETL
  3. we do some drafting/ encourage ONS to do so
  4. we re-run the ETL and confirm it's a) up-to-date and b) doesn't take all day
  5. we merge the PR and put the new version of ook on search-prototype.gss-data.org.uk

If you could ping me when 1 is ready I'll take it from there.

Robsteranium commented 2 years ago

We might want to resolve #120 as part of this PR.

Robsteranium commented 1 year ago

I've confirmed that this works with additions but haven't be able to confirm with a deletion yet (pending examples from ONS). I suggest we also aim to resolve #120 before merging.

Robsteranium commented 1 year ago

I can confirm that deletes work. I'm going to merge this now as it's been open for 6 months! We can resolve #120 separately.