alphagov / search-api

Search API for GOV.UK
https://docs.publishing.service.gov.uk/apps/search-api.html
MIT License
32 stars 9 forks

What's the canonical way to ETL an incremental pipeline of this data? #2839

Open wpfl-dbt opened 7 months ago

wpfl-dbt commented 7 months ago

I work at DBT and have been improving an ETL pipeline for gov.uk content we have based on parameters the department needs. I'd like to configure it so it ingests and overwrites data that's changed rather than ingesting everything over and over again.

My plan is:

From the other side of the API, is that a good plan?

bilbof commented 7 months ago

@wpfl-dbt I no longer work on GOV.UK search, but the public timestamp in the JSON and Atom feed responses is reliable for subscribing to changes to documents: documents are written to Search API almost immediately after they are published or updated.
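A minimal sketch of what incremental polling on that timestamp could look like. This assumes the Search API's `order`, `fields`, `count`/`start` and `filter_public_timestamp` query parameters (check the Search API docs for the exact parameter names your version supports):

```python
# Sketch: build a Search API query that returns only documents changed
# since the pipeline's last run, ordered oldest-first for paging.
# Parameter names are assumptions based on the Search API docs.
from urllib.parse import urlencode

SEARCH_URL = "https://www.gov.uk/api/search.json"

def build_incremental_query(since_iso: str, count: int = 100, start: int = 0) -> str:
    """Return a URL for documents with public_timestamp >= since_iso."""
    params = {
        "order": "public_timestamp",                  # oldest changes first
        "filter_public_timestamp": f"from:{since_iso}",
        "fields": "title,link,public_timestamp",      # keep payloads small
        "count": count,
        "start": start,
    }
    return f"{SEARCH_URL}?{urlencode(params)}"
```

A pipeline would persist the largest `public_timestamp` it has seen and pass it back as `since_iso` on the next run, paging with `start` until fewer than `count` results come back.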

If you share more about your goals for the pipeline, it'll be easier to give useful advice.

> the search API missed lots of documents our department published in related pages

Related pages: I guess you mean the related_links attribute on content items? These aren't generated by Search API. Since they're neither complete nor deterministic (they're a navigational aid for human users), I wouldn't use them for your purpose.

To find documents published by your department, you can either use the topic taxonomy or use the Search API's organisation filter to get a complete, machine-readable list.
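For the organisation-filter route, a query could look like the sketch below. The `filter_organisations` parameter takes the organisation's GOV.UK slug; the slug used in the usage note is an assumption, so check your department's actual slug on its GOV.UK organisation page:

```python
# Sketch: list documents published by one organisation via the Search
# API's organisation filter. The slug passed in is the organisation's
# GOV.UK slug (an assumption -- verify it for your department).
from urllib.parse import urlencode

SEARCH_URL = "https://www.gov.uk/api/search.json"

def org_filter_url(org_slug: str, count: int = 100, start: int = 0) -> str:
    """Return a URL filtering search results to a single organisation."""
    params = {
        "filter_organisations": org_slug,
        "fields": "title,link,public_timestamp",
        "count": count,
        "start": start,
    }
    return f"{SEARCH_URL}?{urlencode(params)}"
```

For example, `org_filter_url("department-for-business-and-trade")` (slug assumed) would page through everything the Search API attributes to that organisation.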

> I'll sometimes get JSONDecodeError for very new items, which makes me think I'm picking up drafts

Drafts aren't published to Search API, so this is probably something else. The next time you see this error, check the status code and raw body of the response.
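One way to capture that diagnostic information is to check the status code before decoding and to surface a snippet of the raw body on failure, rather than letting `JSONDecodeError` propagate. A minimal sketch:

```python
# Sketch: defensive response handling. Returns (parsed_json, None) on
# success, or (None, reason) so the pipeline can log whether the
# failure was an upstream 5xx, an HTML error page, truncation, etc.
import json

def parse_response(status_code: int, body: str):
    if status_code != 200:
        return None, f"HTTP {status_code}: {body[:200]!r}"
    try:
        return json.loads(body), None
    except json.JSONDecodeError as exc:
        return None, f"undecodable body ({exc}): {body[:200]!r}"
```

Logging the `reason` string for every failure would make it obvious whether the "very new items" are returning non-200 statuses or malformed bodies.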

wpfl-dbt commented 6 months ago

Thanks so much for coming back to this @bilbof, hugely appreciated! The assurance that we're okay to subscribe to those published/updated fields is perfect, thank you.

The aim of my pipeline is to produce a table of published text, plus some metadata for this content, for our analysts to use in ML models.

On the missing documents, I've put together an example.

The Net Zero Strategy is an example of content I'd want my pipeline to ingest. This API request, which is essentially what I'm using plus a query string, returns the publication, but it doesn't return all the links I'd need to send to the Content API to get the published text. If I didn't recurse on what comes back from this Content API request, I'd miss the text of individual documents like 1. Why Net Zero, 2. The journey to Net Zero, etc.

Am I missing something here? Is there a better way to do this?

wpfl-dbt commented 6 months ago

Also, on the missing documents: an example of the problem (at the time of writing) is the Specialist Investigations Alcohol Guidance manual from HMRC. The page lists and links to SIAG1000 Introduction: contents, a link that currently 404s for me on the frontend.

Similarly, in the Content API the manual page returns SIAG1000 in its child sections, yet the API call for that page also 404s. Looking through the API, I can't see a way to avoid this kind of error?
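Pending a better answer, one pragmatic option is to tolerate 404s on child sections rather than failing the run, recording the stale paths for later review. A sketch, where `fetch` is any hypothetical callable returning `(status_code, body)` for a path:

```python
# Sketch: fetch child sections, skipping stale links that 404 (like
# SIAG1000 above) instead of aborting the pipeline. `fetch` is a
# placeholder for whatever HTTP helper the pipeline uses.
def fetch_sections(section_paths, fetch):
    docs, missing = [], []
    for path in section_paths:
        status, body = fetch(path)
        if status == 404:
            missing.append(path)   # record stale link for later review
            continue
        docs.append(body)
    return docs, missing
```

The `missing` list could be logged or written to a dead-letter table, so stale manual sections are visible without blocking ingestion of the rest.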