Swirrl / ook

Structural search engine
https://search-prototype.gss-data.org.uk/
Eclipse Public License 1.0

Paged ETL #13

Closed: Robsteranium closed this 3 years ago

Robsteranium commented 3 years ago

Introduces batching (pages of resources) and scoping (to a set of dataset-uris) to the ETL pipeline.

Robsteranium commented 3 years ago

This doesn't work for the complete ETL of all observations yet, but ought to fulfil #5. The list of target datasets is TBC; I just picked the first 3 trade ones for now, although they aren't quite ready for this because the :dim qb:codeList :cl statements are missing (see the ONS Slack for more).

Robsteranium commented 3 years ago

I've batched the process so we can now load everything without an OOME.
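
For anyone skimming, here is a minimal sketch of the general idea of paging plus scoping. It is not the code in this PR: `load-page!`, `paged-etl`, `scope-to-datasets`, the `:dataset` key and the example.org URIs are all hypothetical names made up for illustration, and the real pipeline talks to Drafter and Elasticsearch rather than counting maps.

```clojure
;; Hypothetical stand-in for extracting, transforming and loading one page.
;; Here it only reports how many resources the page contained.
(defn load-page!
  [page]
  (count page))

(defn scope-to-datasets
  "Keeps only the resources whose :dataset is one of `dataset-uris`.
  The :dataset key is an assumption made for this sketch."
  [dataset-uris resources]
  (filter (comp (set dataset-uris) :dataset) resources))

(defn paged-etl
  "Feeds `resources` to `load-fn` in pages of at most `page-size` and
  returns the total number loaded. partition-all is lazy, so pages are
  built one at a time as reduce consumes them."
  [load-fn page-size resources]
  (reduce (fn [total page] (+ total (load-fn page)))
          0
          (partition-all page-size resources)))

;; Ten made-up resources, alternating between two made-up datasets.
(def resources
  (for [i (range 10)]
    {:uri     (str "http://example.org/obs/" i)
     :dataset (if (even? i)
                "http://example.org/data/a"
                "http://example.org/data/b")}))

;; Scope to dataset "a" (5 resources), then load in pages of 3 (3 + 2).
(paged-etl load-page! 3
           (scope-to-datasets ["http://example.org/data/a"] resources))
;; => 5
```

Memory use in a scheme like this is bounded by the page size rather than the total number of resources, provided the input sequence is itself lazy and nothing else holds on to its head.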

It will be slow for the whole of the staging database (I estimate about 8 hours). I'd like to explore speeding this up a bit (#16).

This PR includes a configuration for loading only the published trade datasets:

  (require 'ook.concerns.integrant)

  (def result
    (ook.concerns.integrant/exec-config
     {:profiles ["drafter-client.edn"
                 "cogs-staging.edn"
                 "elasticsearch-development.edn"
                 "trade-data.edn"]}))

Running it currently produces the following indices:

$ curl -X GET "localhost:9200/_cat/indices?v=true"

health status index       uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   component   ac1J7krFSSyk1fC2OIocTw   1   1         53            0       23kb           23kb
yellow open   code        YYok7EsJTZuhD5ssdVxajw   1   1        105            0       54kb           54kb
yellow open   observation eTcUw-GLQW-2oELVfkC-aQ   1   1     445529            0     67.3mb         67.3mb
yellow open   dataset     msEOa0MLTNGqcjqCB0osSQ   1   1         10            0     40.6kb         40.6kb

That should hopefully be enough for the time being.