Closed Robsteranium closed 3 years ago
This doesn't work for the complete ETL of all observations yet, but ought to fulfil #5. The list of target datasets is tbc; I just picked the first 3 trade ones for now (although they aren't quite ready for this, as the :dim qb:codeList :cl statements are missing; see the ONS Slack for more).
I've batched the process so we can now load everything without an OOME.
It will be slow for the whole of the staging database (I estimate about 8 hours). I'd like to explore speeding this up a bit (#16).
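To illustrate why batching avoids the OOME, here is a minimal sketch of the idea rather than the code in this PR. The names `load-in-batches`, `fetch-page` and `index-page!` are hypothetical stand-ins for the extract and load steps; the point is only that one page of resources is held in memory at a time.

```clojure
(defn load-in-batches
  "Processes `uris` one page of `page-size` at a time, so that only a single
  page of documents is realised in memory at once. `fetch-page` and
  `index-page!` are hypothetical stand-ins for the extract and load steps.
  Returns the total number of documents handed to `index-page!`."
  [fetch-page index-page! page-size uris]
  (reduce (fn [total page]
            (let [docs (fetch-page page)]   ; extract one page only
              (index-page! docs)            ; load it before moving on
              (+ total (count docs))))
          0
          (partition-all page-size uris)))  ; lazy seq of pages

;; Usage with in-memory stand-ins for the real extract/load steps:
(def indexed (atom []))

(load-in-batches (fn [page] (map #(hash-map :id %) page))
                 #(swap! indexed conj (vec %))
                 2
                 ["a" "b" "c"])
```

With a page size of 2, the three URIs above are processed as two pages. A larger page size trades memory for fewer round trips, which is one of the knobs that could be explored for #16.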
This PR includes a configuration for loading only the published trade datasets:
(require 'ook.concerns.integrant)

(def result
  (ook.concerns.integrant/exec-config
   {:profiles ["drafter-client.edn"
               "cogs-staging.edn"
               "elasticsearch-development.edn"
               "trade-data.edn"]}))
This currently gives you this:
$ curl -X GET "localhost:9200/_cat/indices?v=true"
health status index       uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   component   ac1J7krFSSyk1fC2OIocTw   1   1         53            0       23kb           23kb
yellow open   code        YYok7EsJTZuhD5ssdVxajw   1   1        105            0       54kb           54kb
yellow open   observation eTcUw-GLQW-2oELVfkC-aQ   1   1     445529            0     67.3mb         67.3mb
yellow open   dataset     msEOa0MLTNGqcjqCB0osSQ   1   1         10            0     40.6kb         40.6kb
Which should hopefully be enough for the time being.
Introduces batching (pages of resources) and scoping (to a set of dataset-uris) to the ETL pipeline
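One plausible way to scope a query to a set of dataset URIs is a SPARQL VALUES clause. The sketch below is an assumption for illustration, not necessarily how this PR does it; `values-clause` is a hypothetical helper and the example.org URIs are placeholders.

```clojure
(require '[clojure.string :as str])

(defn values-clause
  "Builds a SPARQL VALUES clause binding ?`var-name` to each URI in `uris`.
  Hypothetical helper: assumes the URIs are trusted and already valid IRIs."
  [var-name uris]
  (str "VALUES ?" var-name " { "
       (str/join " " (map #(str "<" % ">") uris))
       " }"))

(values-clause "dataset" ["http://example.org/a" "http://example.org/b"])
;; => "VALUES ?dataset { <http://example.org/a> <http://example.org/b> }"
```

Splicing such a clause into each extraction query would restrict every page of results to the chosen datasets, which combines naturally with the paging described above.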