Swirrl / ook

Structural search engine
https://search-prototype.gss-data.org.uk/
Eclipse Public License 1.0

Make ETL pipeline faster #16

Open Robsteranium opened 3 years ago

Robsteranium commented 3 years ago

I reckon it'll take about 8 hours to process the whole staging database.

The trade data only takes 10 minutes at the moment, which is manageable. That data is being actively worked on, with lots more publications due, so we should expect it to take much longer in the coming weeks.

If we can keep it to within an hour, it'll be palatable.

An alternative would be to process only those datasets that change (so only the initial import is slow) as per #16.
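The change-only approach could look something like the following. This is a minimal sketch in Python (the project itself is Clojure), and the function and parameter names are illustrative, not from the ook codebase:

```python
# Hypothetical sketch: process only datasets whose last-modified timestamp
# has advanced since the previous ETL run, so only the initial import
# scans everything.
def changed_datasets(datasets, last_run):
    """Return the ids of datasets modified since the previous run.

    `datasets` maps a dataset id to its current last-modified timestamp;
    `last_run` maps a dataset id to the timestamp recorded on the
    previous run. A dataset absent from `last_run` is treated as new.
    """
    return [ds for ds, modified in sorted(datasets.items())
            if modified > last_run.get(ds, float("-inf"))]
```

On the initial import `last_run` is empty, so everything is selected; subsequent runs touch only new or updated datasets.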

Robsteranium commented 3 years ago

I've looked at parallelising this. We can run the ETL steps in parallel relatively easily. Even with only one concurrent extraction step, though, this led to problems with Stardog as GC pressure built up too much; in other words, having a break between queries while the T/L steps ran gave Stardog a chance to recover.
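The arrangement described above can be sketched as a small producer/consumer pipeline: extraction runs on its own thread, one query at a time, and a bounded queue forces it to pause while each page is transformed and loaded. This is an illustrative Python sketch, not the ook pipeline's actual API:

```python
import queue
import threading

def run_pipeline(pages, transform, load):
    """Run extraction on one thread while transform/load drain a queue.

    Extraction stays single-threaded (one query against the store at a
    time), and the bounded queue means the extractor blocks while each
    page is transformed and loaded, giving the database a breather
    between queries. All names here are illustrative.
    """
    q = queue.Queue(maxsize=2)  # small buffer: extraction can't race ahead

    def extract():
        for page in pages:      # each page = one extraction query's results
            q.put(page)         # blocks while the queue is full
        q.put(None)             # sentinel: no more pages

    t = threading.Thread(target=extract)
    t.start()
    results = []
    while (page := q.get()) is not None:
        results.append(load(transform(page)))
    t.join()
    return results
```

The small `maxsize` is what creates the natural break between extraction queries that the comment above describes.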

This might be different now that we're paging observations by graph instead of using limit/offset.
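For illustration, the difference between the two paging strategies can be sketched as follows. The query text is schematic SPARQL built in Python, not the exact queries ook issues:

```python
def offset_queries(total, page_size):
    """LIMIT/OFFSET paging: each page re-evaluates the whole solution
    sequence, so later pages get progressively slower as OFFSET grows."""
    return [
        f"SELECT ?obs WHERE {{ ?obs a qb:Observation }} "
        f"LIMIT {page_size} OFFSET {off}"
        for off in range(0, total, page_size)
    ]

def graph_queries(graphs):
    """Per-graph paging: each query is scoped to one named graph, so the
    work per query is bounded by that graph's size, independent of how
    many pages were fetched before it."""
    return [
        f"SELECT ?obs WHERE {{ GRAPH <{g}> {{ ?obs a qb:Observation }} }}"
        for g in graphs
    ]
```

Per-graph paging also keeps each query's working set small, which is presumably gentler on Stardog's GC than deep offsets over the full observation set.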