dataverbinders / nl-open-data

A flexible Python ETL toolkit for data warehousing, based on Dask, Prefect and the pydata stack
https://dkapitan.github.io/nl-open-data
MIT License

Optimize performance VM: try Zonal SSD PD #84

Open dkapitan opened 3 years ago

dkapitan commented 3 years ago

Best performance to date on the VM: 4.4k records per second ingested when fetching from URLs. Best performance on DK's laptop: at least a factor of 3 higher than that.

Question: what's the bottleneck?

Hypothesis: the job is IO-bound on writing to disk; the laptop has an SSD. See the GCE performance chart here.

Idea to test: upgrade 100 GB balanced persistent disk to SSD.
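Before changing the disk, the IO-bound hypothesis could be checked by measuring sustained sequential write throughput on both the VM and the laptop. The snippet below is only a minimal sketch: `measure_write_throughput` is a hypothetical helper (not part of nl-open-data), the default sizes are arbitrary, and a single sequential write with one final `fsync` may not match the write pattern of the actual ingest jobs.

```python
import os
import tempfile
import time


def measure_write_throughput(directory, total_mb=256, block_kb=1024):
    """Return sequential write throughput in MB/s for files in `directory`.

    Hypothetical helper for a rough comparison between machines;
    it is not a replacement for a proper disk benchmark.
    """
    # Random bytes, so the data cannot be trivially compressed on the way down.
    block = os.urandom(block_kb * 1024)
    n_blocks = (total_mb * 1024) // block_kb

    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(n_blocks):
                f.write(block)
            f.flush()
            # Force the data out of the OS page cache and onto the disk,
            # otherwise we would mostly measure memory speed.
            os.fsync(f.fileno())
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)

    written_mb = n_blocks * block_kb / 1024
    return written_mb / elapsed
```

Calling `measure_write_throughput("/path/on/the/disk", total_mb=2048)` on each machine gives a number that can be compared with the published throughput figures. If the VM's result is far above what the ingest actually writes per second, disk writes are probably not the bottleneck.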

Compared to the VM compute cost (182 euro per month), the extra disk cost is negligible. The question is whether this will actually be faster: write IO for Zonal Balanced PD is the same as for Zonal SSD PD.

On the other hand, I think the KPI to look at is sustained throughput, which is also influenced by disk size.

But I think it is worth a try. If jobs run faster, this will actually save money because compute is the most expensive part.

Proposal: let's try Zonal SSD PD with 256 GB, which is 40 euros a month (compared to 15.88). This should have a throughput of 122 MB/s (compared to 15 MB/s for standard PD; I can't find the throughput for balanced PD).
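As a rough check on the "faster jobs save money" argument, the figures above give a break-even point. This is a back-of-the-envelope sketch under two assumptions that are not stated in the thread: that 15.88 euro is the monthly cost of the current disk, and that the 182 euro compute cost scales linearly with job runtime (i.e. the VM only runs, and is only billed, while jobs run). `break_even_runtime_reduction` is a hypothetical name used for illustration.

```python
def break_even_runtime_reduction(compute_cost, current_disk_cost, new_disk_cost):
    """Fraction by which job runtime must drop for the disk upgrade to pay for itself.

    Assumes compute cost is proportional to runtime, so a runtime reduction
    of fraction s saves s * compute_cost per month.
    """
    extra_disk_cost = new_disk_cost - current_disk_cost
    return extra_disk_cost / compute_cost


# Monthly figures in euro, taken from the comments above.
fraction = break_even_runtime_reduction(
    compute_cost=182.0,
    current_disk_cost=15.88,  # assumed to be the current disk
    new_disk_cost=40.0,       # Zonal SSD PD, 256 GB
)
print(f"Break-even runtime reduction: {fraction:.1%}")  # prints 13.3%
```

Under these assumptions, the upgrade pays for itself if jobs finish roughly 13% sooner. If the VM runs continuously regardless of job duration, the saving does not materialize and the upgrade is a pure cost increase of about 24 euro per month.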

galamit86 commented 3 years ago

I'm having trouble with quotas, this time for storage space. The max allowed here is 250 GB per region. I currently have 2 disks of 100 GB each (one for the preemptible and one for the regular machine).

I already had a call with the Google representative, who told me that to increase these quotas, the easiest way would be to go through a Google Partner the first time, and afterwards it should be easier. Can we use CTS for this?

dkapitan commented 3 years ago

Yes, CTS is a certified partner. Would be good if you could initiate this, that way we can get our DevOps in order, too.