
2020-04-07-azure-blob-block-avro-storage.md: implementation questions #12

Closed moredatapls closed 3 years ago

moredatapls commented 3 years ago

Hi, I just read your article and I found it super interesting. Thank you for sharing your work! At my job, we have pretty much the same requirements: large amounts of sensor data, upserts, and cost-effectiveness. I am tempted to try your approach, but I have two questions.

Why did you decide to split the .avro files by fixed time intervals (2020-01-01T13:30:00--2020-01-01T13:40:00), rather than by sensor id and a (much longer) fixed time interval (e.g. sensor-1_2020-01-01T00:00:00--2020-01-31T00:00:00)? Wouldn't this make reading much easier, assuming that you mostly want to read only a single sensor's time series? Or would this not work so well because the updates per file are rather small, which would make the upserts inefficient?

And another question: how did you implement the upserts efficiently? Were you, by any chance, able to implement this in Apache Spark? That's what we're looking at: we would need to feed the query store from a data lake, and Spark would be our preferred way to go.

Thank you!

GeeWee commented 3 years ago

You're welcome! I'm glad you like it - it's always nice to have people read what you've written!

File format

So the reason we've split the files into that format is primarily how we need to transport the data before it ends up in Azure. To give you some context, I work at SCADA MINDS, which does consultancy for the wind industry (so it seems like we're both in the business of renewables :)).

We get a bunch of sensor readings from turbines, which we then chunk into ten-minute files and transfer into the cloud. We have to chunk them because we need the flexibility that smaller files give us (I can elaborate if you need).
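To make the bucketing concrete, here's a hypothetical helper (not our actual pipeline code) that computes the blob name for a given reading; the ten-minute default and the `--`-separated interval naming match the example in the article, while the `.avro` suffix is just an assumption for illustration:

```python
from datetime import datetime, timedelta, timezone

def interval_blob_name(reading_time: datetime, minutes: int = 10) -> str:
    # Floor the reading timestamp to the start of its interval...
    start = reading_time.replace(second=0, microsecond=0)
    start -= timedelta(minutes=start.minute % minutes)
    # ...and the interval end is simply one bucket later.
    end = start + timedelta(minutes=minutes)
    return f"{start:%Y-%m-%dT%H:%M:%S}--{end:%Y-%m-%dT%H:%M:%S}.avro"

print(interval_blob_name(datetime(2020, 1, 1, 13, 32, 17, tzinfo=timezone.utc)))
# -> 2020-01-01T13:30:00--2020-01-01T13:40:00.avro
```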

However, our data might arrive out of order, or cover only a subset of sensors for a given interval.

The format is based on that, so the requirements were:

  1. We need to be able to efficiently upsert data if it comes from the same interval but for a different subset of sensors.
  2. We need to be able to efficiently insert data even if it comes out of order.

Your suggestion of an alternative file format (sensor-1_2020-01-01T00:00:00--2020-01-31T00:00:00) could probably also solve both of these problems. Problem 1 is trivially solved (you just append to a different file), but Problem 2 is not as easily solved.

If your data comes in at 10:20 and then at 10:10 you'll have to either:

  1. rewrite the existing file so the 10:10 data ends up in the right place, or
  2. append it out of order and pay the cost of sorting at read time.

Our solution always ensures that the files are in order with efficient upserts, but it does pay the penalty of having multiple files you need to read from if you need a lot of data. I don't think one solution is better than the others.
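To illustrate the read-side penalty, here's a rough sketch using the Python Azure SDK (`azure-storage-blob`): covering a longer time range means listing and downloading several interval blobs instead of one large per-sensor file. The connection string, container name, and prefix are placeholders, not our actual setup:

```python
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    "<connection-string>", container_name="sensor-data"
)

# All ten-minute blobs for 2020-01-01 between 13:00 and 14:00 share this prefix.
for blob in container.list_blobs(name_starts_with="2020-01-01T13:"):
    data = container.download_blob(blob.name).readall()
    # ... deserialize the Avro blocks and filter for the sensors you need
```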

Efficient upserts

I don't know about Apache Spark, but yes, using Azure block blobs we're able to insert efficiently (one of the key criteria was that we'd like to be able to insert data without having to touch the existing data).
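Here's a minimal sketch of the block-blob mechanics, using the Python SDK rather than whatever you'd end up with in Spark. It's not our production code: it assumes the blob already exists, and that block IDs are derived from fixed-width timestamps so that sorting the IDs keeps the blob in time order. The connection string, container, and blob names are placeholders.

```python
import base64
from azure.storage.blob import BlobBlock, BlobClient

def make_block_id(timestamp_iso: str) -> str:
    # Block IDs must be base64-encoded and the same length within a blob,
    # so we encode a fixed-width ISO timestamp.
    return base64.b64encode(timestamp_iso.encode("utf-8")).decode("utf-8")

def upsert_block(blob: BlobClient, timestamp_iso: str, avro_bytes: bytes) -> None:
    block_id = make_block_id(timestamp_iso)

    # Stage the new data as an uncommitted block; existing data is untouched.
    blob.stage_block(block_id=block_id, data=avro_bytes)

    # Fetch the IDs of the blocks already committed to the blob.
    committed, _uncommitted = blob.get_block_list(block_list_type="all")
    ids = {b.id for b in committed}
    ids.add(block_id)

    # Commit the full block list sorted by ID (i.e. by timestamp), so the
    # blob stays in time order without rewriting any existing block.
    blob.commit_block_list([BlobBlock(block_id=i) for i in sorted(ids)])

blob = BlobClient.from_connection_string(
    "<connection-string>", container_name="sensor-data",
    blob_name="2020-01-01T13:30:00--2020-01-01T13:40:00.avro",
)
upsert_block(blob, "2020-01-01T13:32:00", b"<avro block bytes>")
```

The key point is that staging a block and committing a new block list are both metadata-level operations, so inserting or replacing a chunk of readings never requires downloading and rewriting the rest of the file.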


Let me know if you have any other questions. I think if your organisation is up for it and it makes sense, SCADA MINDS might also be able to help you with some code examples or developer support.