Azure / Azurite

A lightweight server clone of Azure Storage that simulates most of the commands supported by it with minimal dependencies
MIT License

Performance (wishlist) #1117

Open jmelkins opened 2 years ago

jmelkins commented 2 years ago

Since the readme says "Please reach to us if you have requirements or suggestions for a distributed Azurite implementation or higher performance." here goes ...

My use case is testing a data processing/aggregation/analysis pipeline, and I can't do this without using a lot of data locally. I have been able to run locally using the former Azure Storage Emulator, which has adequate performance for my task. I have tried switching to Azurite (now that it has table support), but I find it grinds to an almost complete halt once there are tens of thousands of messages in queues or rows in tables. This is fine for now because I can continue using the old emulator (I downloaded the installer again in case it disappears), but perhaps one day it will not support some key aspect of Azure Storage.

So my wish list:

- Simple Windows installer, with no dependency on installing Node.js or Docker.
- Uses a database (SQL Server, in my case) for all data and metadata, or has a disk-based storage mechanism with similar performance for multi-threaded insertion and deletion of queue messages, table rows, and block blobs in large numbers.
- Option to store block blobs in a folder rather than the database to minimise database size, with files stored individually on disk, replicating the folder structure in Azure Storage (metadata presumably still in the database).
- Data survives upgrades/reinstallation of Azurite.

Piedone commented 2 years ago

Related: https://github.com/Azure/Azurite/issues/705

XiaoningLiu commented 2 years ago

Thanks for the suggestions and wishlist.

Unlike the previous Azure Storage Emulator, Azurite is based on Node.js, is cross-platform, and supports multiple runtimes (OS, Visual Studio Code, Docker). Azurite provides higher compatibility than the classic emulator and unblocks customers on macOS or Linux across different architectures.

Achieving this compatibility involves a tradeoff: Azurite needs to keep its metadata in a cross-platform store. Currently, the default store is based on LokiJS, and this is a bottleneck: once the metadata reaches several GB, LokiJS performance degrades significantly.

There are other high-performance metadata store solutions, but none perfectly matches all of the compatibility requirements. For example, SQLite is a good lightweight option with higher performance and data capacity than LokiJS, but its native implementation does not fit well with the Visual Studio Code extension across platforms.

For Azurite blob storage, there is an experimental feature that allows connecting to an external SQL metadata store. It lets you start multiple Azurite processes to work around the single-core CPU utilization limit mentioned by @Piedone. The feature is available, but we haven't heard much feedback on it yet.
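As a rough sketch of how that experimental setup can look (based on the `AZURITE_DB` environment variable that Azurite's blob server factory checks; the exact connection-string format, supported dialects, and launch command may differ by version, so verify against the README before relying on this):

```shell
# Assumption: a MySQL-compatible database named azurite_blob already exists
# and the connection string below matches your local credentials.
export AZURITE_DB="mysql://username:password@localhost:3306/azurite_blob"

# With metadata in a shared SQL store, several Azurite blob instances can be
# started on different ports to spread load across CPU cores (each instance
# is a separate single-threaded Node.js process):
azurite-blob --blobHost 127.0.0.1 --blobPort 11000 --location ./data1 &
azurite-blob --blobHost 127.0.0.1 --blobPort 11001 --location ./data2 &
```

Clients would then be pointed at one of the two ports (or at a load balancer in front of both), since both instances share the same metadata database.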

Regarding a Windows installer without Node.js or Docker: Visual Studio 2022 integrates azurite.exe, but that standalone executable is currently signed for Visual Studio only. You can refer to the exe build script to build a standalone azurite.exe for your own use.

edwin-huber commented 2 years ago

@jmelkins : Hi, could I ask if you need the persistence feature while you are running your tests?

I'm currently testing different persistence options and their effects on performance for different scenarios. We could then expose these more conveniently for different usage patterns.

Understanding requirements for higher load scenarios will help me tailor my tests.

Thanks!

jmelkins commented 2 years ago

Hi, thanks for your reply.

Yes, I need the persistence feature. The first part of my workflow involves downloading data to blobs, followed by data extraction into a SQL database and Azure tables, and I need to avoid re-downloading and re-extracting the data while I test later stages of the process.

I suspect that the bottleneck in my workflow is table insertions and updates. Initially the process goes quickly, until the tables start filling up. I have two tables, and at the point where the process has completely slowed down they have about 120,000 and 100,000 rows in them. The Node.js process appears to be maxed out at full usage of one core, while disk access appears low (according to Task Manager).

edwin-huber commented 2 years ago

The default database backing the emulator is LokiJs, which has some options to increase performance.

If you need the data persisted between sessions (that is, beyond a single test run), there is a setting which allows us to partition the database based on collections, which can significantly improve response times for larger data sizes. The frequency with which data is written to the backing store can also be modified, as can the indexing options.
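To make those three knobs concrete, here is a standalone LokiJS sketch (not Azurite's actual wiring; the database name, collection name, and field names are hypothetical, and it requires `npm install lokijs`):

```javascript
const loki = require("lokijs");
// The structured adapter partitions the on-disk database per collection,
// so saving one large collection does not rewrite the whole database file:
const LokiFsStructuredAdapter = require("lokijs/src/loki-fs-structured-adapter");

const db = new loki("azurite-test.db", {
  adapter: new LokiFsStructuredAdapter(),
  autosave: true,
  autosaveInterval: 5000, // flush to disk every 5 s instead of per write
});

// Binary indices on the keys used for lookups keep finds and updates fast
// as the collection grows to 100k+ documents:
const entities = db.addCollection("tableEntities", {
  indices: ["PartitionKey", "RowKey"],
});

for (let i = 0; i < 100000; i++) {
  entities.insert({ PartitionKey: "p" + (i % 100), RowKey: "r" + i });
}

// Indexed lookup rather than a full collection scan:
const hit = entities.findOne({ PartitionKey: "p1", RowKey: "r1" });
console.log(hit.RowKey);

db.saveDatabase();
```

The partitioned adapter and the autosave interval trade durability granularity for throughput, which is why exposing them per usage pattern (ephemeral test runs vs. persisted data sets) makes sense.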

Knowing which operations you use most, how frequently, and how your data is modeled would help improve the relevancy of my tests, as well as the solutions we can present to optimize performance.

Currently I am focusing my efforts on the Table store, but you also mentioned that you noticed reduced performance when the queues were large:

> I find it grinds to an almost complete halt once there are tens of thousands of messages in queues or rows in tables.

Here too, it would be good to know the requirements, so that I can create more relevant test profiles.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.