Azure / Azurite

A lightweight server clone of Azure Storage that simulates most of the commands it supports, with minimal dependencies.

Is there a plan to support AdlsGen2 (Data Lake Storage) on top of the blob store emulator? #553

Open Arnaud-Nauwynck opened 4 years ago

Arnaud-Nauwynck commented 4 years ago

Which service(blob, file, queue, table) does this issue concern?

Not currently existing in V3: AdlsGen2 (Data Lake)

This is a question: is there a plan to support AdlsGen2 (Data Lake Storage) on top of the blob store emulator? If yes, when would it be available?

Which version of the Azurite was used?

V3; not usable yet, lacking AdlsGen2 support

Where do you get Azurite? (npm, DockerHub, NuGet, Visual Studio Code Extension)

npm

What's the Node.js version?

What problem was encountered?

Steps to reproduce the issue?

Have you found a mitigation/solution?

No. Not able to develop locally with the emulator; forced to connect to Azure.

XiaoningLiu commented 4 years ago

Hi @Arnaud-Nauwynck, can you describe your usage scenarios further? For example, are you using a Data Lake Gen2 account with or without hierarchical namespace enabled? Do you use any Data Lake storage SDKs? Which Data Lake Gen2 features are you most interested in? Interop between Blob and Data Lake Gen2?

rh99 commented 3 years ago

As for me, I'd like to use Spark and hadoop-azure, as described here: https://stackoverflow.com/questions/65050695/how-can-i-read-write-data-from-azurite-using-spark
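Roughly, the setup from that answer looks like the sketch below. The property names come from hadoop-azure's storage-emulator support and the container/path names are made up, so treat this as an untested assumption and verify against your Hadoop version:

```python
# Sketch: Spark + hadoop-azure against Azurite's blob endpoint (wasb://).
# Assumes the hadoop-azure and azure-storage jars are on the classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("azurite-smoke-test")
    # Treat "devstoreaccount1" in wasb:// URIs as the local emulator account
    # (http://127.0.0.1:10000) instead of a real *.blob.core.windows.net host.
    .config("spark.hadoop.fs.azure.storage.emulator.account.name", "devstoreaccount1")
    .getOrCreate()
)

# "test" is a placeholder container that must already exist in Azurite.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.mode("overwrite").parquet("wasb://test@devstoreaccount1/demo")
```

Note this exercises only the Blob (wasb) path; abfs:// against Azurite is exactly what's blocked on this issue.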

yuranos commented 3 years ago

Hi @XiaoningLiu We are using Gen 2 with hierarchical namespace enabled. We write our apps with Akka Streams and need to write to and read from Data Lake Gen 2. Akka Streams has an HDFS connector, and since Data Lake Gen 2 is HDFS-compatible, so far so good.

However, we need to add test coverage to our code: not just basic unit tests, but something that can verify the integration works end to end, without being tightly coupled to the network for real calls to Azure.

We implemented what we needed with testcontainers, but Hadoop inside Docker is a bit tricky to use and requires a rather unconventional setup (host resolution, some env vars, permissions, etc.), while all we need is the filesystem for tests.
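For comparison, running Azurite itself under testcontainers is much lighter than a Hadoop image. A minimal sketch in Python (we use the JVM flavor, but the idea is the same), assuming Azurite's published Docker image and its well-known development credentials; today this only gives you the flat Blob API, which is exactly the gap this issue is about:

```python
# Sketch: Azurite in testcontainers for integration tests (Blob API only).
from testcontainers.core.container import DockerContainer
from azure.storage.blob import BlobServiceClient

# Azurite's well-known development-storage account key.
KEY = "Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw=="

with DockerContainer("mcr.microsoft.com/azure-storage/azurite").with_exposed_ports(10000) as azurite:
    conn = (
        "DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;"
        f"AccountKey={KEY};"
        f"BlobEndpoint=http://{azurite.get_container_host_ip()}:"
        f"{azurite.get_exposed_port(10000)}/devstoreaccount1;"
    )
    # Real tests should add a readiness wait/retry before the first call.
    client = BlobServiceClient.from_connection_string(conn)
    container = client.create_container("it-test")
    container.upload_blob("hello.txt", b"hello")  # flat namespace only, no HNS
```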

KlaudiuszBryjaRelativity commented 3 years ago

When will you release the ADLS feature? BTW, do you have plans to add support for file shares?

karimdabbagh commented 3 years ago

Any updates on this?

XiaoningLiu commented 3 years ago

Hi guys, we hear your feedback and the asks to support Data Lake Gen2 in Azurite. We do think it's a valid ask and will keep this open to collect requirements and feedback. Unfortunately, it's not among our current priorities yet.

At the same time, if possible, please try to leverage your contacts with Azure and reach out to the Azure AdlsGen2 team with your asks directly. We have discussed this ask with the AdlsGen2 team before; it's better if they get more direct feedback, so they can better understand the scenario and its importance.

blueww commented 3 years ago

Another request for Data Lake Gen2: https://github.com/Azure/Azurite/issues/909

jonycodes commented 2 years ago

+1 This would be really helpful for testing features that use ADLS Gen2.

Amro77 commented 2 years ago

Any updates on this?

felixnext commented 2 years ago

@XiaoningLiu Is there any update on when this feature will arrive?

ac710k commented 2 years ago

Any update on this?

arony commented 2 years ago

@blueww @XiaoningLiu Any progress? It would be good to have this feature for integration tests.

barry-jones commented 1 year ago

Yes, please. Especially since it just straight up fails when testing .NET AzureDataLakeFileClient connections and writes, without a particularly useful error message.

kacperniwczykr1 commented 1 year ago

Is there any chance that we will receive any information on this? For some scenarios, the lack of ADLS Gen2 support is a killer.

darena-patrick commented 1 year ago

Attempting to build a dockerized, fully self-contained environment for doing some end-to-end testing using Cypress. Would love it if ADLS Gen2 support were available for this. Has any further discussion about this taken place, @XiaoningLiu?

XiaoningLiu commented 1 year ago

> Attempting to build a dockerized, fully self-contained environment for doing some end-to-end testing using Cypress. Would love it if ADLS Gen2 support were available for this. Has any further discussion about this taken place, @XiaoningLiu?

Yes, it's on our radar and is reviewed regularly. Pending Azurite features are scheduled based on customer asks, importance, workload, and team resources. Currently we are working on prioritized work items like Blob Batch, User Delegation SAS, etc.

dain commented 1 year ago

Support for ADLS Gen2 in this project is pretty critical for any team building support for ABFS. Without this simulator, testing integrations in projects like Trino and Iceberg will require coordinating volunteers who are trusted enough to hold real Azure credentials, which slows development. I have used this project for building and testing integrations with the blob APIs, and it makes the work enjoyable (I can assure you that integrating with most cloud systems is just painful).

liabozarth commented 1 year ago

+1

N-o-Z commented 1 year ago

> Support for ADLS Gen2 in this project is pretty critical for any team building support for ABFS. Without this simulator, testing integrations in projects like Trino and Iceberg will require coordinating volunteers who are trusted enough to hold real Azure credentials, which slows development. I have used this project for building and testing integrations with the blob APIs, and it makes the work enjoyable (I can assure you that integrating with most cloud systems is just painful).

I'd like to join the request, especially as more and more users transition to ADLS Gen2.

blueww commented 1 year ago

This is on our radar and is reviewed regularly. However, it is not on our recent priority list.

We will need the Azure AdlsGen2 team's support to implement this feature in Azurite. If possible, please try to leverage your contacts with Azure and reach out to the Azure AdlsGen2 team with your asks directly. It's better if they get more direct feedback, so they can better understand the scenario and its importance.

blueww commented 1 year ago

@MahmoudGSaleh , @N-o-Z, @Arnaud-Nauwynck, @liabozarth, @dain , @arony , @kacperniwczykr1, @felixnext , @barry-jones , @ac710k, @Amro77, @jonycodes

Would you please share how you would like to use AdlsGen2 with Azurite?

Though ADLS Gen2 is exposed as a REST API, it was designed to be used by drivers (mostly ABFS). Could you please share which features of the ADLS Gen2 DFS endpoint you are interested in using via REST that are not exposed via Blob?

This information will help us better prioritize the feature for Azurite.

Arithmomaniac commented 1 year ago

Directory creation and manipulation, for one. (The client may end up using ABFS, but any C# server code that sets up a filesystem for e.g. integration testing will want to use the SDK.) And once you need to use the ADLS REST API for anything, there's a decent chance your application won't use the Blob API at all, even for things it could.
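To make that concrete, the fixture code looks roughly like this with the Python SDK (our C# DataLakeServiceClient calls are analogous); account, key, and names are placeholders, and today it has to point at a real HNS-enabled account because Azurite has no DFS endpoint:

```python
# Sketch: the directory setup an integration-test fixture needs via the
# Data Lake SDK -- exactly the surface Azurite cannot emulate yet.
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://myaccount.dfs.core.windows.net",  # placeholder account
    credential="<account-key>",                            # placeholder key
)
fs = service.create_file_system("it-fixtures")

# Real directories (not zero-byte marker blobs), plus an atomic rename:
# both are HNS behaviors the Blob API cannot stand in for.
dir_client = fs.get_directory_client("raw/2024/01")
dir_client.create_directory()
dir_client.rename_directory(f"{fs.file_system_name}/raw/2024/02")
```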

N-o-Z commented 1 year ago

@blueww We need Azurite to simulate ADLS Gen2 behavior, specifically the way it deals with directories and object listing (HNS). We provide our clients with services over their Azure storage accounts, and we want to be able to test that our logic works both for Blob storage and for ADLS Gen2. We use the Azurite simulator for our unit tests and, as of now, can only verify correctness against Blob storage behavior.
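A sketch of the listing we can't currently verify locally (Python SDK, placeholder account and names): get_paths on the DFS endpoint returns real directory entries, which a blob-only emulator can't reproduce:

```python
# Sketch: hierarchical listing via the DFS endpoint (needs a real HNS
# account today; names are placeholders).
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://myaccount.dfs.core.windows.net",  # placeholder
    credential="<account-key>",
)
fs = service.get_file_system_client("data")

# is_directory distinguishes real directories from files: the HNS
# behavior we want to assert on locally.
for path in fs.get_paths(path="raw", recursive=True):
    print("dir " if path.is_directory else "file", path.name)
```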

kacperniwczykr1 commented 1 year ago

@blueww We have a scenario very similar to the one described by @N-o-Z. We want to make sure that the data we work on is properly structured within our tests. HNS is the key feature we are missing.

karimdabbagh commented 1 year ago

@blueww Our service (C# code) primarily works with directories (and files). So, in order to have our unit/integration tests use the same client as our code, we would need to either emulate HNS ourselves or use Azurite with HNS enabled.

dain commented 1 year ago

> Would you please share how you would like to use AdlsGen2 with Azurite?

I work on Trino, and we are in the process of replacing the Hadoop dependencies with custom code, because the Hadoop code is leaky, rarely updated, and kind of poorly maintained. As part of this, we are building new file system interfaces that use the cloud storage APIs directly (instead of going through HDFS). To write this code we need to be able to test it, and Azurite is a great way for the volunteer open source developers to test changes without needing access to an Azure account. The key to making this work is that the Blob and DFS APIs need to behave exactly the same as Azure, which I have found not to be the case even for the Blob APIs (e.g., paths are not normalized in Azurite like they are in Azure). In general, without something like Azurite, maintaining the Azure integration will be harder (and generally that means less maintenance).

mlongtin0 commented 1 year ago

We make heavy use of ACLs. Azure doesn't provide very useful tools to manage ACLs, so we made our own commands. However, we'd rather test against a local test server than a real storage account.
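For context, these are the DFS-only ACL calls our commands wrap; a sketch with the Python SDK (placeholder account, key, object ID, and paths):

```python
# Sketch: POSIX-style ACLs on a directory via the DFS endpoint.
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://myaccount.dfs.core.windows.net",  # placeholder
    credential="<account-key>",
)
dir_client = service.get_file_system_client("data").get_directory_client("raw")

# Grant a specific AAD object ID read/execute on the directory
# (the object ID below is a placeholder).
dir_client.set_access_control(
    acl="user::rwx,group::r-x,other::---,"
        "user:11111111-2222-3333-4444-555555555555:r-x"
)
print(dir_client.get_access_control()["acl"])
```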

blueww commented 1 year ago

We have added a wiki with our requirements and general expectations for PRs that add ADLS Gen2 support to Azurite: https://github.com/Azure/Azurite/wiki/ADLS-Gen2-Implementation-Guidance

Azurite welcomes contributions! If you would like to help implement ADLS Gen2 in Azurite, please read the wiki and follow it when designing/implementing ADLS Gen2 in Azurite (ideally, review the detailed design with us first) to get a smooth PR review/merge.

mlongtin0 commented 1 year ago

The DFS endpoint is not available in Azure if hierarchical namespace is not enabled. The Blob endpoint works on HNS accounts with some limitations (indexed tags don't work with HNS, for example).

blueww commented 1 year ago

@mlongtin0 Per our tests, the DFS endpoint is currently available on storage accounts that do not have hierarchical namespace enabled, although some DFS APIs/parameters are still not supported on that kind of account. And in the DFS REST API docs you can see many parameters marked "only valid if Hierarchical Namespace is enabled for the account", which means the API is available on non-HNS storage accounts, but those parameters are not.

mlongtin0 commented 1 year ago

My bad, I could have sworn I tried it and it failed. It seems to work fine.

jasonmohyla commented 1 year ago

+1 for HNS support for build pipeline unit tests

dekiesel commented 4 months ago

Bummer. I just wasted 4 hours getting my tests to use Azurite, and now I see that ADLS Gen2 isn't supported (my fault!). Any guidance on how to test ADLS Gen2 calls locally otherwise (using Python)?
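The best workaround I've found so far is to route the tests that don't depend on HNS semantics through the Blob API against Azurite; a sketch with the well-known dev credentials (anything needing ACLs, atomic renames, or real directories still needs a real HNS account):

```python
# Sketch: fall back to the Blob API against Azurite for non-HNS test paths.
from azure.storage.blob import BlobServiceClient

AZURITE_CONN = (
    "DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;"
    "AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;"
    "BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1;"
)

def test_roundtrip():
    client = BlobServiceClient.from_connection_string(AZURITE_CONN)
    container = client.create_container("adls-fallback")
    # "Directories" here are just name prefixes, not real HNS directories.
    container.upload_blob("raw/2024/01/part-0000.json", b"{}")
    names = [b.name for b in container.list_blobs(name_starts_with="raw/")]
    assert names == ["raw/2024/01/part-0000.json"]
```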

stenneepro commented 1 month ago

We are using Azure Data Lake Storage Gen2 just to control access at the folder level. For example, there are three roles: admin, supervisor, and manager. An admin can access all folders and files in the container, a supervisor can access only the supervisor folder, and a manager can access only the manager folder. We generate a SAS token when a user logs in, and they use the SAS token to access files.

If Azurite supported ADLS, that would be great for local development. Or is there any other solution for implementing folder-level access with Azure Blob Storage?
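For reference, our folder-scoped tokens look roughly like this with the Python SDK (account, key, and folder are placeholders; check the helper's exact signature against the SDK docs):

```python
# Sketch: issue a SAS scoped to one role's folder at login time.
from datetime import datetime, timedelta, timezone
from azure.storage.filedatalake import generate_directory_sas

sas = generate_directory_sas(
    account_name="myaccount",        # placeholder
    file_system_name="data",
    directory_name="manager",        # the manager role sees only this folder
    credential="<account-key>",      # placeholder
    permission="rl",                 # read + list within the folder
    expiry=datetime.now(timezone.utc) + timedelta(hours=1),
)
url = f"https://myaccount.dfs.core.windows.net/data/manager?{sas}"
```

This is exactly what Azurite can't validate today, since folder-scoped SAS relies on the DFS endpoint and HNS.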

cool-mist commented 1 month ago

@XiaoningLiu wrote:

> Yes, it's on our radar and is reviewed regularly. Pending Azurite features are scheduled based on customer asks, importance, workload, and team resources. Currently we are working on prioritized work items like Blob Batch, User Delegation SAS, etc.

I will apologize in advance for being abrasive here, but at this point it just feels like Microsoft is trolling developers. The only alternative that exists (testing against an actual storage account) incurs significant costs. Looking at the code base, this should be just a week of work for the Azure Storage team to implement an in-memory version of a file system adhering to the ADLS Gen2 features :(. It has already been close to 4 years since this issue was opened, and the Azure teams responsible have still not "prioritized" this work, which would improve the dev experience of writing code against one of the core features of Azure Storage!!

cool-mist commented 1 month ago

From the implementation design discussion here, I see the following:

  1. Implement HNS metadata store in Azurite
     i. Any schema change or new table design should be reviewed and signed off.
     ii. We need to maintain hierarchical relationships between parent and child dir/file. For example, we can add a table matching each item (blob/dir) with its parent, and integrate the existing blob tables with the new table added above (detail design needs discussion).
     iii. Blob/file binary payload persistency based on local files shouldn't be changed.

Would it be better to use some form of trie as the underlying data structure, where each node corresponds to a path segment?

  1. Create path/directory = add a node.
  2. Delete path/directory = delete a node. If the recursive flag is set, drop the sub-tree; otherwise, error.
  3. Rename path/directory = detach the sub-trie and re-parent it under the new path.
  4. Update paths

Each node will contain the following metadata:

  1. x-ms-properties dictionary
  2. acl rules for the path
  3. isDir flag
  4. file_data pointer to a utf-8 byte array

File nodes can be restricted to not contain children. File nodes will additionally have a file_data pointer that points to a byte array. To start with, you could restrict the byte array length (say 4 MB, supporting files up to 4 MB). Allocate the byte arrays as an array of arrays and reuse the byte arrays themselves for memory reasons.
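A minimal illustrative sketch of that trie (not Azurite code), with create/delete/rename as node operations:

```python
# Illustrative path trie: one node per segment; not actual Azurite code.
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Node:
    is_dir: bool = True
    properties: dict[str, str] = field(default_factory=dict)  # x-ms-properties
    acl: list[str] = field(default_factory=list)              # ACL rules
    data: bytearray | None = None                             # file payload
    children: dict[str, Node] = field(default_factory=dict)

class PathTrie:
    def __init__(self) -> None:
        self.root = Node()

    def _parent_of(self, path: str) -> tuple[Node, str]:
        *parents, leaf = path.strip("/").split("/")
        node = self.root
        for seg in parents:
            node = node.children[seg]  # KeyError maps to 404 PathNotFound
        return node, leaf

    def create(self, path: str, is_dir: bool = True) -> None:
        parent, leaf = self._parent_of(path)
        parent.children[leaf] = Node(is_dir=is_dir,
                                     data=None if is_dir else bytearray())

    def delete(self, path: str, recursive: bool = False) -> None:
        parent, leaf = self._parent_of(path)
        if parent.children[leaf].children and not recursive:
            raise ValueError("DirectoryNotEmpty")  # the error case above
        del parent.children[leaf]

    def rename(self, src: str, dst: str) -> None:
        # Detach the sub-trie and re-parent it under the new path (step 3).
        src_parent, src_leaf = self._parent_of(src)
        node = src_parent.children.pop(src_leaf)
        dst_parent, dst_leaf = self._parent_of(dst)
        dst_parent.children[dst_leaf] = node
```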

blueww commented 1 month ago

> From the implementation design discussion here, I see the following:
>
>   1. Implement HNS metadata store in Azurite
>      i. Any schema change or new table design should be reviewed and signed off.
>      ii. We need to maintain hierarchical relationships between parent and child dir/file. For example, we can add a table matching each item (blob/dir) with its parent, and integrate the existing blob tables with the new table added above (detail design needs discussion).
>      iii. Blob/file binary payload persistency based on local files shouldn't be changed.
>
> Would it be better to use some form of trie as the underlying data structure, where each node corresponds to a path segment?
>
>   1. Create path/directory = add a node.
>   2. Delete path/directory = delete a node. If the recursive flag is set, drop the sub-tree; otherwise, error.
>   3. Rename path/directory = detach the sub-trie and re-parent it under the new path.
>   4. Update paths
>
> Each node will contain the following metadata:
>
>   1. x-ms-properties dictionary
>   2. acl rules for the path
>   3. isDir flag
>   4. file_data pointer to a utf-8 byte array
>
> File nodes can be restricted to not contain children. File nodes will additionally have a file_data pointer that points to a byte array. To start with, you could restrict the byte array length (say 4 MB, supporting files up to 4 MB). Allocate the byte arrays as an array of arrays and reuse the byte arrays themselves for memory reasons.

@cool-mist Thanks for the suggestion!

Regarding your suggestion to use a trie: it might be applicable; however, the current design comes from:

  1. We should use a structure similar to the Azure server implementation, to get behavior/performance similar to the Azure server.
  2. Utilize the current Azurite implementation, changing as little as possible to lower cost and regression risk.

We can revisit it and discuss more when we finish Phase I and start the Phase II implementation.

Azurite welcomes contributions! If you are interested in implementing Data Lake Gen2 in Azurite, please raise your detailed design, and once we agree on it, you can raise implementation PRs (split into several small PRs to ease review). We could start with Phase I (DFS endpoint on FNS accounts), then Phase II (DFS/Blob endpoint on HNS accounts).