NVIDIA / aistore

AIStore: scalable storage for AI applications
https://aistore.nvidia.com
MIT License

Will AIStore support HDFS? #71

Closed: weberxie closed this issue 3 years ago

VirrageS commented 3 years ago

Hey!

Could you explain what you mean by "support"?

Thanks! Janusz

weberxie commented 3 years ago

Thanks for your quick response!

In other words, the training data is stored on HDFS.

knopt commented 3 years ago

Which of the following do you mean precisely:

  1. AIStore should be an HDFS-compatible service
  2. AIStore should support HDFS as a cloud provider. In other words, AIStore should support operations (GET, PUT, caching, ETL, distributed shuffle, and more) on data stored on HDFS

Thanks!

weberxie commented 3 years ago

The 2nd.

Our group's training data is stored on HDFS. We'd like to accelerate the reading process, and we're trying to find an easy way to let PyTorch read data from HDFS with the same API as a native PyTorch Dataset.
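
To illustrate the access pattern we're hoping for, here is a rough sketch of a plain PyTorch Dataset that pulls each sample over HTTP from an AIStore proxy. The proxy address and the /v1/objects/<bucket>/<object> path are assumptions on my part, not a confirmed API:

```python
# Rough sketch only: a PyTorch Dataset that fetches each sample over HTTP.
# The endpoint address and URL layout below are assumptions, not AIStore's
# confirmed API.
import requests
from torch.utils.data import Dataset

class AISObjectDataset(Dataset):
    def __init__(self, endpoint, bucket, object_names):
        self.endpoint = endpoint          # e.g. "http://aistore-proxy:8080" (hypothetical address)
        self.bucket = bucket
        self.object_names = object_names  # object names listed ahead of time

    def __len__(self):
        return len(self.object_names)

    def __getitem__(self, idx):
        name = self.object_names[idx]
        # Assumed object-GET path: /v1/objects/<bucket>/<object>
        url = f"{self.endpoint}/v1/objects/{self.bucket}/{name}"
        resp = requests.get(url)
        resp.raise_for_status()
        return resp.content               # raw bytes; decode/transform as needed
```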

Thanks for your reply and your great project!

knopt commented 3 years ago

Understood, thank you for the suggestion! We will confer with the team on whether this feature is feasible and, if so, when.

We'll get back to you once we have any progress on this one.

VirrageS commented 3 years ago

Hey @weberxie, we were planning to implement the HDFS support, but we've stumbled on a design problem. We have a notion of a bucket, similar to the way it works in, for example, AWS S3. Unfortunately, HDFS doesn't have such an abstraction at all, and we don't have a clear idea of how this could work. To be more specific, the questions we are facing:

  * What is a bucket in terms of HDFS? (is it just abstraction on our side?)
  * What would it mean to list buckets in HDFS or list objects inside the bucket?

We were thinking about two approaches:

  1. Buckets are created on the fly, and we always report that they exist when handling a HEAD request. The bucket would have access to the full filesystem, meaning that the request (bucket, tmp/object/file.txt) would translate to just lookup(/tmp/object/file.txt). Listing buckets would list only the buckets that were created by some operation.
  2. Buckets correspond to the first level of directories. Listing buckets would be equivalent to listing all directories under / (root). The request (bucket, tmp/object/file.txt) would translate to lookup(/bucket/tmp/object/file.txt). There is a slight problem with this approach: there is no good way to access files directly under the root directory, for example /file.txt. On the other hand, the operations are a little safer because the buckets are isolated (in the first approach it is very easy to access a different bucket's path and overwrite someone else's file).
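
To make the difference concrete, here is a rough sketch of the two mappings (illustration only, with made-up helper names, not actual AIStore code):

```python
# Illustration of the two proposed (bucket, object) -> HDFS path mappings.

def hdfs_path_approach_1(bucket: str, object_name: str) -> str:
    # Approach 1: every bucket sees the full filesystem, so the object name
    # is effectively the absolute path and the bucket itself does not appear.
    # (any bucket, "tmp/object/file.txt") -> "/tmp/object/file.txt"
    return "/" + object_name

def hdfs_path_approach_2(bucket: str, object_name: str) -> str:
    # Approach 2: buckets correspond to first-level directories under "/",
    # so the bucket name becomes the leading path component.
    # ("bucket", "tmp/object/file.txt") -> "/bucket/tmp/object/file.txt"
    return "/" + bucket + "/" + object_name
```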

Let us know if any of these approaches make sense. And of course we are open to any suggestions/design that you may have in mind!

weberxie commented 3 years ago

Great news!

> What is a bucket in terms of HDFS? (is it just abstraction on our side?)

HDFS does not have the concept of bucket.

> What would it mean to list buckets in HDFS or list objects inside the bucket?

According to my knowledge, it means to list files under the folder.

I think a bucket and a root path are not the same thing in the strict sense; the bucket is more like the parent path for storing the training data.

VirrageS commented 3 years ago

> HDFS does not have the concept of bucket.

> According to my knowledge, it means to list files under the folder.

Yeah, I know, but it's not an HDFS-specific question; it's rather a question for us about how we interpret this. In some sense we need to create some abstraction of a bucket, otherwise AIStore will not be able to handle the request.

> I think a bucket and a root path are not the same thing in the strict sense; the bucket is more like the parent path for storing the training data.

I see. How then would you transform a request from a user (bucket_name, object_name) into an HDFS client request? More specifically, what file should it request?

weberxie commented 3 years ago

> How then would you transform a request from a user (bucket_name, object_name) into an HDFS client request? More specifically, what file should it request?

If the object_name is the path, you can get the files with the HDFS API directly. I think you can assume the bucket_name is the parent path of the training file.

VirrageS commented 3 years ago

> I think you can assume the bucket_name is the parent path of the training file.

But you cannot do that, because a bucket_name cannot contain any slashes, only alphanumeric characters.
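
One way this could be reconciled (a hypothetical sketch, not necessarily how AIStore would implement it) is to keep bucket names alphanumeric and attach a configured HDFS directory to each bucket, then resolve (bucket_name, object_name) against that directory:

```python
import posixpath
import re

# Hypothetical sketch: bucket names stay alphanumeric, and each bucket is
# configured with an HDFS "reference directory" that object names resolve
# against. The mapping below is made up for illustration.
BUCKET_NAME_RE = re.compile(r"^[A-Za-z0-9]+$")

bucket_to_dir = {
    "imagenet": "/datasets/imagenet",        # made-up example values
    "openimages": "/datasets/openimages",
}

def resolve_hdfs_path(bucket_name: str, object_name: str) -> str:
    if not BUCKET_NAME_RE.match(bucket_name):
        raise ValueError(f"invalid bucket name: {bucket_name!r}")
    ref_dir = bucket_to_dir[bucket_name]
    # ("imagenet", "train/shard-000.tar") -> "/datasets/imagenet/train/shard-000.tar"
    return posixpath.join(ref_dir, object_name)
```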

VirrageS commented 3 years ago

HDFS support as a backend provider has been implemented. It's in the beta stage (not yet fully tested, but we are working on it!), so there is a chance of encountering a bug. We encourage filing an issue if something doesn't work as expected or is not fully clear.
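
As a rough illustration, reading an object from an HDFS-backed bucket over the HTTP API could look like the snippet below; the proxy address, the bucket name, and the provider query parameter are assumptions here, so please refer to the documentation for the authoritative interface:

```python
# Hedged sketch: fetch one object from an HDFS-backed bucket via the AIStore proxy.
import requests

AIS_ENDPOINT = "http://aistore-proxy:8080"  # hypothetical proxy address

resp = requests.get(
    f"{AIS_ENDPOINT}/v1/objects/train-data/imagenet/shard-000.tar",  # "train-data" is a made-up bucket
    params={"provider": "hdfs"},  # assumed selector for the HDFS backend
)
resp.raise_for_status()
data = resp.content
```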

Documentation: