Closed weberxie closed 3 years ago
Thanks for your quick response!
In other words, the training data is stored on HDFS.
Which of the following do you mean precisely:
Thanks!
The 2nd.
Our group's training data is stored on HDFS. We'd like to accelerate the reading process, and we're trying to find an easy way to let PyTorch read data from HDFS with the same API as the native PyTorch Dataset.
Thanks for your reply and your great project!
Understood, thank you for the suggestion! We will confer with the team on whether this feature is feasible and, if so, when.
We'll get back to you once we have any progress on this one.
Hey @weberxie, we were planning to implement the HDFS support, but we've stumbled on a design problem. We have a notion of a bucket, similar to the way it works in, for example, AWS S3. Unfortunately, HDFS doesn't have such an abstraction at all, and we're not sure how this could work. To be more specific, these are the questions we are facing:
We were thinking about two approaches:
1. No bucket abstraction at all: the request (bucket, tmp/object/file.txt) would translate to just lookup(/tmp/object/file.txt). Listing buckets would just list the buckets that were created by any operation.
2. Each bucket is a directory under / (root): the request (bucket, tmp/object/file.txt) would translate to lookup(/bucket/tmp/object/file.txt). There is a slight problem with this approach: there is no good way to access files directly under the root directory, for example /file.txt. But the operations are a bit safer here, because the buckets are isolated (in the first approach it is very easy to access a different bucket and overwrite someone else's file).

Let us know if either of these approaches makes sense. And of course we are open to any suggestions/designs you may have in mind!
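For concreteness, the two translations could be sketched roughly like this (a toy Python sketch; the function names are illustrative, not AIStore's actual API):

```python
from posixpath import join, normpath

# Hypothetical sketch of the two bucket-to-HDFS-path mappings discussed
# above; names are illustrative, not AIStore's real implementation.

def flat_namespace_path(bucket: str, object_name: str) -> str:
    """Approach 1: the bucket is ignored; the object name alone
    identifies the HDFS file."""
    return normpath("/" + object_name)

def bucket_as_root_dir_path(bucket: str, object_name: str) -> str:
    """Approach 2: each bucket maps to a directory under /, so objects
    from different buckets cannot collide."""
    return normpath(join("/", bucket, object_name))

# Approach 1: two different buckets resolve to the same HDFS file,
# so one bucket can overwrite another bucket's data.
print(flat_namespace_path("bucket-a", "tmp/object/file.txt"))  # /tmp/object/file.txt
print(flat_namespace_path("bucket-b", "tmp/object/file.txt"))  # /tmp/object/file.txt

# Approach 2: paths are isolated per bucket, but a file directly under
# the HDFS root, e.g. /file.txt, has no (bucket, object) form.
print(bucket_as_root_dir_path("bucket-a", "tmp/object/file.txt"))  # /bucket-a/tmp/object/file.txt
```

This makes the trade-off visible: the first mapping reaches every HDFS path but offers no isolation, while the second isolates buckets but cannot address files under the root.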
Great news!
What is a bucket in terms of HDFS? (Is it just an abstraction on our side?)
HDFS does not have the concept of bucket.
What would it mean to list buckets in HDFS or list objects inside the bucket?
According to my knowledge, it means to list files under the folder.
I think Bucket and root paths are not the same thing in the strict sense, the Bucket is more like the parent path for storing training data.
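The interpretation above ("list objects inside the bucket" means "list files under the bucket's folder") could look roughly like this toy sketch, using the local filesystem as a stand-in for HDFS (names are hypothetical):

```python
import os
import tempfile

def list_objects(root: str, bucket: str) -> list[str]:
    """List object names as paths relative to the directory that
    backs `bucket` (here a plain directory under `root`)."""
    bucket_dir = os.path.join(root, bucket)
    objects = []
    for dirpath, _dirs, files in os.walk(bucket_dir):
        for f in files:
            full = os.path.join(dirpath, f)
            # The object name is the file path relative to the bucket dir.
            objects.append(os.path.relpath(full, bucket_dir))
    return sorted(objects)

# Build a tiny tree standing in for /<bucket>/... on HDFS.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "train", "images"))
open(os.path.join(root, "train", "images", "a.jpg"), "w").close()
open(os.path.join(root, "train", "labels.csv"), "w").close()

print(list_objects(root, "train"))  # ['images/a.jpg', 'labels.csv']
```

Against a real cluster the directory walk would go through an HDFS client instead of `os.walk`, but the bucket-to-folder interpretation is the same.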
HDFS does not have the concept of bucket.
According to my knowledge, it means to list files under the folder.
Yeah, I know, but it's not an HDFS-specific question; it's rather a question for us about how we interpret this. In some sense we need to create some abstraction of a bucket, otherwise AIStore will not be able to handle the request.
I think Bucket and root paths are not the same thing in the strict sense, the Bucket is more like the parent path for storing training data.
I see. How then would you transform a request from a user (bucket_name, object_name) into an HDFS client request? More specifically, what file should it request?
How then would you transform a request from a user (bucket_name, object_name) into an HDFS client request? More specifically, what file should it request?
If the object_name is the path, you can fetch files with the HDFS API directly. I think you can assume the bucket_name is the parent path of the training file.
I think you can assume the bucket_name is the parent path of the training file.
But you cannot do that, because bucket_name cannot contain any slashes; only alphanumeric characters are allowed.
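The constraint above can be sketched as a simple validation check (the exact naming rule in AIStore may differ; this only illustrates why a parent path like `/data/train` cannot serve as a bucket name):

```python
import re

# Assumed rule from the discussion above: bucket names are
# alphanumeric only, so they can never encode a slash-separated path.
BUCKET_NAME_RE = re.compile(r"^[A-Za-z0-9]+$")

def is_valid_bucket_name(name: str) -> bool:
    """Return True if `name` satisfies the alphanumeric-only rule."""
    return bool(BUCKET_NAME_RE.match(name))

print(is_valid_bucket_name("trainingdata"))  # True
print(is_valid_bucket_name("/data/train"))   # False: contains slashes
```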
The HDFS support as a backend provider has been implemented. It's in the beta stage (not yet fully tested, but we are working on it!), so there is a chance of encountering a bug. We encourage filing an issue if something doesn't work as expected or is not fully clear.
Documentation:
Hey!
Could you explain what you mean by "support"?
Thanks! Janusz