NVIDIA / aistore

AIStore: scalable storage for AI applications
https://aistore.nvidia.com
MIT License

Ability to pass a JSON object instead of SortedFile #113

Closed apatel-fn closed 1 year ago

apatel-fn commented 1 year ago

Starting a dsort shuffle/sort with a custom order file becomes difficult when working within a VPC that doesn't support HTTPS. Currently, custom training jobs cannot easily request an ordering without managing the upload of an OrderFile. Ideally, we would avoid pushing the file to object storage and passing it to the dsort API via HTTPS requests, and avoid creating different configurations for the aistore permissions. It may be useful to allow this to be requested from the API/SDK with a JSON object that encodes this data! This would also greatly simplify custom batch balancing for resampling methods that are not easily configurable through content or file name sorting (or the outlined OrderFile process).

Right now, this behavior can be achieved with individual xactions, but there can be a considerable performance benefit to marshaling the request directly in JSON and letting dsort manage the memory allocations, so that the action is processed faster than a series of xactions (especially for a large dataset).

Is there an easy way to do the above within the current setup?

VirrageS commented 1 year ago

Hello @apatel-fn, thanks for the comment.

> It may be useful to allow this to be requested from the API/SDK with a JSON object that encodes this data

How do you envision this? Should AIStore define a JSON structure that users would expose, so that we would be able to get the data from it?

apatel-fn commented 1 year ago

Hi @VirrageS! Yes, exactly, similar to the existing notation that the OrderFile/EKM is organized in, with a Tuple(filename, new_shardname) schema, where each new shard/archive name is the key for the list of files that should belong to it. Based on my naive look at the code, the RequestSpec transforms the OrderFile into an EKM, and what I envision is a way in the API/SDK to pass that in via a JSON payload consumed by the POST. Does that make sense?
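
To make the schema described above concrete, here is a minimal sketch of grouping (filename, new_shardname) pairs into a mapping keyed by shard name. The helper `group_by_shard` and the sample names are hypothetical and only illustrate the idea; they are not part of the AIStore API or SDK.

```python
from collections import defaultdict


def group_by_shard(pairs):
    """Group (filename, new_shardname) pairs into {shardname: [filenames]}.

    Hypothetical helper for illustration only; not an AIStore function.
    """
    shards = defaultdict(list)
    for filename, shard in pairs:
        shards[shard].append(filename)  # keeps the order in which files were listed
    return dict(shards)


# Illustrative input: each file is assigned to the shard it should end up in.
pairs = [
    ("file1.img", "shard-1.tar"),
    ("file2.img", "shard-1.tar"),
    ("file3.img", "shard-2.tar"),
]
mapping = group_by_shard(pairs)
```

The resulting dictionary is the kind of object that could then be serialized to JSON and sent in a request body, instead of being written to an order file and uploaded.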

VirrageS commented 1 year ago

@apatel-fn sorry for the late response, do you have any links to this OrderFile or EKM so I could look it up?

apatel-fn commented 1 year ago

Sure! Here are the tests for OrderFile and EKM generation, and the OrderFile in request_spec. We would ideally add another Enum for the JSON-formatted content of what is in the OrderFile. This is how the external KeyMap is generated from the OrderFile (looking up the Go references to this function will show how it's used in aistore/dsort).

VirrageS commented 1 year ago

Oh, I misunderstood then. I think I get it now.

So could you provide the JSON structure that you envision for this "JSON formatted content"?

apatel-fn commented 1 year ago

This would hold the same content that is parsed from the order file, as a dictionary of lists. For example:

{
    "shard-set-1.tar": ["file1.img", "file2.img", "file3.img"],
    "shard-set-2.tar": ["file3.img", "file4.img", "file5.img"]
}

The keys of the top-level dictionary are the shard names (the same as in the order file), and each value is a list of strings naming the files (as they appear in the order file, except that there they are indented and newline-separated).
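
As a rough sketch of how a client might build and sanity-check such an object before sending it, the snippet below serializes a shard-to-files dictionary with Python's standard `json` module and validates the shape on the way back in. The function `parse_order_json` is a hypothetical helper, not an AIStore API, and the format AIStore eventually accepted may differ from this proposal.

```python
import json


def parse_order_json(text):
    """Parse and validate a {shard_name: [file_name, ...]} JSON document.

    Hypothetical validator for the structure proposed in this thread.
    """
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("top-level JSON value must be an object")
    for shard, files in data.items():
        if not isinstance(files, list) or not all(isinstance(f, str) for f in files):
            raise ValueError(f"value for {shard!r} must be a list of strings")
    return data


# Mirrors the example structure: shard name -> list of file names.
order = {
    "shard-set-1.tar": ["file1.img", "file2.img", "file3.img"],
    "shard-set-2.tar": ["file3.img", "file4.img", "file5.img"],
}
payload = json.dumps(order, indent=2)  # string suitable for a request body
parsed = parse_order_json(payload)     # round-trips back to the same mapping
```

Validating on the client side keeps malformed orderings (for example, a single string where a list is expected) from reaching the server at all.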

VirrageS commented 1 year ago

Thanks, I will try to work on that.

VirrageS commented 1 year ago

It took much longer than expected, but support for a JSON order file has been added.