apatel-fn closed this 1 year ago
Hello @apatel-fn, thanks for the comment.
It may be useful to allow this to be requested from the API/SDK with a JSON object that encodes this data
How do you envision this? Should AIStore define a JSON structure that users would fill in, so that we could read the data from it?
Hi @VirrageS! Yes, exactly, similar to the existing notation that the OrderFile/EKM is organized in, with a Tuple(Filename, new_shardname) schema, where the new shard/archive name is keyed with the list of files that should belong to it. Based on my naive look at the code, the RequestSpec transforms the OrderFile into an EKM, and having a way in the API/SDK to pass that in via a JSON payload consumed by the POST is what I envision. Does that make sense?
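To make the shape concrete, here is a minimal sketch of grouping (filename, new shard name) pairs into a mapping keyed by shard name. The function name `group_by_shard` and the sample file and shard names are hypothetical, for illustration only; this is not AIStore code.

```python
from collections import defaultdict


def group_by_shard(pairs):
    """Group (filename, new_shardname) pairs into {shardname: [filenames]}.

    Hypothetical helper: it only illustrates the keyed layout described
    above and does not call any AIStore API.
    """
    shards = defaultdict(list)
    for filename, shard_name in pairs:
        # Files keep the order in which they were listed for each shard.
        shards[shard_name].append(filename)
    return dict(shards)


# Illustrative input; the names are made up.
pairs = [
    ("file1.img", "shard-1.tar"),
    ("file2.img", "shard-1.tar"),
    ("file3.img", "shard-2.tar"),
]
mapping = group_by_shard(pairs)
```

The resulting `mapping` is a plain dictionary, so it can be serialized to JSON directly.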
@apatel-fn sorry for the late response. Do you have any links to this OrderFile or EKM so I could look it up?
Sure! Here are the tests for OrderFile and EKM generation, and the OrderFile in request_spec; we would ideally add another Enum for the JSON-formatted content of what is in the OrderFile. This is how the external KeyMap is generated with the OrderingFile (Go-referencing this function will show how it's used in aistore/dsort).
Oh, I misunderstood then. I think I get it now.
So could you provide the JSON structure that you envision for this? "JSON formatted content"
This would carry the same content as the parsed order file, as a dictionary of lists. For example:
{
  "shard-sep-1.tar": ["file1.img", "file2.img", "file3.img"],
  "shard-set-2.tar": ["file3.img", "file4.img", "file5.img"]
}
Here the keys of the top-level dictionary are the shard names (the same way they appear in the order file), and each value is a list of strings naming the files (as they appear in the order file, minus the indentation and newline separation).
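As a rough sketch of how a client might check and serialize such a mapping before sending it, assuming only the shape described above (string keys, each mapped to a list of strings): the function name `encode_order_mapping` is hypothetical, and how the payload would actually be passed to dsort is not specified in this thread.

```python
import json


def encode_order_mapping(mapping):
    """Check the {shard_name: [file_name, ...]} shape and return a JSON string.

    Hypothetical helper: it relies only on the shape described in this
    thread and does not call any AIStore API.
    """
    for shard_name, files in mapping.items():
        if not isinstance(shard_name, str):
            raise TypeError(f"shard name must be a string: {shard_name!r}")
        if not isinstance(files, list) or not all(isinstance(f, str) for f in files):
            raise TypeError(f"files for {shard_name!r} must be a list of strings")
    return json.dumps(mapping, indent=2)


# Same content as the example above; a file may appear in more than one shard.
order = {
    "shard-sep-1.tar": ["file1.img", "file2.img", "file3.img"],
    "shard-set-2.tar": ["file3.img", "file4.img", "file5.img"],
}
payload = encode_order_mapping(order)
```

Validating on the client side keeps malformed mappings from reaching the server, though the server would presumably still need its own checks.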
Thanks, I will try to work on that.
It took much longer than expected but the support for JSON order file has been added.
Starting a dsort shuffle/sort with a custom order file becomes difficult when working within a VPC that doesn't support HTTPS. Currently, custom training jobs cannot easily request an ordering without having to manage the upload of an OrderFile. Ideally, we wouldn't want to push to object storage and use HTTPS requests to pass it to the dsort API without creating different configurations for the aistore permissions. It may be useful to allow this to be requested from the API/SDK with a JSON object that encodes this data! This would also greatly simplify custom batch balancing for resampling methods that are not easily configurable from content or file name sorting (or the outlined OrderFile process). Right now, this behavior can be achieved with individual xactions, but there can be a considerable performance benefit to marshaling the request directly in JSON and letting dsort manage the memory allocations, so the action is processed faster than a series of xactions (especially for a large dataset).
Is there an easy way to do the above within the current setup?