ryxli opened this issue 2 months ago

Tell us more about this new feature.

Is there any possibility with the current CRT / boto3 of GPU Direct to S3? Wondering if it is possible to skip the S3 -> CPU -> GPU hop with torch.load, or if there is already functionality that supports this.
Hello @ryxli, thank you for your interest in s3-connector-for-pytorch. Currently, with the mountpoint-s3-client library that we use for communication with S3, it is not possible to load data from S3 directly into GPU memory, skipping the CPU step. The torch.load function, per the documentation, deserializes the data on the CPU before loading it into tensors.
However, you can use the map_location parameter of torch.load to move the deserialized data onto the GPU right after the CPU step. For example:
```python
import torch
from s3torchconnector import S3Checkpoint

checkpoint = S3Checkpoint(region="us-east-1")  # use your bucket's region
with checkpoint.reader("s3://bucket_name/checkpoint.chk") as reader:
    model.load_state_dict(torch.load(reader, map_location=torch.device("cuda:0")))  # or another device id
```
This approach still involves the CPU step but allows you to load the data onto the GPU immediately after deserialization.
Please let us know if this answers your question and if there is anything else we can do to help.
Thanks, I know about the torch.load / torch.save API. In 99% of cases today, checkpointing is composed of D2H -> serialization (CPU) -> dump to storage (filesystem or S3), and the inverse for loading.
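For concreteness, the conventional path looks roughly like this with your S3Checkpoint API (a sketch; the bucket, region, and model are placeholders):

```python
import torch
import torch.nn as nn
from s3torchconnector import S3Checkpoint

model = nn.Linear(8, 8).to("cuda")  # placeholder model
checkpoint = S3Checkpoint(region="us-east-1")  # placeholder region

# Save: D2H copy happens while pickling the CUDA tensors, then the bytes stream to S3.
with checkpoint.writer("s3://bucket_name/checkpoint.chk") as writer:
    torch.save(model.state_dict(), writer)

# Load: bytes stream from S3, are unpickled on the CPU, then copied H2D via map_location.
with checkpoint.reader("s3://bucket_name/checkpoint.chk") as reader:
    model.load_state_dict(torch.load(reader, map_location=torch.device("cuda:0")))
```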
From your answer, it's currently not supported, but I was asking whether your team / the Mountpoint team has any items on the roadmap to look into https://docs.nvidia.com/gpudirect-storage/overview-guide/index.html for direct memory transfer between device and storage (or S3 in this case), hence skipping the CPU step.
At the moment, there are no immediate plans to investigate GPU Direct Storage integration for the s3-connector-for-pytorch project. However, I appreciate you raising this idea, as it could potentially benefit certain use cases.
I'm curious to understand the rationale behind your suggestion better. My understanding is that torch.load and torch.save require CPU processing for serialization/deserialization. If that's the case, would the intention be to turn off object compression to skip the CPU step? While this could potentially reduce CPU overhead, it may also result in larger object sizes and longer download times from S3.
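For what it's worth, my understanding is that torch.save's zip container stores tensor payloads uncompressed by default, which can be checked directly (a quick sanity-check snippet, not project code):

```python
import io
import zipfile
import torch

# Serialize a tensor, then inspect the resulting zip container.
buf = io.BytesIO()
torch.save(torch.arange(1024), buf)
buf.seek(0)
with zipfile.ZipFile(buf) as zf:
    for info in zf.infolist():
        # compress_type == ZIP_STORED means the entry was written uncompressed.
        print(info.filename, info.compress_type == zipfile.ZIP_STORED)
```

If so, the CPU cost on load would be dominated by unpickling and memory copies rather than decompression.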
Alternatively, if the goal is to bypass torch.load and torch.save altogether when transferring data directly to the GPU, could you please elaborate on the tools or approaches you have in mind? Understanding the specific use case and requirements would help evaluate the feasibility and potential impact of exploring this feature.
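As a purely hypothetical sketch (none of this exists in s3torchconnector today), "bypassing" torch.load could mean storing raw tensor bytes with the dtype/shape metadata kept out of band, so that deserialization reduces to a plain copy, which is the kind of transfer GDS could in principle target:

```python
import torch

# Hypothetical helpers for illustration only; standard dtypes assumed.
def save_raw(tensor: torch.Tensor) -> bytes:
    # Raw contiguous bytes; dtype and shape must be recorded separately.
    return tensor.detach().contiguous().cpu().numpy().tobytes()

def load_raw(buf: bytes, dtype: torch.dtype, shape, device: str) -> torch.Tensor:
    # No pickle step: reinterpret the bytes, then copy host-to-device.
    # An H2D copy like this is what direct storage-to-GPU I/O would elide.
    cpu = torch.frombuffer(bytearray(buf), dtype=dtype).reshape(shape)
    return cpu.to(device)

t = torch.randn(2, 3)
assert torch.equal(t, load_raw(save_raw(t), t.dtype, t.shape, "cpu"))
```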
Regarding GPU Direct Storage, my understanding is that it enables direct data transfer between GPU memory and storage devices by leveraging specialized storage drivers that support GPU-accelerated file operations (cuFile* primitives) at the kernel level. Could you please confirm if this high-level understanding is correct?
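To illustrate my understanding: for a local file system with a GDS-capable driver, NVIDIA's KvikIO library exposes the cuFile API from Python. A rough sketch (assuming kvikio and cupy are installed; the path and buffer size are hypothetical):

```python
import cupy
import kvikio

# Destination buffer allocated directly in GPU memory.
gpu_buf = cupy.empty(1024, dtype=cupy.uint8)

# cuFile read: with a GDS-capable driver this DMAs storage -> GPU,
# bypassing a host bounce buffer; otherwise KvikIO falls back to POSIX I/O.
f = kvikio.CuFile("/mnt/nvme/checkpoint.bin", "r")
f.read(gpu_buf)
f.close()
```

Note that cuFile operates on local file descriptors, which is part of why mapping this model directly onto an object store like S3 is non-trivial.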