NVIDIA / aistore

AIStore: scalable storage for AI applications
https://aiatscale.org
MIT License

Questions on the ETL tutorial example #144

Closed Hastyrush closed 11 months ago

Hastyrush commented 11 months ago

Hi,

I was following the 3-part tutorial posted at https://aiatscale.org/blog/2023/05/05/aisio-transforms-with-webdataset-pt-1

My question is: was the example designed to work with Kubernetes only? I tried running a local single-node cluster using Minikube, as documented in the deployment documentation (https://github.com/NVIDIA/aistore/blob/master/deploy/dev/k8s/README.md). With that setup, the ETL that was supposed to run on the storage cluster's compute is instead happening on the same local machine that is also doing the data fetching.

The result is that on calling batch = next(iter(dataloader)), as written in https://aiatscale.org/blog/2023/06/09/aisio-transforms-with-webdataset-pt-3, the pipeline runs extremely slowly when fetching batches, possibly due to CPU contention between the ETL processing and the data-fetching pipeline?

This does not happen when the ETL is removed from the WebDataset creation.
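Not code from the tutorial, just a stdlib stand-in for the situation described above: when the transform runs on the fetching machine, every batch pays the transform cost in the same process that is pulling data; with a cluster-side ETL, samples would arrive already transformed and the loader loop would only pay for fetching. All names here (`fetch_shard`, `transform`, the loaders) are illustrative.

```python
from typing import Iterator, List


def fetch_shard(n: int) -> List[bytes]:
    # Stand-in for pulling raw samples out of a WebDataset shard.
    return [bytes([i % 256]) * 64 for i in range(n)]


def transform(sample: bytes) -> bytes:
    # Stand-in for the ETL step (here: a trivial byte flip).
    return bytes(255 - b for b in sample)


def local_etl_loader(n: int) -> Iterator[bytes]:
    # ETL on the fetching machine: the same process transforms AND fetches,
    # so the two compete for local CPU.
    for sample in fetch_shard(n):
        yield transform(sample)


def remote_etl_loader(pre_transformed: List[bytes]) -> Iterator[bytes]:
    # Cluster-side ETL: samples arrive already transformed; the client
    # loop does no transform work at all.
    yield from pre_transformed


# Both paths yield identical samples; only WHERE the work happens differs.
cluster_side = [transform(s) for s in fetch_shard(4)]  # done "remotely"
batch = next(iter(remote_etl_loader(cluster_side)))
```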

Thanks!

aaronnw commented 11 months ago

Hi!

Yes, the ETL functionality requires a Kubernetes deployment. This is because each transformer process is deployed as a separate pod within the k8s cluster.

ETL really only gives a performance benefit when the cluster is remote. The improvement comes from the compute being local to the data: the transform uses the typically idle storage-cluster CPU instead of consuming any of your local resources. So it will work with a local minikube, but in that case it does not really optimize anything.

When fetching a batch, because this ETL transforms an entire shard, you might see some slowness. How much will depend on a number of factors, though.
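For reference, the pod-based deployment described above looks roughly like this from the Python SDK side (a sketch, not the tutorial's code; the transform, the endpoint, and the ETL name are placeholders, and the exact SDK signatures may differ between versions). The key point is that the transform function is shipped to the cluster and runs there, next to the data, rather than on the fetching machine:

```python
def uppercase(data: bytes) -> bytes:
    # Placeholder transform: this code would execute inside the ETL pod,
    # on the storage cluster's compute.
    return data.upper()


def deploy_etl(endpoint: str, etl_name: str = "my-etl"):
    # Assumed SDK usage: actually running this requires a k8s-backed
    # AIStore cluster reachable at `endpoint`.
    from aistore.sdk import Client  # imported lazily; needs the aistore package

    client = Client(endpoint)
    etl = client.etl(etl_name)
    etl.init_code(transform=uppercase)  # deploys the transform as its own pod
    return etl
```

Objects read through the ETL then come back already transformed, which is why there is no CPU contention on the client when the cluster is remote.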

Hastyrush commented 11 months ago

Hi Aaron,

Thanks for the clarification! It helped a lot in understanding why Kubernetes and a separate storage compute cluster are required. Will be closing this issue!