Closed · kmatzen closed this issue 1 year ago
> then a new object is put into the bucket
A new object is put into a new (destination) bucket. Offline ETL works like a bucket-to-bucket copy, where the destination may or may not already exist, but it is a different bucket. Implementation-wise, skipping over various involved abstractions ("mountpath", "jogger", "xaction") and details: there is one thread per data disk that traverses its portion of the source bucket's content and uses the offline ETL reader to generate the transformed objects. In that sense, a regular bucket copy is a trivial special case of the above.
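The data flow above can be sketched as follows. This is a minimal, hypothetical model (not AIStore code): buckets are stand-in dicts mapping object name to bytes, and the real per-disk traversal threads are collapsed into a single loop, but the copy-with-a-transform shape is the same.

```python
# Minimal model of offline ETL as "bucket-to-bucket copy with a transform
# in between". Dicts (name -> bytes) stand in for buckets; the real
# implementation runs one traversal thread per data disk.

def offline_etl(src: dict, transform) -> dict:
    """Traverse every object in src, apply transform, write into a new dst."""
    dst = {}
    for name, data in src.items():
        dst[name] = transform(data)
    return dst

src = {"a": b"hello", "b": b"world"}
copied = offline_etl(src, lambda d: d)   # plain bucket copy: identity transform
upper = offline_etl(src, bytes.upper)    # transformed dataset in a new bucket
```

Note how the plain copy falls out as the special case where the transform is the identity.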
You can then do whatever you like with the resulting (transformed) dataset. It is just another bucket, no different from any other in terms of the management policies (eviction, redundancy, etc.) that can be applied. You can train your model with it, once or many times, and then remove it, partially or entirely, and so on.
> will a subsequent execution of the ETL automatically skip over results that have already been computed or will it materialize an entirely new set of outputs?
If the next ETL run covers the entire bucket, the latter will happen. If you want a smaller ETL run, offline single-object and multi-object ETL are supported as well.
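Continuing the hypothetical dict-as-bucket sketch from above, a smaller multi-object run would look like this: only the named objects are transformed and written into an existing destination, instead of traversing the whole source bucket.

```python
# Hypothetical subset (multi-object) run: transform only the named objects
# and write them into an existing destination. Dicts stand in for buckets.

def etl_objects(src: dict, names, transform, dst: dict) -> dict:
    for name in names:
        dst[name] = transform(src[name])
    return dst

src = {"a": b"one", "b": b"two", "c": b"three"}
dst = {}
etl_objects(src, ["b", "c"], bytes.upper, dst)  # only "b" and "c" are processed
```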
> What about if an object is deleted from the source bucket? Will the corresponding output from the ETL be deleted?
No, not automatically. To put it simply, as Alex said, it's really just a copy with an operation in between source and destination.
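To make that lifecycle concrete, here is the same hypothetical dict-as-bucket sketch again: a full re-run regenerates every output from the current source contents, and deleting a source object does not remove its previously transformed counterpart from the destination.

```python
# Lifecycle sketch: full re-runs re-transform everything; source deletions
# are not propagated to the destination. Dicts stand in for buckets.

def offline_etl(src, transform, dst=None):
    dst = {} if dst is None else dst
    for name, data in src.items():
        dst[name] = transform(data)  # every current source object is re-done
    return dst

src = {"a": b"x", "b": b"y"}
dst = offline_etl(src, bytes.upper)       # first full run
del src["a"]                              # delete from the source bucket
dst = offline_etl(src, bytes.upper, dst)  # second full run
stale = "a" in dst                        # the old output for "a" remains
```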
I'm trying to get a better idea of the lifecycle of a bucket's contents with respect to an offline ETL. For example, if an ETL has been executed on the entire bucket and has completed, then a new object is put into the bucket, will a subsequent execution of the ETL automatically skip over results that have already been computed, or will it materialize an entirely new set of outputs? What about if an object is deleted from the source bucket? Will the corresponding output from the ETL be deleted? If it does reuse results, is there a way to invalidate them, if necessary?