Open aronchick opened 11 months ago
Here's a bit more context:
As a developer, data scientist, and ML engineer with Kubernetes experience, I have always found it a pain in the proverbial butt to manage both data and images. A Kubernetes compute cluster is not enough: you also need some sort of blob storage (S3), a Docker registry, and secret management to glue all three together.
Since the Bacalhau network has access to IPFS storage, which is where the data ends up anyway, there is no reason not to provide a super easy, straightforward way past these hurdles. Here is the use case I envision:
As a developer, I have built a Docker image containing my code, and I have my dataset locally. Let's say the image is called my-compute-image and the data is structured like so:
./dataset/
partition1/ ....files
partition2/ ...files
...
partition100/ ...files
I've tested my code locally, and it runs for a single partition:
docker run -v "$(pwd)/dataset/partition1":/data my-compute-image superb_computation_script.py --input /data
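For comparison, here is a minimal sketch of what covering every partition looks like without any Bacalhau support: a plain loop on one machine. The function name run_partitions and the DOCKER override variable are made up for illustration; the image and script names are the ones from the example above, and the image is assumed to accept the script as its command.

```shell
# Hypothetical helper: run the container once per partition directory.
# Set DOCKER=echo to print the commands instead of executing them.
run_partitions() {
  local dataset_dir="$1" image="$2"
  local runner="${DOCKER:-docker}"
  local part
  for part in "$dataset_dir"/partition*/; do
    # Skip the unexpanded glob when no partition directories exist.
    [ -d "$part" ] || continue
    # Bind mounts need an absolute host path, so resolve it first.
    "$runner" run --rm -v "$(cd "$part" && pwd)":/data \
      "$image" superb_computation_script.py --input /data
  done
}
```

Calling `run_partitions ./dataset my-compute-image` processes the partitions one after another in glob (lexicographic) order, which is exactly the serial bottleneck the rest of this proposal is trying to remove.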
It works locally, perfect. Now let's run it on the entire dataset.
bacalhau push image my-compute-image --name my-compute-image
bacalhau push data ./dataset --name my-dataset
bacalhau docker run \
--id-only \
--wait \
--input bacalhau://my-expanso-username/data/my-dataset:/data \
bacalhau://my-expanso-username/images/my-compute-image \
superb_computation_script.py --input /data
This would lower the barrier to entry dramatically and let people handle their data and code securely, without needing to poke holes in their firewall so that Bacalhau nodes somewhere can pull Docker images from non-public registries.
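To make the addressing in the proposed command concrete, here is a rough sketch of how such a reference could be split into its parts. The bacalhau:// scheme is only the proposal from this issue, not an existing Bacalhau feature, and both the function name parse_bacalhau_ref and the owner/kind/name[:mount] layout are assumptions read off the two example references.

```shell
# Hypothetical parser for the proposed reference form:
#   bacalhau://<owner>/<kind>/<name>[:<mount path>]
parse_bacalhau_ref() {
  local ref owner rest kind name mount
  ref="${1#bacalhau://}"
  # Reject anything that does not start with the proposed scheme.
  [ "$ref" != "$1" ] || return 1
  mount=""
  case "$ref" in
    *:*) mount="${ref#*:}"; ref="${ref%%:*}" ;;
  esac
  owner="${ref%%/*}"
  rest="${ref#*/}"
  kind="${rest%%/*}"
  name="${rest#*/}"
  printf 'owner=%s\nkind=%s\nname=%s\nmount=%s\n' \
    "$owner" "$kind" "$name" "$mount"
}
```

With this layout, the data reference carries its own mount point (`:/data`), while the image reference has none, which mirrors how the two are used in the command.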
Why not go even further, e.g. bacalhau push model? Given the design of IPFS and IPLD, in particular ADLs, I guess we could think about providing versioning ("git for data") options powered by "prolly trees"; see:
This is what services like Dolt and lakeFS do. Similarly, Bacalhau could provide "private object repositories".
Regarding images, maybe https://github.com/ipdr/ipdr is useful. I also kind of like https://github.com/uber/kraken
Very good feature!
Private registry:
bacalhau push image
Private data:
bacalhau push data
- allow people to push data to a data storage location that we manage (and streamline into the job itself)