bacalhau-project / bacalhau

Compute over Data framework for public, transparent, and optionally verifiable computation
https://docs.bacalhau.org
Apache License 2.0

Private Registry & Data #2698

Open aronchick opened 11 months ago

aronchick commented 11 months ago

Private registry:

Private data:

inc0 commented 11 months ago

Here's a bit more context:

As a developer, data scientist and ml engineer with kubernetes experience, I have always found managing both data and images a pain in the proverbial butt. Having a kubernetes compute cluster is not enough: you also need some sort of blob storage (S3), a docker registry, and secret management to glue all three together.

Since the bacalhau network has access to ipfs storage, and that is where the data ends up anyway, there is no reason not to provide a super easy and straightforward way to handle these hurdles. Here is the use case I envision:

As a developer, I have built a docker image containing my code, and I have my dataset locally. Let's say the image is called my-compute-image and the data is structured like so:

./dataset/
  partition1/ ....files
  partition2/ ...files
  ...
  partition100/ ...files

I've tested my code locally and it runs for a single partition:

docker run -v "$(pwd)/dataset/partition1":/data my-compute-image superb_computation_script.py --input /data
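Before moving to the whole dataset, the same local check can be repeated per partition. A minimal sketch, assuming the ./dataset/partitionN layout shown above; the helper name print_partition_runs is hypothetical, and it only prints the docker commands rather than running them, so the output can be reviewed first:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (not run) one local `docker run` command per
# partition directory. Pipe the output to `sh` to actually execute it.
print_partition_runs() {
    local dataset_dir="$1" image="$2"
    local part abs
    for part in "$dataset_dir"/partition*/; do
        [ -d "$part" ] || continue            # skip when the glob matches nothing
        abs="$(cd "$part" && pwd)"            # docker -v needs an absolute host path
        printf 'docker run -v "%s":/data %s superb_computation_script.py --input /data\n' \
            "$abs" "$image"
    done
}

print_partition_runs ./dataset my-compute-image
```

With 100 partitions this prints 100 commands, which is exactly the bookkeeping the proposal below would hand over to bacalhau.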

It works locally, perfect. Now let's run it on the entire dataset.

bacalhau push image my-compute-image --name my-compute-image
bacalhau push data ./dataset --name my-dataset
bacalhau docker run \
    --id-only \
    --wait \
    --input bacalhau://my-expanso-username/data/my-dataset:/data \
    bacalhau://my-expanso-username/images/my-compute-image \
    superb_computation_script.py --input /data
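To make the addressing in these commands concrete, here is a rough sketch of how a reference of the form bacalhau://<username>/<kind>/<name> could be split into its parts. Both the bacalhau:// scheme and the push subcommands are only what this issue proposes, not existing bacalhau features; the function name parse_bacalhau_ref and the assumption that <kind> is either data or images are mine. Any :/data mount suffix would need to be stripped before calling it.

```shell
#!/usr/bin/env bash
# Hypothetical: split a proposed bacalhau://<username>/<kind>/<name> reference.
# The scheme comes from this proposal only; it is not an existing bacalhau API.
parse_bacalhau_ref() {
    local ref="$1" rest user kind name
    case "$ref" in
        bacalhau://*) rest="${ref#bacalhau://}" ;;   # drop the scheme prefix
        *) echo "not a bacalhau:// reference: $ref" >&2; return 1 ;;
    esac
    case "$rest" in
        */*/*) ;;                                    # need at least three segments
        *) echo "expected <username>/<kind>/<name>: $ref" >&2; return 1 ;;
    esac
    user="${rest%%/*}"      # text before the first slash
    rest="${rest#*/}"       # drop "<username>/"
    kind="${rest%%/*}"      # "data" or "images" (assumed set of kinds)
    name="${rest#*/}"       # everything after "<kind>/"
    case "$kind" in
        data|images) ;;
        *) echo "unknown kind '$kind' in: $ref" >&2; return 1 ;;
    esac
    if [ -z "$user" ] || [ -z "$name" ]; then
        echo "empty username or name in: $ref" >&2
        return 1
    fi
    printf '%s %s %s\n' "$user" "$kind" "$name"
}

parse_bacalhau_ref "bacalhau://my-expanso-username/data/my-dataset"
# prints: my-expanso-username data my-dataset
```

Keeping the owner and the kind in the reference means a node could resolve images and datasets through the same lookup, without knowing about any external registry.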

This would lower the barrier to entry dramatically and let people handle their data and code securely, without needing to poke holes in their firewall just so that bacalhau nodes somewhere can pull docker images from non-public registries.

LeonardAukea commented 11 months ago

Why not go even further, i.e. bacalhau push model? Given the design of ipfs and ipld, in particular ADL, I guess we could think about providing versioned "git for data" options powered by "prolly trees", see:

This is what services like dolt and lakefs do. Similarly, bacalhau could provide "private object repositories".

Regarding images, maybe this is useful: https://github.com/ipdr/ipdr. I also kind of like this: https://github.com/uber/kraken

aronchick commented 6 months ago

Very good feature!