Kelpie shepherds long-running jobs through to completion on interruptible hardware, coordinating with the Kelpie API
Kelpie is for anyone who wants to run long-running, compute-intensive jobs on Salad, the world's largest distributed GPU cloud. Whether that's LoRA training, Monte Carlo simulations, molecular dynamics simulations, or anything else, Kelpie can help you run your jobs to completion, even if they take days or weeks. You bring your own Docker container that contains your script and dependencies, add the Kelpie binary to it, and deploy.
If you'd like to join the Kelpie beta and are an existing Salad customer, just reach out to your point of contact via email, Discord, or Slack. If you're interested in Kelpie and are new to Salad, sign up for a demo and mention that you're interested in using Kelpie.
Kelpie is a standalone binary that runs in your container image. It coordinates with the Kelpie API to download your input data, upload your output data, and sync progress checkpoints to your S3-compatible storage. You submit jobs to the Kelpie API, and those jobs get assigned to Salad worker nodes that have the Kelpie binary installed.
If you define scaling rules, the Kelpie API will handle starting and stopping your container group, and scaling it up and down in response to job volume.
When a job is assigned to a worker, the worker downloads your input data and your checkpoint, then runs your command with the provided arguments and environment variables. When files are added to a directory defined in your job definition, Kelpie uploads them to the bucket and prefix you've defined. When your command exits successfully, the output directory you defined is uploaded to your storage, the job is marked as complete, and a webhook is sent to the URL you've provided, if any.
You can find a working example here
# Start with a base image that has the dependencies you need,
# and can successfully run your script.
FROM yourimage:yourtag
# Add the kelpie binary to your container image
ADD https://github.com/SaladTechnologies/kelpie/releases/download/0.4.3/kelpie /kelpie
RUN chmod +x /kelpie
# Use kelpie as the "main" command. Kelpie will then execute your
# command with the provided arguments and environment variables
# from the job definition.
CMD ["/kelpie"]
When running the image, you will need additional configuration in the environment:
AWS_ACCESS_KEY_ID, etc.: credentials that enable the kelpie worker to upload and download from your bucket storage. We use the S3 compatibility API, so any S3-compatible storage should work.
KELPIE_API_URL: the root URL for the coordination API, e.g. kelpie.saladexamples.com
KELPIE_API_KEY: your API key for the coordination API, issued by Salad for use with kelpie. NOT your Salad API key.
Additionally, your script must support the following things:
Your script should periodically save progress checkpoints to CHECKPOINT_DIR, so that the job can be resumed if it gets interrupted. Similarly, when your script starts, it should check CHECKPOINT_DIR to see if there is anything to resume, and only start from the beginning if no checkpoint is present.
sync block in the job definition.
INPUT_DIR: Where to look for whatever data is needed as input. This will be downloaded from your bucket storage by kelpie prior to running the script.
CHECKPOINT_DIR: This is where to save progress checkpoints locally. kelpie will handle syncing the contents to your bucket storage, and will make sure any existing checkpoint is downloaded prior to running the script.
OUTPUT_DIR: This is where to save any output artifacts. kelpie will upload your artifacts to your bucket storage.
Upload your docker image to the container registry of your choice. Salad supports public and private registries, including Docker Hub, AWS ECR, and GitHub Container Registry, among others.
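To make the checkpoint requirements above concrete, here is a minimal sketch of a script that resumes from CHECKPOINT_DIR and writes its result to OUTPUT_DIR. The checkpoint file naming (checkpoint-<step>.json), the result.json output file, the placeholder "work", and the stop_after parameter (used only to simulate an interruption) are all assumptions made for illustration, not part of kelpie itself.

```python
import json
import os
import re
from pathlib import Path
from typing import Optional


def latest_checkpoint(checkpoint_dir: Path):
    """Return (step, state) from the newest checkpoint-<step>.json, or (0, None)."""
    best_step, best_path = -1, None
    for path in checkpoint_dir.glob("checkpoint-*.json"):
        match = re.fullmatch(r"checkpoint-(\d+)\.json", path.name)
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    if best_path is None:
        return 0, None
    return best_step, json.loads(best_path.read_text())


def save_checkpoint(checkpoint_dir: Path, step: int, state: dict) -> None:
    """Write to a temporary name, then rename, so a partial file is never resumed from."""
    tmp = checkpoint_dir / f".tmp-{step}"
    tmp.write_text(json.dumps(state))
    os.replace(tmp, checkpoint_dir / f"checkpoint-{step}.json")


def run(total_steps: int = 10, checkpoint_every: int = 2,
        stop_after: Optional[int] = None) -> int:
    # Directory locations come from the environment, as described above.
    checkpoint_dir = Path(os.environ["CHECKPOINT_DIR"])
    output_dir = Path(os.environ["OUTPUT_DIR"])
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Resume if a checkpoint exists; otherwise start from the beginning.
    step, state = latest_checkpoint(checkpoint_dir)
    total = state["total"] if state else 0

    while step < total_steps:
        if stop_after is not None and step >= stop_after:
            return total  # simulated interruption (hypothetical, for testing only)
        total += step  # placeholder for one unit of real work
        step += 1
        if step % checkpoint_every == 0:
            save_checkpoint(checkpoint_dir, step, {"total": total})

    # Final artifact goes to OUTPUT_DIR; returning normally lets the process exit with code 0.
    (output_dir / "result.json").write_text(json.dumps({"total": total}))
    return total
```

In a real script you would call run() at the bottom of the file. Writing each checkpoint under a temporary name and renaming it afterwards is a design choice that keeps a half-written file from being mistaken for a valid checkpoint after an interruption.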
You can deploy your container group using the Salad API, or via the Salad Portal. You will need to add the kelpie salad user (currently shawn.rushefsky@salad.com) to your organization to enable the scaling features of kelpie. Kelpie uses the Salad API to start, stop, and scale your container group in response to job volume.
In your container group configuration, you will provide the docker image url, the hardware configuration needed by your job, and the environment variables detailed above. You do not need to enable Container Gateway, or Job Queues, and you do not need to configure probes. While Salad does offer built-in logging, it is still recommended to connect an external logging service for more advanced features.
Once your container group is deployed, and you've verified that the node starts and runs successfully, you'll want to retrieve the container group ID from the Salad API. You will use this ID when submitting jobs to the Kelpie API.
There are live swagger docs that should be considered more accurate and up to date than this readme: https://kelpie.saladexamples.com/docs
Your kelpie api key is used by you to submit work, and also by kelpie workers to pull and process work.
All requests to the Kelpie API must include the header:
X-Kelpie-Key: myapikey
All API requests should use a base url of https://kelpie.saladexamples.com.
Queueing a job for processing is a POST request to the Kelpie API. You must provide a command to run, and optionally arguments, environment variables, and sync instructions. A job must also be assigned to a specific container group, using the container group id. You can get your container group id from the Salad API. You can optionally provide a webhook url to receive status updates about your job.
POST /jobs
Request Body
{
  "command": "python",
  "arguments": [
    "/path/to/main.py",
    "--arg",
    "value"
  ],
  "environment": { "SOME_VAR": "string" },
  "webhook": "https://myapi.com/kelpie-webhooks",
  "container_group_id": "97f504e8-6de6-4322-b5d5-1777a59a7ad3",
  "sync": {
    "before": [
      {
        "bucket": "string",
        "prefix": "string",
        "local_path": "string",
        "direction": "download",
        "pattern": "checkpoint-*"
      }
    ],
    "during": [
      {
        "bucket": "string",
        "prefix": "string",
        "local_path": "string",
        "direction": "upload",
        "pattern": "*.bin"
      }
    ],
    "after": [
      {
        "bucket": "string",
        "prefix": "string",
        "local_path": "string",
        "direction": "upload",
        "pattern": "*.bin"
      }
    ]
  }
}
Response Body
{
  "id": "8b9c902c-7da6-4af3-be0b-59cd4487895a",
  "user_id": "your-user-id",
  "status": "pending",
  "created": "2024-04-19T18:53:31.000Z",
  "started": null,
  "completed": null,
  "canceled": null,
  "failed": null,
  "command": "python",
  "arguments": [
    "/path/to/main.py",
    "--arg",
    "value"
  ],
  "environment": { "SOME_VAR": "string" },
  "webhook": "https://myapi.com/kelpie-webhooks",
  "heartbeat": null,
  "num_failures": 0,
  "container_group_id": "97f504e8-6de6-4322-b5d5-1777a59a7ad3",
  "machine_id": null,
  "sync": {
    "before": [
      {
        "bucket": "string",
        "prefix": "string",
        "local_path": "string",
        "direction": "download",
        "pattern": "checkpoint-*"
      }
    ],
    "during": [
      {
        "bucket": "string",
        "prefix": "string",
        "local_path": "string",
        "direction": "upload",
        "pattern": "*.bin"
      }
    ],
    "after": [
      {
        "bucket": "string",
        "prefix": "string",
        "local_path": "string",
        "direction": "upload",
        "pattern": "*.bin"
      }
    ]
  }
}
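As a rough illustration of the request described above, the following sketch builds and sends a POST /jobs request using only the Python standard library. The function names build_job_request and submit_job are hypothetical, and sending a Content-Type: application/json header is an assumption; check the live swagger docs for the authoritative request format.

```python
import json
import urllib.request

BASE_URL = "https://kelpie.saladexamples.com"  # base URL given in this README


def build_job_request(api_key, container_group_id, command, arguments=None,
                      environment=None, sync=None, webhook=None):
    """Build (but do not send) a POST /jobs request. Hypothetical helper."""
    body = {
        "command": command,
        "arguments": arguments or [],
        "environment": environment or {},
        "container_group_id": container_group_id,
    }
    # Optional fields are only included when provided.
    if sync is not None:
        body["sync"] = sync
    if webhook is not None:
        body["webhook"] = webhook
    return urllib.request.Request(
        f"{BASE_URL}/jobs",
        data=json.dumps(body).encode("utf-8"),
        # Content-Type is an assumption; X-Kelpie-Key is required by the API.
        headers={"X-Kelpie-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def submit_job(request):
    """Send the request and return the decoded job record from the response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Separating request construction from sending makes it easy to inspect or log the payload before any network call is made. A typical call would be submit_job(build_job_request("myapikey", "your-container-group-id", "python", ["/path/to/main.py"])).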
You can cancel a job using the job id:
DELETE /jobs/:id
Response Body
{
"message": "Job canceled"
}
As mentioned above, Kelpie does not monitor the progress of your job, but it does track the status (pending, running, canceled, completed, failed). You can get a job using the job id:
GET /jobs/:id
Response Body
{
  "id": "8b9c902c-7da6-4af3-be0b-59cd4487895a",
  "user_id": "your-user-id",
  "status": "pending",
  "created": "2024-04-19T18:53:31.000Z",
  "started": null,
  "completed": null,
  "canceled": null,
  "failed": null,
  "command": "python",
  "arguments": [
    "/path/to/main.py",
    "--arg",
    "value"
  ],
  "webhook": "https://myapi.com/kelpie-webhooks",
  "heartbeat": null,
  "num_failures": 0,
  "container_group_id": "97f504e8-6de6-4322-b5d5-1777a59a7ad3",
  "machine_id": null,
  "sync": {
    "before": [
      {
        "bucket": "string",
        "prefix": "string",
        "local_path": "string",
        "direction": "download",
        "pattern": "checkpoint-*"
      }
    ],
    "during": [
      {
        "bucket": "string",
        "prefix": "string",
        "local_path": "string",
        "direction": "upload",
        "pattern": "*.bin"
      }
    ],
    "after": [
      {
        "bucket": "string",
        "prefix": "string",
        "local_path": "string",
        "direction": "upload",
        "pattern": "*.bin"
      }
    ]
  }
}
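Since the API reports only job status, a client that does not use webhooks may want to poll GET /jobs/:id until the job reaches a final state. The sketch below assumes that completed, canceled, and failed are terminal statuses; the helper names, the polling interval, and the injected fetch_job callable are hypothetical choices made so the loop can be tested without a network.

```python
import json
import time
import urllib.request

BASE_URL = "https://kelpie.saladexamples.com"  # base URL given in this README
TERMINAL_STATUSES = {"completed", "canceled", "failed"}  # assumed to be final states


def make_http_fetcher(api_key):
    """Return a function that fetches one job record via GET /jobs/:id."""
    def fetch_job(job_id):
        request = urllib.request.Request(
            f"{BASE_URL}/jobs/{job_id}",
            headers={"X-Kelpie-Key": api_key},
            method="GET",
        )
        with urllib.request.urlopen(request) as response:
            return json.load(response)
    return fetch_job


def wait_for_job(fetch_job, job_id, poll_seconds=30.0, max_polls=None, sleep=time.sleep):
    """Poll until the job reaches a terminal status, then return the job record."""
    polls = 0
    while True:
        job = fetch_job(job_id)
        polls += 1
        if job["status"] in TERMINAL_STATUSES:
            return job
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError(f"job {job_id} still {job['status']} after {polls} polls")
        sleep(poll_seconds)
```

Usage would look like wait_for_job(make_http_fetcher("myapikey"), job_id). For jobs that run for days, a webhook is likely a better fit than polling.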
Get your jobs in bulk.
GET /jobs
Query Parameters
All query parameters for this endpoint are optional.
name | description | default |
---|---|---|
status | pending, running, completed, canceled, failed | none |
container_group_id | query only jobs assigned to a specific container group | none |
page_size | How many jobs to return per page | 100 |
page | Which page of jobs to query | 1 |
asc | Boolean. Sort by created, ascending | false |
Response Body
{
  "_count": 1,
  "jobs": [
    {
      "id": "8b9c902c-7da6-4af3-be0b-59cd4487895a",
      "user_id": "your-user-id",
      "status": "pending",
      "created": "2024-04-19T18:53:31.000Z",
      "started": null,
      "completed": null,
      "canceled": null,
      "failed": null,
      "command": "python",
      "arguments": [
        "/path/to/main.py",
        "--arg",
        "value"
      ],
      "webhook": "https://myapi.com/kelpie-webhooks",
      "heartbeat": null,
      "num_failures": 0,
      "container_group_id": "97f504e8-6de6-4322-b5d5-1777a59a7ad3",
      "machine_id": null,
      "sync": {
        "before": [
          {
            "bucket": "string",
            "prefix": "string",
            "local_path": "string",
            "direction": "download",
            "pattern": "checkpoint-*"
          }
        ],
        "during": [
          {
            "bucket": "string",
            "prefix": "string",
            "local_path": "string",
            "direction": "upload",
            "pattern": "*.bin"
          }
        ],
        "after": [
          {
            "bucket": "string",
            "prefix": "string",
            "local_path": "string",
            "direction": "upload",
            "pattern": "*.bin"
          }
        ]
      }
    }
  ]
}
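The query parameters above can be combined to walk through all matching jobs page by page. This is a sketch only: the helper names are hypothetical, encoding asc as the strings true/false is an assumption, and the loop stops when a page comes back with fewer than page_size jobs rather than relying on _count, whose exact meaning is not documented here.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://kelpie.saladexamples.com"  # base URL given in this README


def jobs_url(status=None, container_group_id=None, page=1, page_size=100, asc=False):
    """Build the GET /jobs URL with the optional query parameters."""
    params = {"page": page, "page_size": page_size, "asc": str(asc).lower()}
    if status is not None:
        params["status"] = status
    if container_group_id is not None:
        params["container_group_id"] = container_group_id
    return f"{BASE_URL}/jobs?{urllib.parse.urlencode(params)}"


def make_page_fetcher(api_key):
    """Return a function that performs the GET and decodes the JSON response."""
    def fetch_page(url):
        request = urllib.request.Request(url, headers={"X-Kelpie-Key": api_key})
        with urllib.request.urlopen(request) as response:
            return json.load(response)
    return fetch_page


def iter_jobs(fetch_page, page_size=100, **filters):
    """Yield jobs across pages until a short (or empty) page is returned."""
    page = 1
    while True:
        jobs = fetch_page(jobs_url(page=page, page_size=page_size, **filters))["jobs"]
        yield from jobs
        if len(jobs) < page_size:
            return
        page += 1
```

For example, list(iter_jobs(make_page_fetcher("myapikey"), status="failed")) would collect every failed job. When the total is an exact multiple of page_size, this approach makes one extra request that returns an empty page.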
The kelpie worker polls the Kelpie API for available work at /work. In these requests, it includes some information about what Salad node you're on, including the machine id and container group id. This ensures we only hand out work to the correct container group, and that we do not hand out a job to a machine where that job has previously failed.
If you provide a url in the webhook field, the Kelpie API will send status webhooks. It makes a POST request to the url provided, with a JSON request body:
{
  "status": "running",
  "job_id": "some-job-id",
  "machine_id": "some-machine-id",
  "container_group_id": "some-container-group-id"
}
Webhook status may be running, failed, or completed.
Webhooks sent by the Kelpie API will be secured with your API token in the X-Kelpie-Key header.
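On the receiving side, a webhook handler would typically check the X-Kelpie-Key header before trusting the payload. The following framework-independent sketch shows one way to do that; the function name handle_webhook, the returned status codes, and the response messages are assumptions for illustration, and you would wire this into whatever web framework you already use.

```python
import hmac
import json

KNOWN_STATUSES = {"running", "failed", "completed"}  # statuses listed above


def handle_webhook(headers, body, expected_key):
    """Validate one webhook delivery and return (http_status, message)."""
    # HTTP header names are case-insensitive, so match without regard to case.
    supplied = next(
        (value for name, value in headers.items() if name.lower() == "x-kelpie-key"),
        "",
    )
    # Constant-time comparison avoids leaking key contents through timing.
    if not hmac.compare_digest(supplied.encode("utf-8"), expected_key.encode("utf-8")):
        return 401, "invalid key"
    try:
        event = json.loads(body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return 400, "body is not valid JSON"
    if not isinstance(event, dict) or event.get("status") not in KNOWN_STATUSES:
        return 400, "unexpected payload"
    return 200, f"job {event.get('job_id')} is {event['status']}"
```

Keeping the handler as a pure function of headers and body makes it straightforward to unit test before exposing it on a public URL.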