cortexlabs / cortex

Production infrastructure for machine learning at scale
https://cortexlabs.com/
Apache License 2.0
8.02k stars 607 forks source link

Can't remove a job with one worker using BatchAPI and DELETE request #1473

Closed lapaniku closed 4 years ago

lapaniku commented 4 years ago

Version

cli version: 0.20.0

Description

I've started a job with 1 worker and I can successfully get its status: $ cortex get imagesearch-batch 69bfbc4539100651 --env aws

job id: 69bfbc4539100651
status: running

start time: 22 Oct 2020 08:38:57 UTC
end time:   -
duration:   52m47s

batch stats
total   succeeded   failed   avg time per batch
100     5           0        10m8.08s

worker stats
requested   running   succeeded
1           1         0

job endpoint: https://hv5geqzdo9.execute-api.us-east-1.amazonaws.com/imagesearch-batch/69bfbc4539100651

job configuration
{
  "job_id": "69bfbc4539100651",
  "api_name": "imagesearch-batch",
  "workers": 1,
  "config": {
    "dest_s3_dir": "s3://..."
  },
  "api_id": "69bfbc5ab59b0d80--b2c011c906d3de3174508d257cd93b84",
  "spec_id": "b2c011c906d3de3174508d257cd93b84",
  "predictor_id": "6bdcd409fa472756e2a38468565cbc57325c6a323ca935da22dbd541bae05b1",
  "sqs_url": "https://sqs.us-east-1.amazonaws.com/.../1ee11af4ed-imagesearch-batch-69bfbc4539100651.fifo",
  "total_batch_count": 100,
  "start_time": "2020-10-22T08:38:57.025302536Z"
} 

Though when I tried to request : curl -X DELETE https://hv5geqzdo9.execute-api.us-east-1.amazonaws.com/imagesearch-batch/69bfbc4539100651 The result is: {"message":"Not Found"}

Which is a bit strange as was able to delete jobs with several workers before.

Configuration

- name: imagesearch-batch
  kind: BatchAPI
  predictor:
    type: python
    path: predictor.py
  compute:
    cpu: 1

Steps to reproduce

  1. Create a cluster with t3.small instances and default parameters min:1-max:5
  2. Submit a job with several batches but 1 worker so it's going to last for a while
  3. Try to delete this job using curl -X DELETE and API URL + jobib

Expected behavior

The corresponding job should be successfully stopped.

Actual behavior

Job continues to run after the DELETE request.

Stack traces

No issues with the job running:

2020-10-22 08:39:23.978178:cortex:pid-140:INFO:loading the predictor from predictor.py 2020-10-22 08:39:27.785006:cortex:pid-140:INFO:polling for batches... 2020-10-22 08:39:27.864598:cortex:pid-140:INFO:processing batch 4b320cba-a335-40f6-ba50-e301e96f4859 2020-10-22 08:49:37.230117:cortex:pid-140:INFO:processing batch 0a3df022-6aa0-4d77-acdc-d2c11f7229c1 2020-10-22 08:59:27.638303:cortex:pid-140:INFO:processing batch c573396b-00d6-4055-a870-9c74e669e904 2020-10-22 09:09:57.368275:cortex:pid-140:INFO:processing batch ab729014-459f-4604-8c3a-582cfd5fe089 2020-10-22 09:20:05.450738:cortex:pid-140:INFO:processing batch ef24b433-661c-4507-86f0-449dc01ac701 2020-10-22 09:30:08.732410:cortex:pid-140:INFO:processing batch ea7dbff5-f3e9-4db6-93e0-1dd5e890e2d6 2020-10-22 09:41:09.451085:cortex:pid-140:INFO:processing batch f926de1a-500a-45f2-9253-7c886fcd448f

Suggested solution

Might be connected with using only 1 worker as I was able to successfully delete jobs with 5 workers before.

vishalbollu commented 4 years ago

Thanks for bringing this to our attention. After investigating this issue, we believe that this bug could be caused by some non-deterministic behaviour during the API Gateway route setup. We are continuing to investigate this issue. In the meantime, you could try to deploy your BatchAPI by explicitly disabling API Gateway:

- name: api:
  kind: BatchAPI
  networking:
    api_gateway: none # the default is public