TotallyGatsby / DroneYard

AWS Batch based automation for OpenDroneMap.
GNU General Public License v3.0

AWS Batch jobs stuck in runnable state, compute environment becomes invalid #9

Open egs40 opened 3 months ago

egs40 commented 3 months ago

Thank you for your work on this project. I've encountered an issue while deploying the code following the instructions:

  1. Updated package.json to use "sst": "^2.41.4"
  2. Deployed using r7gd.4xlarge instances in the eu-west-1 region
  3. Uploaded images as instructed and started the workflow

Problem:

  - The AWS Batch dashboard shows a valid and enabled compute environment
  - Jobs enter a runnable state but don't progress
  - Jobs remain in runnable status indefinitely
  - The compute environment eventually becomes invalid

I received a notification stating that all EC2 instances in the Batch compute environment were scaled down due to a misconfiguration preventing them from joining the ECS Cluster. The notification suggests reviewing and updating/recreating the compute environment configuration, mentioning possible issues such as:

Any insights on what might be causing this issue would be appreciated.

TotallyGatsby commented 3 months ago

It's quite likely that the resources I set up for this project have become outdated. I need to capture new footage of my property in the next month or so, so I'll look into what the problem is when I do.

egs40 commented 3 months ago

Thanks, I appreciate it.

AlexCarusoFan4 commented 2 months ago

Hi there,

I recently ran into the same issue with my DroneYard solution (https://github.com/AlexCarusoFan4/WinyamaDroneYard).

Looks like instances are launched, but never registered with the ECS cluster. I tried using the latest ECS optimised Amazon Linux AMI, and completely re-deploying my solution, but neither of these worked.

Today I refactored the solution to use the latest aws-cdk-lib for Batch, rather than relying on the alpha package, and am now having success with running imagery processing jobs again.

Would definitely recommend giving that a try - hopefully does the trick for you.

zobis2 commented 1 month ago

I'm having this issue as well. I had a stack that was working well, but since I upgraded it, jobs are stuck in the runnable state. The cause is "failed to start Amazon Elastic Container Service IAM" on the EC2 instance that is created after a spot request; all other parts of the flow work well. I've tried https://github.com/AlexCarusoFan4/WinyamaDroneYard and the codm repo as well and ended up on the same path. Something in the AWS Batch config seems to be off for all of those right now.

If someone finds a solution, I'd be glad to know. Thanks a lot.

AlexCarusoFan4 commented 1 month ago

Hi there,

Have you tried using the non-spot configuration?

I would advise against using spot instances: although they are significantly cheaper, your processing jobs can still be interrupted even if you specify a bid price at the on-demand rate.

In any case I'll be doing a run of some imagery next week and will see if I'm getting the same issue.

UPDATE: Just tried running a quick test. Confirming my deployed on-demand instance solution still runs OK and jobs don't get stuck.

Again would advise against spot instances as this specific ODM workload is not designed to be stateless.

zobis2 commented 1 month ago

@AlexCarusoFan4 First, thanks a lot for the response. What do you mean by ON_DEMAND: changing the bidding strategy from SPOT to ON_DEMAND? I'm actually getting the EC2 instances up and running, but those instances fail to start the entry.sh script because the user data fails somehow, so the same thing will happen even if I launch an ON_DEMAND instance. Anyhow, I'm trying it right now.

AlexCarusoFan4 commented 1 month ago

No problem at all.

Yes that's correct, I would recommend changing the EC2 instance type in your config file to ON_DEMAND.

Do you have the exact error message you're getting with regard to the entry.sh file? If it's that the file can't be found, the likely cause is the line endings of that file in your locally cloned repository.

I initially had this issue when first working with the DroneYard stack. I work from a Windows computer, and had to explicitly change the line endings of the .sh file (from CRLF to LF) for it to be readable once deployed to the Linux ODM container.
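
For anyone wanting to check this locally, here is a minimal Python sketch of the line-ending fix described above. It is not part of DroneYard; the function names are made up for illustration, and the path passed to `normalize_file` is an assumption, so point it at wherever entry.sh lives in your clone.

```python
from pathlib import Path


def has_crlf(data: bytes) -> bool:
    """Return True if the bytes contain Windows-style (CRLF) line endings."""
    return b"\r\n" in data


def to_lf(data: bytes) -> bytes:
    """Replace every CRLF with LF; existing LF-only endings are left untouched."""
    return data.replace(b"\r\n", b"\n")


def normalize_file(path: Path) -> bool:
    """Rewrite the file in place with LF endings.

    Returns True if the file was changed, False if it had no CRLF endings.
    Works on raw bytes so that Python does not translate newlines itself.
    """
    original = path.read_bytes()
    if not has_crlf(original):
        return False
    path.write_bytes(to_lf(original))
    return True


# Hypothetical usage (the location of entry.sh in your checkout is an assumption):
# normalize_file(Path("entry.sh"))
```

After converting, redeploy so the container picks up the fixed script. Adding a `.gitattributes` entry such as `*.sh text eol=lf` is another common way to stop Git on Windows from checking the file out with CRLF endings again.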