[dev] job stuck in "runnable" state while queue empty

OpenNeuroOrg / openneuro

A free and open platform for analyzing and sharing neuroimaging data

https://openneuro.org/

MIT License

[dev] job stuck in "runnable" state while queue empty #120

Closed chrisgorgo closed 6 years ago

chrisgorgo commented 6 years ago

https://openneuro.dev.sqm.io/datasets/ds000105/versions/00001?app=mindboggle&version=1

Job has been in "runnable" state for about an hour now.

nellh commented 6 years ago

This job is stuck because it's running into the permission issue that was fixed but is not included in the 0.0.3 tag.

chrisgorgo commented 6 years ago

I see - I just triggered a new release of the mindboggle BIDS App with the fix.

I would expect this to fail when running rather than being stuck in runnable. Is there a blacklist of apps/versions with known issues that prevent them from running?

nellh commented 6 years ago

Yeah, this should hang in starting/running or fail and you are right, this is another issue with this job. I've reopened the Amazon support ticket (4167759301) for this Batch deadlock.

chrisgorgo commented 6 years ago

FYI - I deployed 0.0.4 and it still seems to be stuck in RUNNABLE.

On Mon, Oct 2, 2017 at 11:31 AM, Nell Hardcastle notifications@github.com wrote:

Yeah, this should hang in starting/running or fail and you are right, this is another issue with this job. I've reopened the Amazon support ticket (4167759301) for this Batch deadlock.

nellh commented 6 years ago

As a workaround for this, you can test with about 500MB more or less memory than the 8000MB value used. This is caused by ECS starting workers that cannot run the job due to a lack of memory, which is a unit conversion bug in Batch or ECS. The job requests 8388608000 bytes and ECS starts an instance with 8000000000 bytes, which never accepts the job. Raising the requested memory will start a larger instance, while lowering it will let the job run on the existing instance.
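
To make the arithmetic behind those two byte counts concrete, here is a minimal Python sketch. It assumes the 8000 value is read as binary megabytes (MiB) when the job is sized and as decimal megabytes when the instance is sized, which is only inferred from the two numbers above. The function names are hypothetical and not part of Batch, ECS, or OpenNeuro, and real instances also reserve some memory for the OS and agent, which this ignores.

```python
MIB = 1024 * 1024  # bytes in a binary megabyte (mebibyte)
MB = 1000 * 1000   # bytes in a decimal megabyte


def requested_bytes(memory_mib: int) -> int:
    """Bytes requested by a job whose memory value is read as MiB (assumption)."""
    return memory_mib * MIB


def instance_bytes(memory_mb: int) -> int:
    """Bytes offered by an instance sized in decimal MB (assumption)."""
    return memory_mb * MB


def job_fits(request_mib: int, instance_mb: int) -> bool:
    """True if the job's request fits into the instance's memory."""
    return requested_bytes(request_mib) <= instance_bytes(instance_mb)


if __name__ == "__main__":
    print(requested_bytes(8000))  # 8388608000
    print(instance_bytes(8000))   # 8000000000
    print(job_fits(8000, 8000))   # False: the job never gets placed
    print(job_fits(7500, 8000))   # True: lowering the request fits
```

Under these assumptions an 8000 request can never be placed on the instance started for it, while a request 500 lower fits on the same instance.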

chrisgorgo commented 6 years ago

bumped it to 8500MB - still only RUNNABLE

On Mon, Oct 2, 2017 at 1:01 PM, Nell Hardcastle notifications@github.com wrote:

As a workaround for this, you can test with +/- 500MB of memory from the 8000MB value used. This is caused by ECS starting workers that cannot run the job due to a lack of memory, which is a unit conversion bug in Batch or ECS. The job requests 8388608000 bytes and ECS starts an instance with 8000000000 bytes, which never accepts the job. Upping the memory requested will start a larger instance or lowering it will run on the existing instance.

nellh commented 6 years ago

Amazon has acknowledged this bug and escalated it to the Batch team.

It looks like your 8500MB job did run; it was waiting on ECS instances to get allocated and started.

chrisgorgo commented 6 years ago

Thanks - they run correctly. Closing.