chrisgorgo closed this issue 6 years ago
This job is stuck because it's running into the permission issue that was fixed, but the fix is not included in the 0.0.3 tag.
I see - I just triggered a new release of the mindboggle BIDS App with the fix.
I would expect this to fail when running rather than being stuck in runnable. Is there a blacklist of apps/versions with known issues preventing them from running?
Yeah, this should hang in starting/running or fail and you are right, this is another issue with this job. I've reopened the Amazon support ticket (4167759301) for this Batch deadlock.
FYI - I deployed 0.0.4 and it still seems to be stuck in RUNNABLE.
On Mon, Oct 2, 2017 at 11:31 AM, Nell Hardcastle notifications@github.com wrote:
Yeah, this should hang in starting/running or fail and you are right, this is another issue with this job. I've reopened the Amazon support ticket (4167759301) for this Batch deadlock.
As a workaround for this, you can test with +/- 500MB of memory from the 8000MB value used. This is caused by ECS starting workers that cannot run the job due to a lack of memory, which is a unit conversion bug in Batch or ECS. The job requests 8388608000 bytes and ECS starts an instance with 8000000000 bytes, which never accepts the job. Upping the memory requested will start a larger instance or lowering it will run on the existing instance.
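To make the mismatch concrete, here is a rough sketch of the arithmetic. It assumes the 8000 value in the job is read as MiB while the instance's memory is counted in decimal MB, which is what the two byte counts above suggest. The names (`MIB`, `MB`, `fits`) are made up for illustration and are not part of any Batch or ECS API.

```python
# Hypothetical illustration of the unit mismatch; not Batch/ECS code.
MIB = 1024 * 1024   # binary megabyte (mebibyte)
MB = 1000 * 1000    # decimal megabyte

requested_bytes = 8000 * MIB   # what the job asks for
instance_bytes = 8000 * MB     # what the started instance offers

def fits(requested: int, available: int) -> bool:
    """A job can be placed only if its request does not exceed available memory."""
    return requested <= available

shortfall = requested_bytes - instance_bytes

print(requested_bytes)                        # 8388608000
print(instance_bytes)                         # 8000000000
print(shortfall)                              # 388608000
print(fits(requested_bytes, instance_bytes))  # False: job stays RUNNABLE

# The workaround: 500 less fits the existing instance,
# 500 more exceeds it, so a larger instance has to be started.
print(fits(7500 * MIB, instance_bytes))       # True
print(fits(8500 * MIB, instance_bytes))       # False
```

The instance comes up roughly 389 MB short of the request, so the job is never placed on it.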
bumped it to 8500MB - still only RUNNABLE
On Mon, Oct 2, 2017 at 1:01 PM, Nell Hardcastle notifications@github.com wrote:
As a workaround for this, you can test with +/- 500MB of memory from the 8000MB value used. This is caused by ECS starting workers that cannot run the job due to a lack of memory, which is a unit conversion bug in Batch or ECS. The job requests 8388608000 bytes and ECS starts an instance with 8000000000 bytes, which never accepts the job. Upping the memory requested will start a larger instance or lowering it will run on the existing instance.
Amazon has acknowledged this bug and escalated it to the Batch team.
It looks like your 8500MB job did run; it was waiting on ECS instances to get allocated and started.
Thanks - they run correctly. Closing.
https://openneuro.dev.sqm.io/datasets/ds000105/versions/00001?app=mindboggle&version=1
Job has been in "runnable" state for about an hour now.