Closed MattFanto closed 3 years ago
Well, that's not good. I'll try to replicate on my own AWS cluster for that version of Airflow & see what I get.
I personally have never tried the Fargate Environment for the batch executor, but I've always wanted to. Now's as good a time as any.
Sorry, I was logged into my little sister's account for some reason.
I was able to replicate the error by misassigning IAM roles. Basically Airflow is like "hey executor run this". Then the executor is like "sure". Then Airflow is like, "how's that task going?" The executor is like "It failed". Airflow scheduler is like "Oh my lawd, was the task killed externally somehow?"
Actually, it was. The murder-mystery spoiler alert is that AWS DID IT! To get a failure like this, you must have successfully queued a job in AWS Batch (batch submit-job API), and the scheduler must then have been able to check up on the status of that job (batch describe-jobs API). Therefore the problem must be that the launched container failed in Batch.
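That queue-then-poll handshake can be sketched as a small helper around Batch's describe-jobs response. This is a minimal illustration, not the executor's actual code; the function name and the status-to-state mapping are my own:

```python
def poll_batch_job(batch_client, job_id):
    """Ask AWS Batch how a submitted job is doing (describe-jobs API)
    and translate the answer into the state Airflow would report.
    `batch_client` is assumed to look like boto3.client("batch")."""
    desc = batch_client.describe_jobs(jobs=[job_id])
    job = desc["jobs"][0]
    status = job["status"]
    if status == "SUCCEEDED":
        return ("success", None)
    if status == "FAILED":
        # This is the "task killed externally" path: Batch reports FAILED,
        # and statusReason is what the dashboard shows under "Status reason".
        return ("failed", job.get("statusReason"))
    # SUBMITTED / PENDING / RUNNABLE / STARTING / RUNNING
    return ("running", None)
```

In the real client, `batch_client` would be `boto3.client("batch")` and `job_id` the `jobId` returned by `submit_job`.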
So go to the AWS Batch Dashboard, and see if there are "FAILED" tasks in your job queue. Here's an example of a job that failed for me.
Notice that the "Status reason" field gives the reason for the failure. Turns out I hadn't set the execution IAM role in my job definition properly, so Batch couldn't pull from my Elastic Container Registry (ECR, AWS's answer to DockerHub). Can you share yours?
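For reference, a Fargate job definition with the execution role set looks roughly like this (account IDs, ARNs, image names, and resource sizes below are placeholders, not values from this thread):

```json
{
  "jobDefinitionName": "airflow-worker",
  "type": "container",
  "platformCapabilities": ["FARGATE"],
  "containerProperties": {
    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/airflow:latest",
    "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
    "jobRoleArn": "arn:aws:iam::123456789012:role/airflow-task-role",
    "resourceRequirements": [
      {"type": "VCPU", "value": "1"},
      {"type": "MEMORY", "value": "2048"}
    ],
    "networkConfiguration": {"assignPublicIp": "ENABLED"}
  }
}
```

The `executionRoleArn` is what lets Batch pull the image from ECR; `jobRoleArn` is the role your task code runs under. Without the former, Fargate jobs fail before your container even starts.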
Hi @aelzeiny, thanks for the quick reply. I retried a few times this morning with a debugger on the scheduler and realized that the status returned by AWS was actually FAILED; I probably made a configuration mistake somewhere.
After a cleanup of the Docker images and the Airflow DB, everything is working fine.
It was probably related to a mismatch in the SQL_CONN variable.
Thanks a lot!
First of all, thanks for this incredible idea. I made a small POC using your library, and for a basic DAG it works fine except for one really annoying issue.
Whenever I run a task, it is successfully submitted to AWS Batch, but the outcome in Airflow is always reported as failed, even when the Batch job succeeds; the same happens with the Airflow example DAGs. I checked the log and it seems
Below is my airflow.cfg for Airflow 2.0.1. AWS credentials are set via ENV variables.
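For context, setting AWS credentials over ENV variables typically means something like the following in the scheduler's environment (all values here are placeholders; the SQL connection variable is Airflow 2.x's name for what this thread calls SQL_CONN):

```shell
# Placeholder credentials and region: substitute your own.
export AWS_ACCESS_KEY_ID="AKIA..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_DEFAULT_REGION="us-east-1"
# Airflow also reads its own config from env vars, e.g. the metadata DB:
export AIRFLOW__CORE__SQL_ALCHEMY_CONN="postgresql+psycopg2://user:pass@host:5432/airflow"
```

A mismatch between this connection string and the one the webserver/scheduler containers use is one way to end up with the DB inconsistencies described above.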
Jobs are submitted to AWS Batch with Fargate as the compute environment.
Did you face this issue before?