aelzeiny / airflow-aws-executors

Airflow Executor for both AWS ECS & AWS Fargate
MIT License

Task status reported as failed when AWS Batch jobs are successful #7

Closed MattFanto closed 3 years ago

MattFanto commented 3 years ago

First of all, thanks for this incredible idea. I made a small POC using your library, and it works fine for a basic DAG except for one really annoying issue.

Whenever I run a task, it is successfully submitted to AWS Batch, but the outcome in Airflow is always reported as failed, even when the Batch job succeeds. The same happens with the Airflow example DAGs. I checked the logs and found:

scheduler_1  | [2021-02-10 22:00:07,784] {scheduler_job.py:1199} INFO - Executor reports execution of tutorial.print_date execution_date=2021-02-08 00:00:00+00:00 exited with status failed for try_number 1
scheduler_1  | [2021-02-10 22:00:07,784] {scheduler_job.py:1199} INFO - Executor reports execution of tutorial.print_date execution_date=2021-02-09 00:00:00+00:00 exited with status failed for try_number 1
scheduler_1  | [2021-02-10 22:00:07,791] {scheduler_job.py:1235} ERROR - Executor reports task instance <TaskInstance: tutorial.print_date 2021-02-08 00:00:00+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?

Below is my airflow.cfg for Airflow 2.0.1:

[core]
executor = aws_executors_plugin.AwsBatchExecutor
load_examples = False

[logging]
remote_logging = True
remote_base_log_folder = s3://b2waste-data-logs/airflow-logs
remote_log_conn_id = s3Logs
# Use server-side encryption for logs stored in S3
encrypt_s3_logs = False

[webserver]
workers = 2

[batch]
region = eu-central-1
job_name = airflow-task
job_queue = airflow-job-queue
job_definition = airflow-job-defined

AWS credentials are set via environment variables.

Jobs are submitted to AWS Batch with Fargate as the compute environment.
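For context, here is a minimal sketch of how a config like the `[batch]` section above would typically be translated into a Batch `submit_job` call via boto3. The `build_submit_job_kwargs` helper is hypothetical (it is not part of the plugin's API) and only illustrates the mapping:

```python
# Values taken from the [batch] section of the airflow.cfg above
BATCH_CONFIG = {
    "region": "eu-central-1",
    "job_name": "airflow-task",
    "job_queue": "airflow-job-queue",
    "job_definition": "airflow-job-defined",
}


def build_submit_job_kwargs(config: dict, airflow_command: list) -> dict:
    """Translate the executor config into kwargs for batch.submit_job.

    The Airflow CLI command for the task instance is passed as a
    container command override, so the job definition's image runs
    that specific task.
    """
    return {
        "jobName": config["job_name"],
        "jobQueue": config["job_queue"],
        "jobDefinition": config["job_definition"],
        "containerOverrides": {"command": airflow_command},
    }


# Live usage (requires boto3 and AWS credentials; not executed here):
# import boto3
# batch = boto3.client("batch", region_name=BATCH_CONFIG["region"])
# response = batch.submit_job(**build_submit_job_kwargs(
#     BATCH_CONFIG,
#     ["airflow", "tasks", "run", "tutorial", "print_date", "2021-02-08"],
# ))
# print(response["jobId"])
```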

Did you face this issue before?

lelzeiny commented 3 years ago

Well, that's not good. I'll try to replicate on my own AWS cluster for that version of Airflow & see what I get.

lelzeiny commented 3 years ago

I personally have never tried the Fargate Environment for the batch executor, but I've always wanted to. Now's as good a time as any.

aelzeiny commented 3 years ago

Sorry, I was logged into my little sister's account for some reason.

I was able to replicate the error by misassigning IAM roles. Basically, Airflow is like "hey executor, run this". The executor is like "sure". Then Airflow is like, "how's that task going?" The executor is like "it failed". The Airflow scheduler is like "oh my lawd, was the task killed externally somehow?"

Actually, it was. The murder-mystery spoiler alert is that AWS DID IT! To get a failure like this, you must have successfully queued a job (the batch submit-job API), and the scheduler must then have successfully checked up on that job's status (the batch describe-jobs API). Therefore the problem must be that the launched container itself failed in Batch.
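The flow described above can be sketched as follows. This is an illustrative assumption of how a Batch-backed executor maps `describe-jobs` results onto Airflow task states, not the plugin's actual code; any status not yet terminal is treated as still running:

```python
# Terminal Batch job states mapped onto Airflow-style outcomes.
# Non-terminal states (SUBMITTED, PENDING, RUNNABLE, STARTING, RUNNING)
# map to None, meaning "keep polling".
BATCH_TERMINAL_STATES = {"SUCCEEDED": "success", "FAILED": "failed"}


def airflow_state_for(describe_jobs_response: dict, job_id: str):
    """Return 'success'/'failed' for a finished Batch job, else None.

    `describe_jobs_response` has the shape returned by
    batch.describe_jobs(jobs=[job_id]): {"jobs": [{"jobId": ..., "status": ...}]}
    """
    for job in describe_jobs_response["jobs"]:
        if job["jobId"] == job_id:
            return BATCH_TERMINAL_STATES.get(job["status"])
    return None


# Live usage (requires boto3 and AWS credentials; not executed here):
# import boto3
# batch = boto3.client("batch", region_name="eu-central-1")
# resp = batch.describe_jobs(jobs=[job_id])
# state = airflow_state_for(resp, job_id)  # the scheduler sees this outcome
```

So when the container dies inside Batch, the job lands in FAILED, the executor dutifully reports "failed", and the scheduler logs the "was the task killed externally?" error seen above.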

So go to the AWS Batch Dashboard, and see if there are "FAILED" tasks in your job queue. Here's an example of a job that failed for me.

[screenshot: a failed job in the AWS Batch console]

Notice that the "status reason" field gives the reason for the failure. It turns out I hadn't set the execution IAM role in my job definition properly, so Batch couldn't pull the image from my Elastic Container Registry (ECR, roughly AWS's DockerHub). Can you share yours?
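Besides the console, the same "status reason" can be pulled programmatically. A small sketch, assuming boto3 and the queue name from the config above; the `failed_job_reasons` helper is hypothetical and just unpacks the `list_jobs` response shape:

```python
def failed_job_reasons(list_jobs_response: dict) -> list:
    """Extract (jobName, statusReason) pairs from a batch.list_jobs response.

    `list_jobs_response` has the shape
    {"jobSummaryList": [{"jobName": ..., "statusReason": ...}, ...]}.
    """
    return [
        (summary["jobName"], summary.get("statusReason", "<no reason recorded>"))
        for summary in list_jobs_response.get("jobSummaryList", [])
    ]


# Live usage (requires boto3 and AWS credentials; not executed here):
# import boto3
# batch = boto3.client("batch", region_name="eu-central-1")
# resp = batch.list_jobs(jobQueue="airflow-job-queue", jobStatus="FAILED")
# for name, reason in failed_job_reasons(resp):
#     print(name, reason)
```

A misconfigured execution role typically shows up here as a "CannotPullContainerError"-style reason.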

MattFanto commented 3 years ago

Hi @aelzeiny, thanks for the quick reply. I retried a few times this morning with a debugger attached to the scheduler and realized that the status returned by AWS was actually FAILED, so I had probably made a configuration mistake.

After cleaning up the Docker images and the Airflow DB, everything is working fine.

[screenshot: DAG runs succeeding]

It was probably related to a mismatch in the SQL_CONN variable.

Thanks a lot