apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0
36.49k stars 14.13k forks source link

Airflow SFTPToGCSOperator sftp file exist check #41472

Open kandharvishnu opened 1 month ago

kandharvishnu commented 1 month ago

Description

Currently, the operator fails if a file is not found on the SFTP server. To make this behavior more flexible, we could introduce a new parameter, fail_on_sftp_file_not_exist, allowing users to customize the operator's response in such scenarios. By setting this parameter, users can choose whether the operator should raise an error when a file is missing or simply log a message and continue processing.

Use case/motivation

This enhancement would give users greater control over the operator's behaviour, particularly in environments where missing files are expected or where continued processing is preferred.

Related issues

A similar issue has been opened for the SFTPToS3Operator.

Are you willing to submit a PR?

Code of Conduct

geraj1010 commented 1 month ago

Is there an issue with using an SFTPSensor for checking file existence, prior to running SFTPToGCSOperator?

kandharvishnu commented 1 month ago

Is there an issue with using an SFTPSensor for checking file existence, prior to running SFTPToGCSOperator?

The SFTP sensor will wait until the file is detected. This issue is to determine whether the task should fail if the file is not found.

geraj1010 commented 1 month ago

Is there an issue with using an SFTPSensor for checking file existence, prior to running SFTPToGCSOperator?

The SFTP sensor will wait until the file is detected. This issue is to determine whether the task should fail if the file is not found.

I see. In that case, I don't think it makes sense to have the task succeed if a file is not found. A 'success' would not have much meaning anymore besides that the task simply ran. The interpretation of 'success' would become ambiguous i.e. how can you tell if a file was indeed moved or not without having to further dig into the logs? In addition, it doesn't feel very 'airflow-like' since we would be overloading the operator with functionality that already exists through a sensor.

I'm not sure what your DAG looks like, but perhaps you can use a SFTPSensor and trigger rules for your downstream tasks? There is soft_fail for sensors too, if you want the sensor to 'skip' instead of 'fail'.

kandharvishnu commented 1 month ago

Is there an issue with using an SFTPSensor for checking file existence, prior to running SFTPToGCSOperator?

The SFTP sensor will wait until the file is detected. This issue is to determine whether the task should fail if the file is not found.

I see. In that case, I don't think it makes sense to have the task succeed if a file is not found. A 'success' would not have much meaning anymore besides that the task simply ran. The interpretation of 'success' would become ambiguous i.e. how can you tell if a file was indeed moved or not without having to further dig into the logs? In addition, it doesn't feel very 'airflow-like' since we would be overloading the operator with functionality that already exists through a sensor. There is soft_fail for sensors too, if you want the sensor to 'skip' instead of 'fail'.

I agree with your point, but this is an optional parameter. By default, it will fail the task if the file is not found. If the user sets the fail_on_sftp_file_not_exist to False, it would make some difference.

I'm not sure what your DAG looks like, but perhaps you can use a SFTPSensor and trigger rules for your downstream tasks?

The final task of my DAG is looking for a file, and the dag_run should not fail, though the file is not present. And I don't want to use the sensor to poke for a file.

geraj1010 commented 1 month ago

Is there an issue with using an SFTPSensor for checking file existence, prior to running SFTPToGCSOperator?

The SFTP sensor will wait until the file is detected. This issue is to determine whether the task should fail if the file is not found.

I see. In that case, I don't think it makes sense to have the task succeed if a file is not found. A 'success' would not have much meaning anymore besides that the task simply ran. The interpretation of 'success' would become ambiguous i.e. how can you tell if a file was indeed moved or not without having to further dig into the logs? In addition, it doesn't feel very 'airflow-like' since we would be overloading the operator with functionality that already exists through a sensor. There is soft_fail for sensors too, if you want the sensor to 'skip' instead of 'fail'.

I agree with your point, but this is an optional parameter. By default, it will fail the task if the file is not found. If the user sets the fail_on_sftp_file_not_exist to False, it would make some difference.

I'm not sure what your DAG looks like, but perhaps you can use a SFTPSensor and trigger rules for your downstream tasks?

The final task of my DAG is looking for a file, and the dag_run should not fail, though the file is not present. And I don't want to use the sensor to poke for a file.

That's unfortunate you don't want to use a sensor for checking for file existence. If the sensor failed (i.e. file didn't exist after some time), you could have it soft_fail and it would skip (instead of fail) and you could choose which downstream tasks get skipped or ran based on trigger rules.

At any rate, I think as long as you use the trigger rule all_done for one (or more) of the downstream tasks, the dag_run will still be marked as successful. So you could have your SFTPToGCSOperator task fail (because the file doesn't exist on SFTP), and have the trigger rule for all other downstream tasks set to all_done, and they will still run and the dag_run will still be considered successful.