apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0
35.26k stars 13.78k forks source link

Reduce memory footprint of s3 key trigger #40473

Closed dstandish closed 4 days ago

dstandish commented 5 days ago

Also had to make the provide_bucket_name decorator support async generators, and in the process i realized we didn't need a separate "async" version of the decorator.

dstandish commented 5 days ago

@eladkal fyi this would change get_file_metadata_async to be an iterator instead of occumulate everything in a list.

this lets _check_key_async return early if it finds a single match. we have a customer who reported high memory usage with s3 trigger so i looked into it and found this optimization.

but it would be technically breaking so i guess require a major, which is why i mention to you.

also the removal of the _async form of the decorator would also be technically "breaking"

eladkal commented 5 days ago

but it would be technically breaking so i guess require a major,

Please add 9.0.0 to versions list https://github.com/apache/airflow/blob/26768d94085e8fb6ab188be01f860f18278293cc/airflow/providers/amazon/provider.yaml#L27-L28

also please short description at the top of the change log https://github.com/apache/airflow/blob/main/airflow/providers/amazon/CHANGELOG.rst explaining to users what has been changed and what it means for them (how to migrate from 8.x to 9.0), similar to what we do with newsfragment for Airflow core)

potiuk commented 4 days ago

@eladkal fyi this would change get_file_metadata_async to be an iterator instead of occumulate everything in a list.

this lets _check_key_async return early if it finds a single match. we have a customer who reported high memory usage with s3 trigger so i looked into it and found this optimization.

but it would be technically breaking so i guess require a major, which is why i mention to you.

also the removal of the _async form of the decorator would also be technically "breaking"

I am not sure whether it's breaking. It changes the behaviour, for sure (like any other change) but the likelihood it actually breaks someone's workflow is very low. If you are iterating over the results (which is the intended usage of that method), things will continue to work. So the INTENTION of the API has not changed - some of the internal details of it and characteristics (less memory used, lazy evaluation) changed. Also - the decorators are by definition not part of the API - they are part of the method implementation and again - impact the implementation detail, but - unless you inspect details of the decorator - from the user perspective, nothing changed.

IMHO this is literally just an optimization, and I would not qualify it as a breaking change.

This is a great example of something that I advocate for quite some time - not every "method signature" change - even if technically breaking, is breaking. What reallly counts is the intentions of the one who wrote the method on how it should be used, matched with assesment of probabllity the changed thing was used properly - according to the intention.

In this case, the intention has not changed, it should be used in the same way before and after. And IMHO probability that it was used not according to the intention and this change will break it, is very low.

potiuk commented 4 days ago

BTW. Yes it shoudl be mentioned in the changelog, but not as a breaking change IMHO.

eladkal commented 4 days ago

In that case all we need is just the info note I mentioned. I will classify this as bug fix but the note in the change log will make sure that users are aware of the impact

dstandish commented 4 days ago

ok @eladkal i went with the "it's a bugfix" approach and added note in changelog. PTAL

o-nikolas commented 4 days ago

Is there a way to tell the build steps that were done as part of PR that's already been merged? Doc builds are failing in main for the amazon provider, and I'm curious how they didn't fail in this PR.

dstandish commented 4 days ago

Is there a way to tell the build steps that were done as part of PR that's already been merged? Doc builds are failing in main for the amazon provider, and I'm curious how they didn't fail in this PR.

weird, maybe this https://github.com/apache/airflow/actions/runs/9716729092

o-nikolas commented 4 days ago

Is there a way to tell the build steps that were done as part of PR that's already been merged? Doc builds are failing in main for the amazon provider, and I'm curious how they didn't fail in this PR.

weird, maybe this https://github.com/apache/airflow/actions/runs/9716729092

Yeah, that's all I found as well, but that can't be the full set of tests since there is almost nothing there. Or maybe it was and that explains how it got through, I'm not sure.

potiuk commented 14 hours ago

Is there a way to tell the build steps that were done as part of PR that's already been merged? Doc builds are failing in main for the amazon provider, and I'm curious how they didn't fail in this PR.

Click "View details" button at the last push :

Screenshot 2024-07-02 at 22 30 40