apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0
36.95k stars 14.26k forks source link

Use inclusive words in Apache Airflow project #15994

Closed turbaszek closed 2 years ago

turbaszek commented 3 years ago

Description

Apache Software Foundation is discussing how we can improve inclusiveness of projects and raise awareness of conscious language. Related thread on diversity@a.o: https://lists.apache.org/thread.html/r2d8845d9c37ac581046997d980464e8a7b6bffa6400efb0e41013171%40%3Cdiversity.apache.org%3E

Use case / motivation

We already have pre-commit check that checks for some word. However, on CLC (Conscious Language Checker) Apache Airflow seems to have problems with the following words:

Are you willing to submit a PR?

Related Issues

12982 https://github.com/apache/airflow/pull/9175

turbaszek commented 3 years ago

CC @leahecole

leahecole commented 3 years ago

Love it. A couple of clarifying questions

turbaszek commented 3 years ago
  • If I'm understanding correctly, the pre commit check is preventing the introduction of new instances of these words, but these are the words we have that were already in? (legacy non-inclusive words, I suppose? 😁 )

That's one reason. The other one is that #12982 is still opened. Also I some words cannot be removed because are used by 3rd party tools

  • Is this issue to go through and remove said words, or to introduce more automation?

Both, when we remove the word we should add it to existing rule or create new one similar to this: https://github.com/apache/airflow/blob/476d0f6e3d2059f56532cda36cdc51aa86bafb37/.pre-commit-config.yaml#L333-L341

Patil2099 commented 3 years ago

I would love to take this up @turbaszek

turbaszek commented 3 years ago

@Patil2099 I assigned you to this ticket 👏

Patil2099 commented 3 years ago

I am not an expert with regular expression. Can you help me with the regular expression? @turbaszek

turbaszek commented 3 years ago

I am not an expert with regular expression. Can you help me with the regular expression? @turbaszek

Sure, what we would like to check? What words?

Patil2099 commented 3 years ago

Sorry for the late reply, I have to edit the regex in pre-commit to solve this issue right? @turbaszek. I have to exclude all the words right? mentioned in https://github.com/apache/airflow/issues/15994#issue-898308474. I need a little help in that.

ktmud commented 3 years ago

Thanks for taking on this! Is there any chance that DummyOperator can be renamed to something else, too? I've heard dummy may be a non-inclusive word, too.

turbaszek commented 3 years ago

Sorry for the late reply, I have to edit the regex in pre-commit to solve this issue right? @turbaszek. I have to exclude all the words right? mentioned in #15994 (comment). I need a little help in that.

@Patil2099 yes, we should add similar hook to pre-commit-config as this one: https://github.com/apache/airflow/blob/476d0f6e3d2059f56532cda36cdc51aa86bafb37/.pre-commit-config.yaml#L333-L341 The regexp should be like (?i)(\bmaster\b|\bhe\b|\bshe\b|\bhis\b|\bher\b) so we match exact word not substrings like HElp. Also we will need to adjust all occurrences of non-inclusive words.

Thanks for taking on this! Is there any chance that DummyOperator can be renamed to something else, too? I've heard dummy may be a non-inclusive word, too.

@ktmud that's interesting suggestion. This would be a breaking change but I think it's worth trying. Any suggestions for a new name? CC @leahecole @kaxil @potiuk

potiuk commented 3 years ago

Yeah. We should likely change DummyOperator but we should deprecate it not remove (and remove it in 3.0). The 'NoOpOperator' or similar could be a good name.

eladkal commented 3 years ago

How about EmptyOperator or TasklessOperator (not sure if this may cause confusion as with Tasks)?

turbaszek commented 3 years ago

How about EmptyOperator or TasklessOperator (not sure if this may cause confusion as with Tasks)?

I like EmptyOperator, I was wondering about NodeOperator because in most of production use cases this operator represent just a node in a DAG.

kaxil commented 3 years ago

Hmmm.. If more people think that dummy is not an inclusive word then yes. What do others think, example Twitter Engg https://twitter.com/TwitterEng/status/1278733305190342656/photo/1 does think that "dummy value" term is non-inclusive.

How about EmptyOperator or TasklessOperator (not sure if this may cause confusion as with Tasks)?

potiuk commented 3 years ago

Yeah. Dummy is not only non-inclusive but also well a bit dummy :). I like EmptyOperator

uranusjr commented 3 years ago

In my AIP-39 implementation I called a time table that never schedules (schedule_interval=None) a NullTimeTable so that's another possibility. It's probably a good idea to keep the naming consistent, so I'll change it if we decide to go with something else.

turbaszek commented 3 years ago

+1 for NullOperator

potiuk commented 3 years ago

+1 works for me

On Tue, Jun 8, 2021 at 10:36 AM Tomek Urbaszek @.***> wrote:

+1 for NullOperator

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/apache/airflow/issues/15994#issuecomment-856576000, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAERMIZ76PYYKTBX4XIU3P3TRXJCPANCNFSM45JVCF6A .

-- +48 660 796 129

leahecole commented 3 years ago

NullOperator works for me - I have also seen "Placeholder" as a substitution for dummy in our docs if you prefer that

eladkal commented 3 years ago

As for dummy other than DummyOpeartor we also have TriggerRule.DUMMY https://github.com/apache/airflow/blob/5cd0bf733b839951c075c54e808a595ac923c4e8/airflow/utils/trigger_rule.py#L33

pateash commented 3 years ago

+1 for NullOperator

ashb commented 3 years ago

NoOpOperator

What, not "Noperator"? 😉

+1 for NullOperator.

eladkal commented 3 years ago

Opened PR https://github.com/apache/airflow/pull/17144 to handle the Dummy trigger rule deprecation. A question was raised in the PR review if we want NULL trigger rule or something else.

potiuk commented 3 years ago

After some comments in #17144 - how about NoneOperator ? We are in Python world and Null Operator does not feel right :)

kaxil commented 3 years ago

I have created polls:

Looks like EmptyOperator is getting more votes

patchandpray commented 3 years ago

NullOperator imo fits the bill as it identifies that an operation did occur it just did not create any value but did create some state in the metadata database also it does not conflict with Python's None, if it was named NoneOperator I would assume it would return None

leahecole commented 3 years ago

I have created polls:

Looks like EmptyOperator is getting more votes

Lol I think I said NullOperator here and EmptyOperator on Twitter 🤦🏻‍♀️ . And @ashb the Noperator makes me laugh and think of the Nopetopus tenor

andriisoldatenko commented 3 years ago

i like NullOperator too.

potiuk commented 3 years ago

It's really hilarious to see how we have all different opinions and how eager we are to participate in this discussion :)

potiuk commented 3 years ago

BTW. @leahecole - I also casted different votes, just for the fun of it :D

turbaszek commented 3 years ago

It's really hilarious to see how we have all different opinions and how eager we are to participate in this discussion :)

It reminds me discussion about logo re-design or asserts in production. But I like the idea of polls 👍 Do I correctly understand that we will pick the winning option @kaxil ?

kaxil commented 3 years ago

Probably the same as other votes we have had, we can consider all community votes as non-binding but committers & PMC can have binding votes. The Community votes at least give us direction on what our users want :) so the pressure on PMC. But open to suggestions. Looks like EmptyOperator is leading on all the polls followed by NullOperator and NoneOperator.

All the three mediums have the same voting results:

image image image

eladkal commented 3 years ago

I think EmptyOperator is a good choice.

DeOldSax commented 3 years ago

How about SentinelOperator? Imho in the domain of algorithms and data structures it's exactly that.

eladkal commented 2 years ago

PR for renaming DummyOperator: https://github.com/apache/airflow/pull/22832

The only task left on this issue is the task that this issue was created for :) We need a pre commit to improve inclusiveness as explained in the issue description.

edithturn commented 2 years ago

@eladkal I would love to take this issue

edithturn commented 2 years ago

Hello, I am taking this issue. In this step in pre-commit, we have words with "black" and it won't pass "language matters" validation, also we have repositories links like this: - repo: https://github.com/psf/black and this: - id: trailing-whitespace in the same pre-commit file Screenshot from 2022-04-19 10-47-31

Should we exclude .pre-commit-config.yam in validations?

Let me know if I missed something

eladkal commented 2 years ago

I don't think the hook scans the file it's defined in because then you won't be able to place the words you want to scan for...?

We have language-matters hook that checks for white/black (though I don't know why we exclude the files listed in it):

https://github.com/apache/airflow/blob/10078287e2bc5df0cf4f1889fdd6714c6a766b9e/.pre-commit-config.yaml#L407-L423

edithturn commented 2 years ago

This regex (?i)(black|white)[_-]?list just will find black and white followed with "list", if I am not wrong, for that reason it doesn't work completely with all words.

The bracketed characters [_-] (underscore and dash) indicate that the alphanumeric pattern must be followed by an underscore or a dash.

The correct could be this: (?ix) entry: > (?ix) \bmaster\b| \bhe\b| \bshe\b| \bhis\b| \bher\b| \bslave\b| \bsanity\b| \bdummy\b| \bwhite\b| \bblack\b

i = Use case-insensitive matching. For more information, (edited) x = Exclude unescaped white space from the pattern, and enable comments after a number sign (#).

I will try!

edithturn commented 2 years ago

**@eladkal I am not sure why we are excluding these files. Some of them exist and some don't. Some are from "provider/docs", so I am assuming that they are being excluded because we are using third-party documentation that we cannot change after validation.

(Exist)  ^airflow/www/fab_security/manager\.py$|
(Exist)  ^airflow/providers/apache/cassandra/hooks/cassandra\.py$|
(Exist)  ^airflow/providers/apache/hive/operators/hive_stats\.py$|
(Exist)  ^.pre-commit-config\.yaml$|
(Doesn't Exist) ^airflow/providers/apache/hive/.*PROVIDER_CHANGES_*|
(Doesn't Exist)  ^airflow/providers/apache/hive/.*README\.md$|
(Exist)  ^tests/providers/apache/cassandra/hooks/test_cassandra\.py$|
(Exist)  ^docs/apache-airflow-providers-apache-cassandra/connections/cassandra\.rst$|
(Exist)  ^docs/apache-airflow-providers-apache-hive/commits\.rst$|
(Doesn't Exist) git**
edithturn commented 2 years ago

It is working, should I work to change these words in CHANGELOG: master, dummy..etc?

No relative imports......................................................................Passed
Check for language that we do not accept as community....................................Failed
- hook id: language-matters
- exit code: 1

CHANGELOG.txt:358:- Deprecate dummy trigger rule in favor of always (#17144)
CHANGELOG.txt:1091:- Fix webserver exiting when gunicorn master crashes (#13518)(#13780)
CHANGELOG.txt:1176:- Fix link to Airflow master branch documentation (#13179)
CHANGELOG.txt:1232:- Fix webserver exiting when gunicorn master crashes (#13470)
CHANGELOG.txt:1603:- Move k8s executor out of contrib to closer match master (#8904)
CHANGELOG.txt:1654:- Update README to remove Python 3.8 limitation for Master (#9451)
CHANGELOG.txt:1752:- Dont schedule dummy tasks (#7880)
CHANGELOG.txt:[244](https://github.com/apache/airflow/runs/6082120417?check_suite_focus=true#step:11:244)9:- [AIRFLOW-5301] Some not-yet-available files from breeze are committed to master (#5901)
CHANGELOG.txt:2912:- [AIRFLOW-3817] Corrected task ids returned by BranchPythonOperator to match the dummy operator ids (#4659)
CHANGELOG.txt:3772:- [AIRFLOW-1314] Rebasing against master
CHANGELOG.txt:4708:- [AIRFLOW-899] Tasks in SCHEDULED state should be white in the UI instead of black
CHANGELOG.txt:4861:- [AIRFLOW-899] Tasks in SCHEDULED state should be white in the UI instead of black
CHANGELOG.txt:5058:- [AIRFLOW-425] Add white fill for null state tasks in tree view.
CHANGELOG.txt:5326:- Merge branch 'master' into hivemeta_sasl
CHANGELOG.txt:5372:- [hotfix] typo that made it in master
CHANGELOG.txt:5375:- Merge remote-tracking branch 'upstream/master' into minicluster
CHANGELOG.txt:5397:- Merge remote-tracking branch 'upstream/master'
CHANGELOG.txt:5418:- Merge branch 'airbnb/master'`

PR related: https://github.com/apache/airflow/pull/23090 any feedback is appreciated it

ashb commented 2 years ago

It is working, should I work to change these words in CHANGELOG: master, dummy..etc?


No relative imports......................................................................Passed
Check for language that we do not accept as community....................................Failed
- hook id: language-matters
- exit code: 1

CHANGELOG.txt:358:- Deprecate dummy trigger rule in favor of always (#17144)

We can't reword that one - it's a notice to the users that the class has changed name.

The others were correct at the time the were written (master branch) so it doesn't make sense to change them, but we could do of we want

edithturn commented 2 years ago

@ashb agree with this, then I'm going to add this file CHANGELOG to exclusions.

leahecole commented 2 years ago

@edithturn thanks for taking this on! +1 to Ash's suggestion to exclude the CHANGELOG. Rewriting history may cause more confusion than it does good - as much as I'm always in favor of upgrading language, it might make it harder to tell in the changelog when this change actually occurred.

edithturn commented 2 years ago

Hello! I have a doubt regarding this:

I am excluding these files:

^airflow/www/fab_security/manager.py$| ^airflow/providers/| ^tests/providers/apache/cassandra/hooks/test_cassandra.py$| ^docs/apache-airflow-providers-apache-cassandra/connections/cassandra.rst$| ^docs/apache-airflow-providers-apache-hive/commits.rst$| ^docs/*.*$| ^tests/providers/| ^.pre-commit-config\.yaml$| ^.*RELEASE_NOTES\.rst$| ^.*CHANGELOG\.txt$|^.*CHANGELOG\.rst$|

And the pre-commit is showing an error in these other files:

Screenshot from 2022-04-26 22-37-49

PR: https://github.com/apache/airflow/pull/23090

I'm not sure how to deal with them, should I exclude them or fix them to change the terms?

eladkal commented 2 years ago

We should fix what we can For example sanity has many alternatives: https://gist.github.com/seanmhanson/fe370c2d8bd2b3228680e38899baf5cc

potiuk commented 2 years ago

I had no idea "sanity check" is a problem. TIL