Add Python type annotations

zero323 commented 4 years ago

This PR adds type annotations for Python module, provided as stub files.

This should:

Improve auto-completion performance in editors which support type hints.
Optionally enable static checking through mypy or built-in tools (for example Pycharm tools).
Provide additional way to document usage of certain methods, where arguments are required, despite providing default values.

databricks-cla-assistant commented 4 years ago

All committers have signed the CLA.

tdas commented 4 years ago

@zero323 my apologies for not having taken a look at this PR. I am unfamiliar with this aspect of python. can you tell me more about these annotations? Specifically

How does it affect forward and backward compatibility?
Does these annotations always have to be in sync with the code? What happens if they are not in sync (e.g., we add a method in the python code but forget to add it to the annotation)?

zero323 commented 4 years ago

@zero323 my apologies for not having taken a look at this PR.

No worries.

I am unfamiliar with this aspect of python. can you tell me more about these annotations? Specifically

How does it affect forward and backward compatibility?

In case of stub files there is no practical on forward or backward compatibility (inline annotation, not proposed here, are not backward compatible beyond Python 3.5). Annotations are only used by external tools, and have no impact on actual execution.

Does these annotations always have to be in sync with the code? What happens if they are not in sync (e.g., we add a method in the python code but forget to add it to the annotation)?

Obviously this is the preferred to keep things consistent, as annotations which are not synced, provide misleading feedback. This might be a problem, if user use CI pipelines, which are aware of type annotations (type checkers and such).

Other than that, there is no impact at all (might be in the future, if users couple things with emerging tools like mypyc).

zsxwing commented 2 years ago

@zero323 Thanks for the contribution. Are you still interested in this? If so, could you update this? This would be very helpful and would help us find issues such as #726

zero323 commented 2 years ago

@zsxwing Sure.

Since I opened this PR, Spark started moving from stubs to inline hints. Should we consider the same here?

tdas commented 2 years ago

can you show examples of where inline hints are used in spark. i still see a lot of stubs .pyi in the spark file.
are there any version constraints of using either stubs or hints within python 3?

zero323 commented 2 years ago

@tdas

can you show examples of where inline hints are used in spark. i still see a lot of stubs .pyi in the spark file.

SPARK-36845

are there any version constraints of using either stubs or hints within python 3?

If you're asking about versions that are not passed their End-of-life, then no. However, there are some considerations regarding usage of special features (Literals, Protocols etc.), which, depending on a version, might require typing-extensions or might be unavailable.

zero323 commented 2 years ago

@zsxwing @tdas Annotations and supporting code should be up-to-date now.

So, about switching to in-line annotations.... Any thoughts? If you're motivation is to support delta devs, as much as the users, in-line annotations are the way to go (in general type checkers don't validate the code, when corresponding stubs are present. This has pros and cons, but doesn't seem like what you're looking for now).

zsxwing commented 2 years ago

Since I opened this PR, Spark started moving from stubs to inline hints. Should we consider the same here?

@zero323 Thanks for pointing out this. I think that's better. Could you update your PR using inline hints?

zero323 commented 2 years ago

Could you update your PR using inline hints?

I'll work on that, but it might take a while.

zero323 commented 2 years ago

@zsxwing, could you be so kind and approve the workflow? TIA!

zsxwing commented 2 years ago

could you be so kind and approve the workflow? TIA!

Done. Thanks for the update!

scottsand-db commented 2 years ago

Hi @zero323. We are reviewing this now. Please know, we won't get around to merging this until after the Delta OSS 1.1 release

zero323 commented 2 years ago

Thanks for the update @scottsand-db

scottsand-db commented 2 years ago

@xinrong-databricks and @ueshin what's the status on the re-review?

xinrong-meng commented 2 years ago

LGTM if tests passed. Thanks!

scottsand-db commented 2 years ago

Hi @zero323 thanks for all the work on this PR. Almost there! this PR is failing python tests since commit 417dc53 with error (below). Can you please take a look and fix the failing tests?

python/delta/tests/test_deltatable.py:404: error: Invalid self argument "RDD[Row]" to attribute function "collectAsMap" with type "Callable[[RDD[Tuple[K, V]]], Dict[K, V]]"  [misc]
python/delta/tests/test_deltatable.py:404: error: Incompatible types in assignment (expression has type "Dict[<nothing>, <nothing>]", variable has type "Dict[str, str]")  [assignment]
Found 2 errors in 1 file (checked 12 source files)
Deleted /home/runner/.ivy2/cache/io.delta 
### Executing cmd: /home/runner/work/delta/delta/build/sbt clean publishM2
### Executing cmd: /home/runner/work/delta/delta/dev/lint-python
##### Running mypy tests #####
### Executing cmd: mypy --config-file /home/runner/work/delta/delta/python/mypy.ini /home/runner/work/delta/delta/python/delta
Traceback (most recent call last):
  File "python/run-tests.py", line 175, in <module>
    run_mypy_tests(root_dir)
  File "python/run-tests.py", line 113, in run_mypy_tests
    ], stream_output=True)
  File "python/run-tests.py", line 83, in run_cmd
    raise Exception("Non-zero exitcode: %s" % (exit_code))
Exception: Non-zero exitcode: 1
##### Running SBT tests #####
##### Running Python tests #####
Calling script %s python/run-tests.py
Traceback (most recent call last):
  File "run-tests.py", line 76, in <module>
    run_python_tests(root_dir)
  File "run-tests.py", line 34, in run_python_tests
    run_cmd(["python", python_test_script], stream_output=True)
  File "run-tests.py", line 46, in run_cmd
    raise Exception("Non-zero exitcode: %s" % (exit_code))
Exception: Non-zero exitcode: 1
Error: Process completed with exit code 1.

zero323 commented 2 years ago

@scottsand-db I've seen this and I am investigating, but getting inconsistent behavior for ignores between local dev environment

➜  delta git:(annotations) git rev-parse HEAD
4ea8fff69a4631c91984abcc854a8fedcd26677c
➜  delta git:(annotations) mypy --version                                 
mypy 0.910
➜  delta git:(annotations) rm -Rf .mypy_cache                             
➜  delta git:(annotations) sha256sum python/delta/tests/test_deltatable.py
d34acbe3ad488c65b1734fadf03e1732161f2c8a27e26b8ece72bb39d4030670  python/delta/tests/test_deltatable.py
➜  delta git:(annotations) mypy --no-incremental --config-file python/mypy.ini python/delta 
Success: no issues found in 12 source files

and CI setup

dev@7fd042106847:~/delta$ git rev-parse HEAD
4ea8fff69a4631c91984abcc854a8fedcd26677c
dev@7fd042106847:~/delta$ pipenv run mypy --version
mypy 0.910
dev@7fd042106847:~/delta$ sha256sum python/delta/tests/test_deltatable.py 
d34acbe3ad488c65b1734fadf03e1732161f2c8a27e26b8ece72bb39d4030670  python/delta/tests/test_deltatable.py
dev@7fd042106847:~/delta$ rm -Rf .mypy_cache/
dev@7fd042106847:~/delta$ pipenv run mypy --no-incremental --config-file python/mypy.ini python/delta
python/delta/tests/test_deltatable.py:404: error: Invalid self argument "RDD[Row]" to attribute function "collectAsMap" with type "Callable[[RDD[Tuple[K, V]]], Dict[K, V]]"  [misc]
python/delta/tests/test_deltatable.py:404: error: Incompatible types in assignment (expression has type "Dict[<nothing>, <nothing>]", variable has type "Dict[str, str]")  [assignment]
Found 2 errors in 1 file (checked 12 source files)

I'll ping you all, once I figure out what causes the issue.

zero323 commented 2 years ago

Current error seems to be unrelated to PR (already present on https://github.com/delta-io/delta/commit/5571d95966e3b03c4699dcfa1a5cd86bb16447dc)

 ImportError: cannot import name 'InstallCommand' from 'pipenv.vendor.pip_shims.shims' (/usr/local/lib/python3.8/dist-packages/pipenv/vendor/pip_shims/shims.py)
/usr/share/python-wheels/requests-2.22.0-py2.py3-none-any.whl/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.6) or chardet (3.0.4) doesn't match a supported version!

zsxwing commented 2 years ago

@zero323 I'm fixing it in https://github.com/delta-io/delta/pull/821

zsxwing commented 2 years ago

@zero323 you can merge master into your PR to fix the build.

zero323 commented 2 years ago

Thanks @zsxwing!

zero323 commented 2 years ago

I really cannot tell what causes difference in ignore handling between CI and local env, but both seem to be happy with the current position of type: ignore (seems weird to me, TBH).

scottsand-db commented 2 years ago

@zero323 do we need the empty file python/delta/py.typed ?

zero323 commented 2 years ago

@zero323 do we need the empty file python/delta/py.typed ?

Good catch @scottsand-db! Yes, we definitely do. That's what makes package recognized as supporting type hints (PEP 561). Which reminded me that it should be mentioned in setup (added 267b6be)

scottsand-db commented 2 years ago

Hi @@zero323 , thanks for all of this work. I'm in the process of QAing and merging this ticket.

Can you please provide me with a tangible example of where + how this PR improves python typed annotations? and how you tested that (e.g. in some file X, using IntelliJ/vscode, Y is now visible, whereas Y was not visible before this change)?

I appreciate the help - not much of a python expert myself.

zero323 commented 2 years ago

@scottsand-db

Can you please provide me with a tangible example of where + how this PR improves python typed annotations? and how you tested that?

I appreciate the help - not much of a python expert myself.

A good starting point to understand the impact is to see corresponding Spark JIRA ticket. To summarize, there three main aspects:

Precise documentation. Expected input and output types are clearly defined. Furthermore, we can describe possible relations between arguments. For delta, the most obvious example are overloads of functions with counter-intuitive signatures. For example, take a look at whenMatchedUpdate.

Because of this design decision

https://github.com/delta-io/delta/blob/267b6be43b206924165baed3e51660f1379c28db/python/delta/tables.py#L714-L715

The original signature

https://github.com/delta-io/delta/blob/0f30f6f47761cb01043d84aad182f30cb59b5e31/python/delta/tables.py#L596

is misleading, suggesting that is OK to omit set. In contrast, annotated code clearly indicates that set is mandatory:

https://github.com/delta-io/delta/blob/267b6be43b206924165baed3e51660f1379c28db/python/delta/tables.py#L688-L704
Ability to statically detect many code issues.

This is already visible in the delta codebase, where mypy can detect incorrect calls within tests intended to fail ‒ hence all the ignores, i.e.

https://github.com/delta-io/delta/blob/267b6be43b206924165baed3e51660f1379c28db/python/delta/tests/test_deltatable.py#L107-L126

We focus on mypy here, as the standard Python type checker, but depending on the environment (IDEs can come with their own type checkers providing different set of features) and type checker (there quite a few) your experience might be different.
Finally, type hints enable better syntax completion. Again, specific behavior depends on your environment (@ueshin and @xinrong-databricks might be able to answer how this interacts with Databricks notebooks), but here is a little demo how experience changes in IPython (which depends on awesome jedi library by davidhalter ‒ common choice for different projects) between 0f30f6f47761cb01043d84aad182f30cb59b5e31 and 267b6be43b206924165baed3e51660f1379c28db

scottsand-db commented 2 years ago

Thank you!

zero323 commented 2 years ago

Thanks all!

delta-io / delta

Add Python type annotations #305