apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python] Type checking support #32609

Open asfimport opened 1 year ago

asfimport commented 1 year ago

mypy and static type checking

As of Python 3.6, it has been possible to add typing information to code. This became immensely popular in a short period of time. Shortly after, the tool mypy arrived, and it has become the industry standard for static type checking in Python. It checks for invalid types quickly enough to run as a pre-commit hook. It has surfaced many bugs that I did not see myself and has been a very valuable tool.

Now what does this mean for PyArrow?

When you run mypy on code that uses PyArrow, you get error messages like the following:

some_util_using_pyarrow/hdfs_utils.py:5: error: Skipping analyzing "pyarrow": module is installed, but missing library stubs or py.typed marker
some_util_using_pyarrow/hdfs_utils.py:9: error: Skipping analyzing "pyarrow": module is installed, but missing library stubs or py.typed marker
some_util_using_pyarrow/hdfs_utils.py:11: error: Skipping analyzing "pyarrow.fs": module is installed, but missing library stubs or py.typed marker

More information is available here: https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-library-stubs-or-py-typed-marker

You can solve this in three ways:

  1. Ignore the message. This, however, turns every PyArrow type into Any, so mypy can no longer find user errors in code that calls PyArrow.
  2. Create a Python stub file. This used to be the standard, but it is no longer a popular option, because stubs are extra files that live next to the source code, whereas you can also put the type hints inline in the code. That brings me to the third option.
  3. Create a py.typed file and use inline type hints. This is the most popular option today because it requires no extra files (except for the py.typed file), keeps all the type hints with the code (as the documentation does now), and gives type hints not only to your users but also to the developers of the library themselves (including hints about issues inside your IDE).
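
To make options 2 and 3 concrete, here is a minimal sketch of what a stub would hold. The class and method are hypothetical stand-ins, not actual PyArrow signatures. A stub file (.pyi) is ordinary Python syntax with `...` in place of bodies and default values, so the same text also runs as plain Python, which is what this sketch relies on:

```python
from typing import Any, Dict, List, get_type_hints


# Hypothetical stub content, e.g. what a file such as `fs.pyi` might hold.
# Bodies and default values are elided with `...`; only the types matter.
class FileSystem:
    def ls(self, path: str, detail: bool = ...) -> List[Dict[str, Any]]: ...


# The annotations are plain metadata that a type checker (or this script) can read.
hints = get_type_hints(FileSystem.ls)
print(hints["path"], hints["detail"])
```

With option 3 the same signature would sit directly on the implementation, and the empty py.typed marker file in the package tells type checkers that the package ships type information.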

     

My personal opinion already shines through in the options: it is option 3, as this has quickly become the industry standard since its introduction.

What should we do?

I'd very much like to work on this; however, I don't want to waste time. Therefore, I am raising this ticket to see whether this has been considered before or whether we just haven't gotten to it yet.

I'd like to open the discussion here:

  1. Do you agree with option 3 (inline type hints)?
  2. Should we remove the type annotations from the documentation, given that they will be inside the function signatures? Or should we keep them and also specify them in the code, which would duplicate them?

     

Reporter: Jorrick Sleijster

Note: This issue was originally created as ARROW-17335. Please see the migration documentation for further details.

asfimport commented 1 year ago

Antoine Pitrou / @pitrou: Much of our code is in Cython. If you put the type annotations inline, any change or addition to typing info will require a recompile.

(also I'm assuming that Cython is compatible with type annotations, but I'm not 100% sure)

asfimport commented 1 year ago

Antoine Pitrou / @pitrou: cc @xhochy @jorisvandenbossche

asfimport commented 1 year ago

Jorrick Sleijster: Hi @pitrou,

Thanks for reaching out. I checked out the code indeed and I have to say, it's super clean and I was overwhelmed.

I have not worked with Cython code bases before, but if we only added type annotations to the \*.py files, would that also require a recompilation?

 

Let's take this line for example: https://github.com/apache/arrow/blob/939195183657daa2060970b6fcd1938eab53d44b/python/pyarrow/hdfs.py#L96


def ls(self, path, detail=False):

That would become


def ls(self, path: str, detail: bool = False) -> List[Dict[str, Any]]:

Would that change require recompilation?

asfimport commented 1 year ago

Antoine Pitrou / @pitrou: No, that wouldn't require recompilation, but some APIs are implemented in Cython.

asfimport commented 1 year ago

David Li / @lidavidm: Arguably also duplicates ARROW-8175

asfimport commented 1 year ago

David Li / @lidavidm: And https://github.com/cython/cython/pull/3818 is relevant

asfimport commented 1 year ago

Jorrick Sleijster: Well it's not really a duplicate of ARROW-8175.

The difference is that that ticket focuses on type checking the PyArrow code base itself and ensuring all the types are valid inside the library.

My ticket is about using the PyArrow code base as a library and ensuring we can type check projects that are using PyArrow by using type annotations on functions specified inside the PyArrow codebase.

 

I think PySpark 3.2.2 was a nice example of having stubs: https://github.com/apache/spark/tree/v3.2.2/python/pyspark

I'm pretty sure they created them manually though (and note: these are Java bindings rather than C, but I don't think that makes much difference in terms of stubs).

However, they ditched the pyi files in their latest version. I think this is because a much larger share of their code is in Python rather than Java.

asfimport commented 1 year ago

Jorrick Sleijster: I suppose we can just give it a start and see how far we get?

asfimport commented 1 year ago

Antoine Pitrou / @pitrou: Yes, we can probably give it a start. Probably start with the most basic utilities, for example in memory.pxi.

asfimport commented 1 year ago

Joris Van den Bossche / @jorisvandenbossche: AFAIK it is not (yet) possible to do inline type annotations in cython code (for type checking purposes, see the links in https://github.com/apache/arrow/pull/6676 as well), so I think that basically means we need to use the stub file approach?

(but I certainly agree it's fine to give this a go with a small subset, see how that looks, and discuss further from there)

asfimport commented 1 year ago

Joris Van den Bossche / @jorisvandenbossche:

> Well it's not really a duplicate of ARROW-8175.
>
> The difference is that that ticket focuses on type checking the PyArrow code base itself and ensuring all the types are valid inside the library.
>
> My ticket is about using the PyArrow code base as a library and ensuring we can type check projects that are using PyArrow by using type annotations on functions specified inside the PyArrow codebase.

It's indeed not exactly the same. But in practice, I think both aspects are very much related and we could (should?) do them at the same time. If we start adding type annotations so that pyarrow can be used by other projects that are type-checked, it would be good to also check that the annotations we are adding are correct. (Although, based on my limited experience with this, just running mypy on the code base is always a bit limited, I suppose, as it doesn't guarantee that the annotations are actually correct; it might only find some incorrect ones.)

asfimport commented 1 year ago

Jorrick Sleijster: I think you make a good point Joris but as you mention, I don't think we can use inline type annotations :'(. Therefore, we'd have to use generated stubs, which we can't use for checking whether the underlying code actually has the right types.

I think we will therefore have to wait (or take action ourselves upstream) until mypy or cython implements decent support for Python stub generation.

Hence, I think it's better to treat them separately for now and start off with stub generation, which can later be replaced by a better implementation once one is available.

asfimport commented 1 year ago

Joris Van den Bossche / @jorisvandenbossche: Mypy doesn't use the pyi files when, e.g., running mypy pyarrow?

asfimport commented 1 year ago

Jorrick Sleijster: It does. In fact, the pyi files are the Python interfaces / Python stubs described here: https://mypy.readthedocs.io/en/stable/stubs.html.

As far as I can see from this page, these are only used when an external project checks its types against this project, not for checking the project internally.

asfimport commented 1 year ago

Uwe Korn / @xhochy: My initial efforts with regard to typing the code base stopped because inline type annotations (and their automatic extraction into pyi files) are the crucial component here. All the important data structures of pyarrow are implemented in Cython; only a very small fraction of the externally visible API is in plain Python. Thus, as long as the referenced Cython issue isn't solved, I don't think it makes sense to progress on the Arrow side.

asfimport commented 1 year ago

Jorrick Sleijster: Yes I guess it will also be a lot of work to keep it up to date over time if we do major refactors.

Looking at other projects, this seems like a very cumbersome and tedious thing to work on.

For example, this one took a very long time to get right as well: https://github.com/pysam-developers/pysam/pull/1008

I suppose the best option is to postpone this until https://github.com/python/mypy/issues/7542 is implemented, then.

asfimport commented 1 year ago

Joris Van den Bossche / @jorisvandenbossche: FYI pandas does use pyi files next to cython files that have to be kept updated manually if the cython file changes. But, in pandas the cython code is mostly for internal functionality, not for public facing classes and methods.

wjones127 commented 1 year ago

Not planning on looking at this soon. But if someone does, one approach would be to use this branch of stubgen (https://github.com/python/mypy/pull/13284) so that the output includes the docstrings, and then fill in the types based on them. I've heard from @mariusvniekerk that this workflow helped when he generated type stubs for our Flight submodule for his own project.

On detecting regressions in type stubs for Cython, I think we should be able to catch these as long as we are running type checking on our test suites.

Fokko commented 1 year ago

Great to see the discussion here. I would be in favor of having type annotations. It will make the code more robust, and also helps the user to see what arguments can be passed in.

I'm working on https://github.com/apache/arrow/pull/33974 and figured that type checking will also help to make sure that the docstrings are up to date (if you update a type, you should update the docstring as well).

adriangb commented 1 year ago

Working with the pyarrow library, this really sticks out like a sore thumb. Good type stubs and docstrings are IMO just as valuable as API documentation (which the project does very well) because you don't need to leave your IDE and open 10 tabs to find the information.

A couple thoughts:

adriangb commented 2 months ago

@jorisvandenbossche I'm sorry to ping you about this but since we've interacted before you felt like the least worst person to personally annoy. Could we get some thoughts from the team on this? I really think it would not be that hard to get started and improve the typing stubs over time. Based on the 👍🏻 in my comment above there's a good amount of interest and it's easy pickings for 3rd party contributors.

wjones127 commented 2 months ago

I think a good place to start would be to choose a submodule with a decently limited API (maybe fs or flight? Or maybe just data types?) and start there. I think the things we want to demonstrate are:

  1. How much work is it to do? Once we have one example PR of how to add type stubs, that becomes the template for others to make contributions to them. These can become an organized list of good-first-issues that we can crowdsource contributions for to get good coverage quickly, after we have figured out the initial approach.
  2. Do we have a reliable way to validate them in CI? We would want to make sure the stubs are both accurate and cover the whole module.

adriangb commented 2 months ago

> How much work is it to do? Once we have one example PR of how to add type stubs, that becomes the template for others to make contributions to them

I'm happy to try to make an initial PR. My approach would be to add a py.typed file and add a .pyi file for some module (happy to do whichever is chosen).

> Do we have a reliable way to validate them in CI? We would want to make sure the stubs are both accurate and cover the whole module.

IMO the best way is to force typing into tests. That's also good dogfooding. But it can result in a lot of code churn (need to rewrite a lot of the tests I imagine). Type checkers do allow whitelisting files so it would probably make sense to pick a very small test file to whitelist and either fix all stubs it needs or add # type: ignores. That said, I do think something is better than nothing and these stubs won't impact runtime behavior so even if they are not totally accurate or are missing module members it's still better than no typing at all.
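
The "cover the whole module" half of that CI question can be sketched with the standard library alone. This is only an illustration: the `demo` module and its stub text are made up, and the check compares nothing but top-level public names. mypy ships a `stubtest` tool that does a far more thorough comparison of stubs against the runtime, so a real setup would more likely build on that.

```python
import ast
import types


def stub_names(stub_source: str) -> set:
    """Public top-level names declared in a stub's source text."""
    names = set()
    for node in ast.parse(stub_source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            names.add(node.name)
        elif isinstance(node, ast.AnnAssign) and isinstance(node.target, ast.Name):
            names.add(node.target.id)
    return {name for name in names if not name.startswith("_")}


def runtime_names(module: types.ModuleType) -> set:
    """Public names that actually exist on the imported module."""
    return {name for name in vars(module) if not name.startswith("_")}


def missing_from_stub(module: types.ModuleType, stub_source: str) -> list:
    """Names present at runtime that the stub does not declare."""
    return sorted(runtime_names(module) - stub_names(stub_source))


# Hypothetical runtime module standing in for a compiled extension.
demo = types.ModuleType("demo")
demo.open_file = lambda path: path
demo.FileInfo = type("FileInfo", (), {})
demo.DEFAULT_PORT = 8020

# Hypothetical stub text that forgot to declare `open_file`.
DEMO_STUB = """
DEFAULT_PORT: int
class FileInfo: ...
"""

print(missing_from_stub(demo, DEMO_STUB))  # prints ['open_file']
```

A check like this only catches missing members; whether the declared types are accurate still needs a type checker run over real call sites, such as the test files mentioned above.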

ianmcook commented 2 months ago

FWIW: I recently added a method to PySpark to return a DataFrame as a PyArrow Table (https://github.com/apache/spark/pull/45481). Now I'm trying to add support for going in the other direction (https://github.com/apache/spark/pull/46529) but I'm stymied by type checking problems, including the problem described at https://github.com/apache/arrow/issues/24376#issuecomment-1377869384.

paleolimbot commented 2 months ago

I've experimented a little with this in nanoarrow (where I'd at least like to get autocomplete for methods on some objects that have to be implemented in Cython) and found a few things that might be helpful:

I would lean towards some kind of programmatic approach where type hinting is specified in a docstring or something similar, to minimize the pain of keeping .pxi files synced. Off the top of my head, I would probably start with the mypy-generated stubs, parse them into an AST, and do some transformation to add type hints based on some pattern in the docstring or on parsing of the argument list (Cython methods provide the file name and line number of the definition). That is almost certainly a can of worms (but so are any alternatives I know about).
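
A rough sketch of that "parse the stub into an AST and transform it" idea, using only the standard `ast` module (`ast.unparse` needs Python 3.9 or newer). The stub text and the `HINTS` table are hypothetical; in practice the table would be derived from docstrings or argument lists rather than written by hand, and that derivation is the hard part this sketch leaves out.

```python
import ast

# Hypothetical generated stub, before any types have been filled in.
STUB = """
class Array:
    def slice(self, offset=..., length=...): ...
    def to_pylist(self): ...
"""

# Hypothetical type information keyed by "Class.method"; assumed to have been
# recovered from docstrings by some earlier step that is not shown here.
HINTS = {
    "Array.slice": {"offset": "int", "length": "int | None", "return": "Array"},
    "Array.to_pylist": {"return": "list"},
}


class AddHints(ast.NodeTransformer):
    """Attach annotations from a lookup table to the functions in a stub."""

    def __init__(self, hints):
        self.hints = hints
        self.scope = []  # enclosing class names, outermost first

    def visit_ClassDef(self, node):
        self.scope.append(node.name)
        self.generic_visit(node)
        self.scope.pop()
        return node

    def visit_FunctionDef(self, node):
        key = ".".join(self.scope + [node.name])
        hints = self.hints.get(key, {})
        params = node.args.posonlyargs + node.args.args + node.args.kwonlyargs
        for param in params:
            if param.arg in hints:
                # Parse the type text into an expression node.
                param.annotation = ast.parse(hints[param.arg], mode="eval").body
        if "return" in hints:
            node.returns = ast.parse(hints["return"], mode="eval").body
        return node


def annotate_stub(source: str, hints: dict) -> str:
    """Return the stub source with annotations from `hints` added."""
    tree = AddHints(hints).visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


annotated = annotate_stub(STUB, HINTS)
print(annotated)
```

Parameters and functions without an entry in the table are left untouched, so the transformation can be rerun as more hints become available.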

The PR adding very basic mypy stubs to nanoarrow is here: https://github.com/apache/arrow-nanoarrow/pull/468

westonpace commented 1 month ago

> I'm happy to try to make an initial PR. My approach would be to add a py.typed file and add a .pyi file for some module (happy to do whichever is chosen).

@adriangb

Looking through this discussion I don't think anyone is opposed to an initial PR and it would be welcome. Picking a small module would be good to prove out the actual workflow.

I think the next step then is to provide typings for RecordBatch, Table, Schema, Field, and Array. I feel like these are the most essential types and if we had type hints for just these classes I feel like the user experience would be much better.