apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] Type checking support #32609

Open asfimport opened 2 years ago

asfimport commented 2 years ago

mypy and static type checking

As of Python 3.6, it has been possible to add typing information to the code. This became immensely popular in a short period of time. Shortly after, the tool mypy arrived, and it has become the industry standard for static type checking in Python. It checks for invalid types quickly enough to run as a pre-commit hook. It has surfaced many bugs that I did not see myself and has been a very valuable tool.

Now what does this mean for PyArrow?

When you run mypy on code that uses PyArrow, you get error messages like the following:

some_util_using_pyarrow/hdfs_utils.py:5: error: Skipping analyzing "pyarrow": module is installed, but missing library stubs or py.typed marker
some_util_using_pyarrow/hdfs_utils.py:9: error: Skipping analyzing "pyarrow": module is installed, but missing library stubs or py.typed marker
some_util_using_pyarrow/hdfs_utils.py:11: error: Skipping analyzing "pyarrow.fs": module is installed, but missing library stubs or py.typed marker

More information is available here: https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-library-stubs-or-py-typed-marker

You can solve this in three ways:

  1. Ignore the message. This, however, sets all types from PyArrow to Any, so mypy cannot find user errors involving the PyArrow library.
  2. Create a Python stub file. This used to be the standard; however, it is no longer a popular option, because stubs are extra files that live next to the source code, whereas you can also put the type hints inline in the code, which brings me to the third option.
  3. Create a py.typed file and use inline type hints. This is the most popular option today because it requires no extra files (except for the py.typed file), keeps all the type hints with the code (as they are now in the documentation), and provides type hints (and hinting of issues inside the IDE) not only to your users but also to the developers of the library themselves.
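Options 2 and 3 are not mutually exclusive. Under PEP 561, the py.typed marker tells type checkers that a package ships type information, whether that information is inline or in .pyi stubs placed inside the package. A minimal sketch of such a layout, assuming a hypothetical package named examplepkg (all file and function names here are illustrative and not part of PyArrow):

```python
import tempfile
from pathlib import Path
from typing import List


def write_typed_package(root: Path) -> List[str]:
    """Create a hypothetical package that mixes inline hints and a stub file."""
    pkg = root / "examplepkg"
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    # Empty marker file: signals that the package ships type information.
    (pkg / "py.typed").write_text("")
    # Option 3: a pure-Python module with inline type hints.
    (pkg / "inline.py").write_text("def double(x: int) -> int:\n    return 2 * x\n")
    # Option 2: a stub describing a module whose implementation is compiled.
    (pkg / "compiled.pyi").write_text("def triple(x: int) -> int: ...\n")
    return sorted(p.name for p in pkg.iterdir())


with tempfile.TemporaryDirectory() as tmp:
    layout = write_typed_package(Path(tmp))
    print(layout)
```

The stub only declares signatures; the function body is replaced by `...`, so it can describe code that is not written in Python.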

     

My personal opinion already shines through the options: it is option 3, as this has quickly become the industry standard since its introduction.

What should we do?

I'd very much like to work on this; however, I don't want to waste time. Therefore, I am raising this ticket to see whether this has been considered before or whether we just haven't gotten to it yet.

I'd like to open the discussion here:

  1. Do you agree with option 3 (inline type hints)?
  2. Should we remove the type annotations from the documentation, given that they will be inside the functions? Or should we keep them and also specify them in the code, which would duplicate them?

     

Reporter: Jorrick Sleijster

Note: This issue was originally created as ARROW-17335. Please see the migration documentation for further details.

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: Much of our code is in Cython. If you put the type annotations inline, any change or addition to typing info will require a recompile.

(also I'm assuming that Cython is compatible with type annotations, but I'm not 100% sure)

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: cc @xhochy @jorisvandenbossche

asfimport commented 2 years ago

Jorrick Sleijster: Hi @pitrou,

Thanks for reaching out. I checked out the code indeed and I have to say, it's super clean and I was overwhelmed.

I have not worked with Cython code bases before, but if we just changed the *.py files to add type annotations, would that also require a recompilation?

 

Let's take this line for example: https://github.com/apache/arrow/blob/939195183657daa2060970b6fcd1938eab53d44b/python/pyarrow/hdfs.py#L96


def ls(self, path, detail=False):

That would become


def ls(self, path: str, detail: bool = False) -> List[Dict[str, Any]]:

Would that change require recompilation?
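For a pure-Python function, annotations are plain metadata attached to the function object when the module is imported, so no build step is involved. A minimal sketch, assuming a hypothetical stand-in class (HadoopFileSystemSketch is not the real PyArrow class, and the body is a placeholder):

```python
from typing import Any, Dict, List, get_type_hints


class HadoopFileSystemSketch:
    """Hypothetical stand-in used only to illustrate the annotated signature."""

    def ls(self, path: str, detail: bool = False) -> List[Dict[str, Any]]:
        # Placeholder body; the real method would query the file system.
        return [{"name": path, "detail": detail}]


# The annotations are available at runtime without any compilation.
hints = get_type_hints(HadoopFileSystemSketch.ls)
print(hints)
```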

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: No, that wouldn't require recompilation, but some APIs are implemented in Cython.

asfimport commented 2 years ago

David Li / @lidavidm: Arguably also duplicates ARROW-8175 (https://github.com/apache/arrow/issues/24376)

asfimport commented 2 years ago

David Li / @lidavidm: And https://github.com/cython/cython/pull/3818 is relevant

asfimport commented 2 years ago

Jorrick Sleijster: Well it's not really a duplicate of ARROW-8175.

The difference lies in the fact that that ticket is focused on performing type checking on the PyArrow code base and ensuring all the types are valid inside the library.

My ticket is about using the PyArrow code base as a library and ensuring we can type check projects that are using PyArrow by using type annotations on functions specified inside the PyArrow codebase.

 

I think PySpark 3.2.2 was a nice example of having stubs: https://github.com/apache/spark/tree/v3.2.2/python/pyspark

I'm pretty sure they created them manually though (and note: these are Java bindings and not C, but I don't think that makes much difference in terms of stubs).

However, they changed it in their latest version to ditch the pyi files. I think this is because a much larger share of their code is in Python compared to Java.

asfimport commented 2 years ago

Jorrick Sleijster: I suppose we can just give it a start and see how far we get?

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: Yes, we can probably give it a start. Probably start with the most basic utilities, for example in memory.pxi.

asfimport commented 2 years ago

Joris Van den Bossche / @jorisvandenbossche: AFAIK it is not (yet) possible to do inline type annotations in cython code (for type checking purposes, see the links in https://github.com/apache/arrow/pull/6676 as well), so I think that basically means we need to use the stub file approach?

(but I certainly agree it's fine to give this a go with a small subset, see how that looks, and discuss further from there)

asfimport commented 2 years ago

Joris Van den Bossche / @jorisvandenbossche:

Well it's not really a duplicate of ARROW-8175.

The difference lies in the fact that that ticket is focused on performing type checking on the PyArrow code base and ensuring all the types are valid inside the library.

My ticket is about using the PyArrow code base as a library and ensuring we can type check projects that are using PyArrow by using type annotations on functions specified inside the PyArrow codebase.

It's indeed not exactly the same. But in practice, I think both aspects are very much related and we could (should?) do those at the same time. If we start adding type annotations so that pyarrow can be used by other projects that are type-checked, it would be good to also check, at the same time, that the type annotations we are adding are correct. (Although, based on my limited experience with this, just running mypy on the code base is always a bit limited, I suppose, as it doesn't guarantee the annotations are actually correct; it only might find some incorrect ones.)

asfimport commented 2 years ago

Jorrick Sleijster: I think you make a good point Joris but as you mention, I don't think we can use inline type annotations :'(. Therefore, we'd have to use generated stubs, which we can't use for checking whether the underlying code actually has the right types.

I think we will therefore have to wait (or take action ourselves upstream) until mypy or cython implements decent support for Python stub generation.

Hence, I think it's better to treat them separately for now and start off with stub generation, which can then later be replaced by a better implementation once available.

asfimport commented 2 years ago

Joris Van den Bossche / @jorisvandenbossche: Doesn't mypy use pyi files when, e.g., running mypy pyarrow?

asfimport commented 2 years ago

Jorrick Sleijster: It does. In fact, the pyi files are the Python interfaces / Python stubs mentioned here: https://mypy.readthedocs.io/en/stable/stubs.html.

As far as I can see from this page, these are only used when an external project checks its types against this project, and not for checking the project internally.

asfimport commented 2 years ago

Uwe Korn / @xhochy: My initial efforts to type the code base stopped because inline type annotations (and their automatic extraction into pyi) are the crucial component here. All the important data structures of pyarrow are implemented in Cython; only a very small fraction of the externally visible API is in plain Python. Thus, as long as the referenced Cython issue isn't solved, I don't think it makes sense to progress on the Arrow side.

asfimport commented 2 years ago

Jorrick Sleijster: Yes I guess it will also be a lot of work to keep it up to date over time if we do major refactors.

Looking at other projects, this seems like a very cumbersome and tedious thing to work on.

For example, this one took a very long time to get right as well; https://github.com/pysam-developers/pysam/pull/1008

I suppose the best option is to postpone this until https://github.com/python/mypy/issues/7542 is implemented, then.

asfimport commented 2 years ago

Joris Van den Bossche / @jorisvandenbossche: FYI pandas does use pyi files next to cython files that have to be kept updated manually if the cython file changes. But, in pandas the cython code is mostly for internal functionality, not for public facing classes and methods.

wjones127 commented 1 year ago

Not planning on looking at this soon. But if someone does, one approach would be to use this branch of stubgen (https://github.com/python/mypy/pull/13284) so that it includes the docstrings, and then one can fill in the types based on them. I've heard from @mariusvniekerk that this workflow helped when he generated type stubs for our Flight submodule for his own project.

On detecting regressions in type stubs for Cython, I think we should be able to catch these as long as we are running type checking on our test suites.

Fokko commented 1 year ago

Great to see the discussion here. I would be in favor of having type annotations. It will make the code more robust, and also helps the user to see what arguments can be passed in.

I'm working on https://github.com/apache/arrow/pull/33974 and figured that type checking will also help to make sure that the docstrings are up to date (if you update a type, you should update the docstring as well).

adriangb commented 1 year ago

Working with the pyarrow library, this really sticks out like a sore thumb. Good type stubs and docstrings are IMO just as valuable as API documentation (which the project does very well) because you don't need to leave your IDE and open 10 tabs to find the information.

A couple thoughts:

adriangb commented 6 months ago

@jorisvandenbossche I'm sorry to ping you about this but since we've interacted before you felt like the least worst person to personally annoy. Could we get some thoughts from the team on this? I really think it would not be that hard to get started and improve the typing stubs over time. Based on the 👍🏻 in my comment above there's a good amount of interest and it's easy pickings for 3rd party contributors.

wjones127 commented 6 months ago

I think a good place to start would be to choose a submodule with a decently limited API (maybe fs or flight? Or maybe just data types?) and start there. I think the things we want to demonstrate are:

  1. How much work is it to do? Once we have one example PR of how to add type stubs, that becomes the template for others to make contributions to them. These can become an organized list of good-first-issues that we can crowdsource contributions for to get good coverage quickly, after we have figured out the initial approach.
  2. Do we have a reliable way to validate them in CI? We would want to make sure the stubs are both accurate and cover the whole module.

adriangb commented 6 months ago

How much work is it to do? Once we have one example PR of how to add type stubs, that becomes the template for others to make contributions to them

I'm happy to try to make an initial PR. My approach would be to add a py.typed file and add a .pyi file for some module (happy to do whichever is chosen).

Do we have a reliable way to validate them in CI? We would want to make sure the stubs are both accurate and cover the whole module.

IMO the best way is to force typing into tests. That's also good dogfooding. But it can result in a lot of code churn (need to rewrite a lot of the tests I imagine). Type checkers do allow whitelisting files so it would probably make sense to pick a very small test file to whitelist and either fix all stubs it needs or add # type: ignores. That said, I do think something is better than nothing and these stubs won't impact runtime behavior so even if they are not totally accurate or are missing module members it's still better than no typing at all.
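The "cover the whole module" half of the CI question can be approximated without a type checker by comparing the public names in a stub against the names present at runtime. The following is only a sketch under stated assumptions: fake_fs and its functions are hypothetical stand-ins, and the check looks at names only, not at signatures. mypy ships a stubtest tool that performs a far more thorough version of this comparison.

```python
import ast
import types
from typing import Set, Tuple


def public_names_in_stub(stub_source: str) -> Set[str]:
    """Collect public top-level functions, classes and annotated variables."""
    names = set()
    for node in ast.parse(stub_source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            names.add(node.name)
        elif isinstance(node, ast.AnnAssign) and isinstance(node.target, ast.Name):
            names.add(node.target.id)
    return {name for name in names if not name.startswith("_")}


def stub_coverage_gaps(module: types.ModuleType, stub_source: str) -> Tuple[Set[str], Set[str]]:
    """Return (names missing from the stub, names in the stub but not at runtime)."""
    runtime = {name for name in vars(module) if not name.startswith("_")}
    stubbed = public_names_in_stub(stub_source)
    return runtime - stubbed, stubbed - runtime


# Hypothetical stand-in for a compiled module.
def _copy_files(source, destination):
    pass


def _delete_file(path):
    pass


fake_fs = types.ModuleType("fake_fs")
fake_fs.copy_files = _copy_files
fake_fs.delete_file = _delete_file

# Hypothetical stub that has drifted from the runtime module.
STUB_SOURCE = (
    "def copy_files(source: str, destination: str) -> None: ...\n"
    "def move_file(src: str, dest: str) -> None: ...\n"
)

missing, stale = stub_coverage_gaps(fake_fs, STUB_SOURCE)
print("missing from stub:", missing)
print("stale in stub:", stale)
```

A check like this catches drift in names but says nothing about whether the annotated types are right, which is where type-checking the test suite comes in.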

ianmcook commented 6 months ago

FWIW: I recently added a method to PySpark to return a DataFrame as a PyArrow Table (https://github.com/apache/spark/pull/45481). Now I'm trying to add support for going in the other direction (https://github.com/apache/spark/pull/46529) but I'm stymied by type checking problems, including the problem described at https://github.com/apache/arrow/issues/24376#issuecomment-1377869384.

paleolimbot commented 6 months ago

I've experimented a little with this in nanoarrow (where I'd at least like to get autocomplete for methods on some objects that have to be implemented in Cython) and found a few things that might be helpful:

I would lean towards some kind of programmatic approach where type hinting is specified in a docstring or something similar, to minimize the pain of keeping .pxi files synced. Off the top of my head, I would probably start with the mypy-generated stubs, parse them into an AST, and apply some transformation to add type hints based on some pattern in the docstring or on parsing of the argument list (Cython methods provide the file number and line number of the definition). That is almost certainly a can of worms (but so are any alternatives I know about).
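A rough sketch of that idea, assuming numpydoc-style docstrings where each parameter line reads `name : type` and the return type is the first line under a "Returns" heading. The stub text, the allocate_buffer function, and the regular expression are all illustrative assumptions; real docstrings are far less regular than this.

```python
import ast
import re

# Hypothetical generated stub: the signature is untyped but the docstring is kept.
STUB_SOURCE = '''
def allocate_buffer(size, resizable=False):
    """
    Allocate a buffer (hypothetical docstring).

    Parameters
    ----------
    size : int
    resizable : bool

    Returns
    -------
    Buffer
    """
'''

# Matches lines such as "size : int"; only single-word types are handled.
PARAM_LINE = re.compile(r"^\s*(\w+)\s*:\s*(\w+)\s*$")


def annotate_from_docstrings(stub_source: str) -> str:
    """Copy simple docstring types into the function annotations."""
    tree = ast.parse(stub_source)
    for node in ast.walk(tree):
        if not isinstance(node, ast.FunctionDef):
            continue
        doc_lines = (ast.get_docstring(node) or "").splitlines()
        documented = {}
        for line in doc_lines:
            match = PARAM_LINE.match(line)
            if match:
                documented[match.group(1)] = match.group(2)
        for arg in node.args.args:
            if arg.annotation is None and arg.arg in documented:
                arg.annotation = ast.Name(id=documented[arg.arg], ctx=ast.Load())
        if node.returns is None and "Returns" in doc_lines:
            # Skip the heading and its underline to reach the type line.
            index = doc_lines.index("Returns") + 2
            if index < len(doc_lines):
                node.returns = ast.Name(id=doc_lines[index].strip(), ctx=ast.Load())
    return ast.unparse(tree)  # requires Python 3.9 or newer


annotated_stub = annotate_from_docstrings(STUB_SOURCE)
print(annotated_stub)
```

Anything the pattern does not recognise is left unannotated, so the output can still be filled in or corrected by hand.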

The PR adding very basic mypy stubs to nanoarrow is here: https://github.com/apache/arrow-nanoarrow/pull/468

westonpace commented 5 months ago

I'm happy to try to make an initial PR. My approach would be to add a py.typed file and add a .pyi file for some module (happy to do whichever is chosen).

@adriangb

Looking through this discussion I don't think anyone is opposed to an initial PR and it would be welcome. Picking a small module would be good to prove out the actual workflow.

I think the next step then is to provide typings for RecordBatch, Table, Schema, Field, and Array. I feel like these are the most essential types and if we had type hints for just these classes I feel like the user experience would be much better.

darkclouder commented 4 months ago

Seems like someone already did some groundwork a few years ago: https://github.com/zen-xu/pyarrow-stubs

kylebarron commented 4 months ago

2. Do we have a reliable way to validate them in CI?

I've used https://github.com/typeddjango/pytest-mypy-plugins before and it works rather well.

zen-xu commented 2 months ago

In the past few days, I rewrote pyarrow-stubs instead of generating them through stubgen

jorisvandenbossche commented 2 months ago

@zen-xu thanks a lot for providing that package!

[Adrian] I think this is a case of something is better than nothing.

Yes, that's a good reminder. But just to be sure, my understanding from reading https://typing.readthedocs.io/en/latest/spec/distributing.html#import-resolution-ordering is that if we start adding some type hints gradually, add a py.typed file right away, and release the next pyarrow with partial type hints, this will not override the pyarrow-stubs package (i.e. someone who wants the more complete type hints from there can still get them by just installing that package). Right?

But if some type checker would vendor those (although no idea if that happens, and pyarrow is not in https://github.com/python/typeshed or https://github.com/microsoft/python-type-stubs), that would no longer get picked up?
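As far as I can tell from the linked ordering, a separately installed stub package (a distribution named `<package>-stubs`) is consulted before a package that carries its own py.typed marker. The sketch below models only those two steps and ignores the rest of the ordering (manually configured paths, user code, typeshed). The function name resolve_type_source and the temporary directory layout are assumptions made for illustration; this is not how any particular type checker is implemented.

```python
import tempfile
from pathlib import Path
from typing import Optional


def resolve_type_source(site_packages: Path, name: str) -> Optional[str]:
    """Simplified two-step lookup: stub package first, then py.typed package."""
    stub_package = site_packages / f"{name}-stubs"
    if stub_package.is_dir():
        return stub_package.name
    inline_package = site_packages / name
    if (inline_package / "py.typed").is_file():
        return inline_package.name
    return None


with tempfile.TemporaryDirectory() as tmp:
    site = Path(tmp)  # stand-in for a site-packages directory

    # A release that ships partial hints plus a py.typed marker.
    (site / "pyarrow").mkdir()
    (site / "pyarrow" / "py.typed").write_text("")
    only_inline = resolve_type_source(site, "pyarrow")

    # The user additionally installs the separate stub package.
    (site / "pyarrow-stubs").mkdir()
    (site / "pyarrow-stubs" / "__init__.pyi").write_text("")
    with_stubs = resolve_type_source(site, "pyarrow")

print(only_inline, with_stubs)
```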

[Weston] Looking through this discussion I don't think anyone is opposed to an initial PR and it would be welcome. Picking a small module would be good to prove out the actual workflow.

Yes, agreed, and I will try to do such an initial PR next week to get things started.
I assume for pure python files, we will just want to use inline annotations, and so that is the easiest place to get started?

And I also assume that the priority would be to add return types (so that at least the use case of autocomplete in IDEs would work). Is that a correct analysis?

However, the most relevant parts are of course in Cython. For those we need to add .pyi stub files. And unfortunately, it does not seem that the situation has improved much over the last two years with regard to generating them from the Cython sources (the Cython branch is stale, the PR in mypy was closed, and I also tried out a few others, none of which produced something decent out of the box). I will think about some options in a next comment.

jorisvandenbossche commented 2 months ago

The end goal should be that we have rather complete type annotations for the users of pyarrow, i.e. including the large part of pyarrow that is implemented in Cython. Thinking through some options for how to get there, and how to maintain and distribute those type stubs:

Thoughts on this? Preferences? Other options you can think of?
(as someone new to this part of Python, I do appreciate feedback, and there is a good chance that I got some things quite wrong!)


Personally, I think that if we have a decent solution for auto-generation that produces "good enough" stubs, that would be my preference. I have been looking into some of the existing (partial / abandoned) solutions, and I do think we can get something working in the short term. For example, I think it should be possible to get mypy's stubgen to recognize Cython's cyfunction as a normal function with minimal patches. Otherwise, a similar approach to mpi4py's could also work. Or the suggestion of adding inline comments with type hints and a script to extract those. And longer term, we could look into reviving the PR in the Cython repo to generate stub files.
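To make the auto-generation idea concrete, here is a minimal sketch of runtime-introspection stub generation. It assumes the functions of interest expose a signature to inspect.signature, and it uses a hypothetical pure-Python stand-in module (fake_lib) rather than a compiled one. The output is untyped, so it is only a starting point for filling in types.

```python
import inspect
import types


def generate_untyped_stub(module: types.ModuleType) -> str:
    """Emit one stub line per public routine found in the module at runtime."""
    lines = []
    for name, obj in sorted(vars(module).items()):
        if name.startswith("_") or not inspect.isroutine(obj):
            continue
        try:
            signature = str(inspect.signature(obj))
        except (TypeError, ValueError):
            # Some compiled callables expose no signature; fall back to a catch-all.
            signature = "(*args, **kwargs)"
        lines.append(f"def {name}{signature}: ...")
    return "\n".join(lines)


# Hypothetical stand-in for a compiled extension module.
def allocate_buffer(size, resizable=False):
    return bytearray(size)


def total_allocated_bytes():
    return 0


fake_lib = types.ModuleType("fake_lib")
fake_lib.allocate_buffer = allocate_buffer
fake_lib.total_allocated_bytes = total_allocated_bytes

stub_text = generate_untyped_stub(fake_lib)
print(stub_text)
```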

zen-xu commented 2 months ago

Thank you for your recognition. I accidentally discovered that the package I created a long time ago in my free time could be helpful to you, and I’m happy to contribute code to your project.

zen-xu commented 2 months ago

Additionally, if pyarrow needs to maintain its own type annotations, I recommend using the wrapper pattern, as .pyi files are limited in that they cannot include documentation.

For example, if there is a function add in _lib.so, we can define a function with the same name in lib.py, add type annotations and documentation to it, and have it call _lib.add.
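A minimal sketch of that wrapper pattern. Since no compiled `_lib` module is available here, the standard library's operator.add (a built-in implemented in C) stands in for `_lib.add`; that substitution is an assumption made purely for illustration.

```python
import operator  # stand-in for the compiled extension module (_lib)
from typing import get_type_hints


def add(left: int, right: int) -> int:
    """Return the sum of ``left`` and ``right``.

    Thin Python wrapper: the annotations and this docstring live in the
    .py file, while the work is delegated to the compiled function.
    """
    return operator.add(left, right)


print(add(2, 3))
print(get_type_hints(add))
```

The trade-off is one extra Python-level call per invocation, in exchange for annotations and docstrings that both type checkers and runtime tools such as help() can see.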

kylebarron commented 2 months ago

.pyi files are limited in that they cannot include documentation.

.pyi files can include documentation but not all tools will look for documentation in .pyi files. In particular, .pyi files will not set __doc__ on objects. But tools like vscode and https://github.com/mkdocstrings/python will use docstrings from .pyi files.
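The distinction can be illustrated with a short sketch: a tool that reads the stub statically can see its docstring, while the runtime object is unaffected by the stub. The stub text and the add function below are hypothetical.

```python
import ast

# Hypothetical stub content; a docstring is a valid function body in a .pyi file.
STUB_SOURCE = '''
def add(left: int, right: int) -> int:
    """Return the sum of left and right."""
'''


# Stand-in for the runtime implementation, which carries no docstring.
def add(left, right):
    return left + right


# Static tools can read the docstring straight from the stub's syntax tree.
stub_function = ast.parse(STUB_SOURCE).body[0]
static_doc = ast.get_docstring(stub_function)

# The stub is never imported, so it cannot populate __doc__ at runtime.
runtime_doc = add.__doc__

print(static_doc)
print(runtime_doc)
```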