Open asfimport opened 2 years ago
Antoine Pitrou / @pitrou: Much of our code is in Cython. If you put the type annotations inline, any change or addition to typing info will require a recompile.
(also I'm assuming that Cython is compatible with type annotations, but I'm not 100% sure)
Antoine Pitrou / @pitrou: cc @xhochy @jorisvandenbossche
Jorrick Sleijster: Hi @pitrou,
Thanks for reaching out. I did check out the code and I have to say, it's super clean; I was impressed.
I have not worked with Cython code bases before, but if we just changed the `*.py` files with type annotations, would that also require a recompilation?
Let's take this line for example: https://github.com/apache/arrow/blob/939195183657daa2060970b6fcd1938eab53d44b/python/pyarrow/hdfs.py#L96
def ls(self, path, detail=False):
That would become
def ls(self, path: str, detail: bool = False) -> List[Dict[str, Any]]:
Would that change require recompilation?
Antoine Pitrou / @pitrou: No, that wouldn't require recompilation, but some APIs are implemented in Cython.
David Li / @lidavidm: Arguably also duplicates ARROW-8175 (https://github.com/apache/arrow/issues/24376)
David Li / @lidavidm: And https://github.com/cython/cython/pull/3818 is relevant
Jorrick Sleijster: Well, it's not really a duplicate of ARROW-8175.
The difference is that that ticket is focused on performing type checking on the PyArrow code base itself and ensuring all the types are valid inside the library.
My ticket is about using PyArrow as a library: ensuring that projects which use PyArrow can be type checked, via type annotations on the functions inside the PyArrow codebase.
I think PySpark 3.2.2 was a nice example of having stubs: https://github.com/apache/spark/tree/v3.2.2/python/pyspark
I'm pretty sure they created them manually though (and note: these are Java bindings rather than C, but I don't think that makes much difference in terms of stubs).
However, they ditched the pyi files in their latest version. I think this is because a much larger percentage of their code is in Python compared to Java.
Jorrick Sleijster: I suppose we can just give it a start and see how far we get?
Antoine Pitrou / @pitrou: Yes, we can probably give it a start. Probably start with the most basic utilities, for example in memory.pxi.
Joris Van den Bossche / @jorisvandenbossche: AFAIK it is not (yet) possible to do inline type annotations in cython code (for type checking purposes, see the links in https://github.com/apache/arrow/pull/6676 as well), so I think that basically means we need to use the stub file approach?
(but I certainly agree it's fine to give this a go with a small subset, see how that looks, and discuss further from there)
Joris Van den Bossche / @jorisvandenbossche:
[Jorrick] Well, it's not really a duplicate of ARROW-8175. The difference is that that ticket is focused on performing type checking on the PyArrow code base itself, while my ticket is about ensuring that projects which use PyArrow can be type checked, via type annotations on the functions inside the PyArrow codebase.
It's indeed not exactly the same. But in practice, I think both aspects are closely related and we could (should?) tackle them at the same time. If we start adding type annotations so that pyarrow can be used by other projects that are type-checked, it would be good to also verify at the same time that the annotations we are adding are correct. Although, based on my limited experience with this, just running mypy on the code base is always a bit limited, I suppose: it doesn't guarantee the annotations are actually correct, it only might find some incorrect ones.
Jorrick Sleijster: I think you make a good point Joris, but as you mention, I don't think we can use inline type annotations :'(. Therefore, we'd have to use generated stubs, which we can't use for checking whether the underlying code actually has the right types.
I think we will therefore have to wait (or take action ourselves upstream) until mypy or cython implements decent support for Python stub generation.
Hence, I think it's better to treat them separately for now and start off with stub generation, which can later be replaced by a better implementation once available.
Joris Van den Bossche / @jorisvandenbossche: Mypy doesn't use pyi files when e.g. doing `mypy pyarrow`?
Jorrick Sleijster: It does; in fact, the pyi files are the Python interfaces / Python stubs mentioned here: https://mypy.readthedocs.io/en/stable/stubs.html.
As far as I can see from that page, these are only used by external projects checking their types against this project, not for checking the project internally.
Uwe Korn / @xhochy: My initial efforts with regards to typing the code base stopped as the inline type annotations (and their automatic extraction into pyi) is the crucial component here. All the important data structures of pyarrow are implemented in Cython, only a very small fraction of the externally visible API is in plain Python. Thus as long as the referenced Cython issue isn't solved, I don't think it makes sense to progress on the Arrow side.
Jorrick Sleijster: Yes, I guess it will also be a lot of work to keep it up to date over time if we do major refactors.
Looking at other projects, this seems like a very cumbersome and tedious thing to work on.
For example, this one took a very long time to get right as well: https://github.com/pysam-developers/pysam/pull/1008
I suppose the best option is to postpone until https://github.com/python/mypy/issues/7542 is implemented then.
Joris Van den Bossche / @jorisvandenbossche: FYI pandas does use pyi files next to cython files that have to be kept updated manually if the cython file changes. But, in pandas the cython code is mostly for internal functionality, not for public facing classes and methods.
Not planning on looking at this soon. But if someone does, one approach would be to use this branch of stubgen (https://github.com/python/mypy/pull/13284) so that includes the docstrings, and then one can fill in the types based on them. I've heard from @mariusvniekerk that this workflow has helped when he generated type stubs for our Flight submodule for his own project.
On detecting regressions in type stubs for Cython, I think we should be able to catch these as long as we are running type checking on our test suites.
Great to see the discussion here. I would be in favor of having type annotations. It will make the code more robust, and also helps the user to see what arguments can be passed in.
I'm working on https://github.com/apache/arrow/pull/33974 and figured that also type checking will help to make sure that the docstrings are up to date (if you update a type, you should update the docstring as well).
Working with the pyarrow library, this really sticks out like a sore thumb. Good type stubs and docstrings are IMO just as valuable as API documentation (which the project does very well) because you don't need to leave your IDE and open 10 tabs to find the information.
A couple of thoughts:
@jorisvandenbossche I'm sorry to ping you about this but since we've interacted before you felt like the least worst person to personally annoy. Could we get some thoughts from the team on this? I really think it would not be that hard to get started and improve the typing stubs over time. Based on the 👍🏻 in my comment above there's a good amount of interest and it's easy pickings for 3rd party contributors.
I think a good place to start would be to choose a submodule with a decently limited API (maybe `fs` or `flight`? Or maybe just data types?) and start there. I think the things we want to demonstrate are:
How much work is it to do? Once we have one example PR of how to add type stubs, that becomes the template for others to make contributions to them, and good-first-issues that we can crowdsource contributions for to get good coverage quickly, after we have figured out the initial approach.
I'm happy to try to make an initial PR. My approach would be to add a `py.typed` file and a `.pyi` file for some module (happy to do whichever is chosen).
Do we have a reliable way to validate them in CI? We would want to make sure the stubs are both accurate and cover the whole module.
IMO the best way is to force typing into tests. That's also good dogfooding, but it can result in a lot of code churn (I imagine a lot of the tests would need rewriting). Type checkers do allow whitelisting files, so it would probably make sense to pick a very small test file to whitelist and either fix all the stubs it needs or add `# type: ignore`s. That said, I do think something is better than nothing, and these stubs won't impact runtime behavior, so even if they are not totally accurate or are missing module members, it's still better than no typing at all.
FWIW: I recently added a method to PySpark to return a DataFrame as a PyArrow Table (https://github.com/apache/spark/pull/45481). Now I'm trying to add support for going in the other direction (https://github.com/apache/spark/pull/46529) but I'm stymied by type checking problems, including the problem described at https://github.com/apache/arrow/issues/24376#issuecomment-1377869384.
I've experimented a little with this in nanoarrow (where I'd at least like to get autocomplete for methods on some objects that have to be implemented in Cython) and found a few things that might be helpful:
One of those was getting autocomplete for return types (e.g. `def int32() -> DataType`). I would lean towards some kind of programmatic approach where type hinting is specified in a docstring or something, to minimize the pain of keeping .pxi files synced. Off the top of my head, I would probably start with the mypy-generated stubs, parse them into an AST, and do some transformation to add type hints based on some pattern in the docstring or on parsing of the argument list (Cython methods provide the file name and line number of the definition). That is almost certainly a can of worms (but so are any alternatives I know about).
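The AST-transformation idea above could look roughly like the following toy sketch (this is not nanoarrow or pyarrow code; the `RETURN_TYPES` mapping and `annotate_stubs` helper are hypothetical stand-ins for type information that would actually be parsed out of docstrings or argument lists):

```python
# Toy sketch of the AST-transformation approach: parse generated stub
# source, then fill in return annotations from a (hypothetical) mapping.
import ast

RETURN_TYPES = {"int32": "DataType"}  # stand-in for docstring-derived info


def annotate_stubs(stub_source: str) -> str:
    """Parse stub source and attach known return type annotations."""
    tree = ast.parse(stub_source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name in RETURN_TYPES:
            # Attach a return annotation, e.g. ``-> DataType``.
            node.returns = ast.Name(id=RETURN_TYPES[node.name], ctx=ast.Load())
    return ast.unparse(tree)


print(annotate_stubs("def int32(): ..."))
```

Real stubgen output would of course be richer, but the same parse/transform/unparse loop applies.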
The PR adding very basic mypy stubs to nanoarrow is here: https://github.com/apache/arrow-nanoarrow/pull/468
[Adrian] I'm happy to try to make an initial PR. My approach would be to add a py.typed file and add a .pyi file for some module (happy to do whichever is chosen).

@adriangb Looking through this discussion, I don't think anyone is opposed to an initial PR, and it would be welcome. Picking a small module would be good to prove out the actual workflow.
I think the next step then is to provide typings for `RecordBatch`, `Table`, `Schema`, `Field`, and `Array`. These feel like the most essential types, and if we had type hints for just these classes, I feel the user experience would be much better.
Seems like someone already did some groundwork a few years ago: https://github.com/zen-xu/pyarrow-stubs
[Adrian] Do we have a reliable way to validate them in CI?
I've used https://github.com/typeddjango/pytest-mypy-plugins before and it works rather well.
In the past few days, I rewrote pyarrow-stubs instead of generating the stubs through `stubgen`.
@zen-xu thanks a lot for providing that package!
[Adrian] I think this is a case of something is better than nothing.
Yes, that's a good reminder. But just to be sure: my understanding from reading https://typing.readthedocs.io/en/latest/spec/distributing.html#import-resolution-ordering is that if we would start to add some type hints gradually, already add a `py.typed` file, and release the next pyarrow with partial type hints, this will not override the `pyarrow-stubs` package (i.e. someone who wants the more complete type hints from there can still get them by just installing that package). Right?
But if some type checker would vendor those stubs (although I have no idea if that happens, and pyarrow is not in https://github.com/python/typeshed or https://github.com/microsoft/python-type-stubs), they would no longer get picked up?
[Weston] Looking through this discussion I don't think anyone is opposed to an initial PR and it would be welcome. Picking a small module would be good to prove out the actual workflow.
Yes, agreed, and I will try to do such an initial PR next week to get things started.
I assume for pure Python files we will just want to use inline annotations, so that is the easiest place to get started?
And I also assume that the priority would be to add return types (so that at least the use case of autocomplete in IDEs would work). Is that a correct analysis?
However, the most relevant parts are of course in cython. For those we need to add `.pyi` stub files. And unfortunately, the situation does not seem to have improved much in the last two years with regards to generating them from the cython sources (the cython branch is stale, the PR in mypy was closed, and I also tried out a few others, none of which produced something decent out of the box). Will think about some options in a next comment.
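For illustration, a hand-written stub file is just Python-syntax signatures with `...` bodies. The sketch below mimics the shape of a pyarrow.fs stub, but the class names and signatures are illustrative, not the actual pyarrow API; a real stub would live in e.g. a `.pyi` file next to the compiled module:

```python
# Illustrative stub-file content (would normally live in a .pyi file).
# FileInfo/FileSystem loosely mimic pyarrow.fs but are NOT real signatures.
from typing import List


class FileInfo:
    path: str
    size: int


class FileSystem:
    # Stub bodies are just ``...``; only the signatures matter.
    def get_file_info(self, paths: List[str]) -> List[FileInfo]: ...
    def create_dir(self, path: str, recursive: bool = ...) -> None: ...
```

Because the compiled module carries no annotations itself, everything a type checker or IDE sees comes from declarations like these.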
The end goal should be that we have rather complete type annotations for the users of pyarrow, i.e. with most parts of pyarrow being in cython. Thinking through some options for how to get there, and how to maintain and distribute those type stubs:

- Leave the stubs to the external `pyarrow-stubs` package, which provides more advanced stubs (e.g. `pa.array(..)` overloads). Of course, even if pyarrow installs type stubs itself, users could still go to `pyarrow-stubs` for more advanced stubs.
- Include stubs in pyarrow itself, i.e. hand-written stub files (a more basic version than `pyarrow-stubs` would already be useful as well, and as mentioned above people who want more advanced stubs can always still install the external stubs). This adds the effort of keeping those stubs in sync with the cython source code while changes are made (mypy's `stubtest` can catch some basic discrepancies, but I think that is limited).
Thoughts on this? Preferences? Other options you can think of?
(as someone new to this part of Python, I do appreciate feedback, and there is a real chance that I have gotten things quite wrong!)
Personally, I think that if we have a decent solution for auto-generation that produces "good enough" stubs, that would be my preference.
I have been looking into some of the existing (partial / abandoned) solutions, and I do think we can get something working short term. For example, I think it should be possible, with minimal patches, to get mypy's stubgen to recognize cython's `cyfunction` as a normal function. Otherwise a similar approach as mpi4py could also work, or the suggestion of adding inline comments with type hints and a script to extract those.
And longer term we could look into reviving the PR in the cython repo to generate stub files.
Thank you for your recognition. I accidentally discovered that the package I created a long time ago in my free time could be helpful to you, and I’m happy to contribute code to your project.
Additionally, if pyarrow needs to maintain its own type annotations, I recommend using the wrapper pattern, as `.pyi` files are limited in that they cannot include documentation.
For example, if there is a function `add` in `_lib.so`, we can define a function with the same name in `lib.py`, add type annotations and documentation to it, and have it call `_lib.add`.
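A minimal sketch of that wrapper pattern (here `_lib` is faked with a plain Python stand-in so the example is self-contained; in real code it would be the compiled extension module):

```python
# Sketch of the wrapper pattern described above. ``_lib`` is a stand-in
# for a compiled extension module such as _lib.so; in real code you would
# import it instead of faking it here.
from types import SimpleNamespace

_lib = SimpleNamespace(add=lambda a, b: a + b)  # hypothetical compiled function


def add(a: int, b: int) -> int:
    """Add two values.

    Unlike a .pyi stub, this wrapper carries both the type annotations
    and the docstring at runtime, so help(add) and IDEs see everything.
    """
    return _lib.add(a, b)


print(add(2, 3))  # prints 5
```

The trade-off is a tiny extra function call per invocation in exchange for annotations and `__doc__` living in one importable place.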
[zen-xu] `.pyi` files are limited in that they cannot include documentation.

`.pyi` files can include documentation, but not all tools will look for documentation in `.pyi` files. In particular, `.pyi` files will not set `__doc__` on objects. But tools like vscode and https://github.com/mkdocstrings/python will use docstrings from `.pyi` files.
mypy and static type checking
As of Python 3.6, it has been possible to introduce typing information in the code, and this became immensely popular in a short period of time. Shortly after, the tool mypy arrived, and it has become the industry standard for static type checking in Python. It checks for invalid types very quickly, which makes it possible to use as a pre-commit hook. It has raised many bugs that I did not see myself and has been a very valuable tool.

Now what does this mean for PyArrow?
When we run mypy on code that uses PyArrow, we get an error about missing library stubs or a py.typed marker. More information is available here: https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-library-stubs-or-py-typed-marker
You can solve this in three ways:
1. Ignore the errors, which means everything from PyArrow is typed as `Any`, making it unable to find user errors with the PyArrow library.
2. Create stub files (`.pyi`) describing the interfaces.
3. Create a `py.typed` file and use inline type hints. This is the most popular option today because it requires no extra files (except for the py.typed file), allows all the type hints to be with the code (like now in the documentation), and not only provides your users but also the developers of the library themselves with type hints (and hinting of issues inside your IDE).

My personal opinion already shines through the options: it is 3, as this has quickly become the industry standard since its introduction.
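To make option 3 concrete, here is a toy sketch (the `ls` function below is a hypothetical stand-in modeled on the earlier hdfs.ls example, not the real pyarrow implementation); alongside inline hints like these, the package only needs to ship an empty `py.typed` marker file for type checkers to pick them up:

```python
# Toy illustration of option 3: inline annotations in the library source.
# With an empty py.typed marker shipped in the package, downstream projects
# running mypy will see these types. The body below is made up for the sketch.
from typing import Any, Dict, List


def ls(path: str, detail: bool = False) -> List[Dict[str, Any]]:
    """List a (fake) directory, echoing the earlier hdfs.ls example."""
    entries: List[Dict[str, Any]] = [{"name": f"{path}/part-0", "size": 0}]
    if detail:
        return entries
    return [{"name": e["name"]} for e in entries]
```

The annotations live with the implementation, so both users and the library's own developers benefit, with no separate stub file to keep in sync.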
What should we do?
I'd very much like to work on this; however, I don't feel like wasting time. Therefore, I am raising this ticket to see if this has been considered before or if we just haven't gotten to it yet.
I'd like to open the discussion here:
Should we remove the type information from the docstrings, given the types will be in the function signatures? Or should we keep it in the docstrings as well as in the code, which would duplicate it?
Reporter: Jorrick Sleijster
Note: This issue was originally created as ARROW-17335. Please see the migration documentation for further details.