DAGWorks-Inc / hamilton

Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows, that encode lineage/tracing and metadata. Runs and scales everywhere python does.
https://hamilton.dagworks.io/en/latest/
BSD 3-Clause Clear License
1.75k stars 112 forks source link

` Parallelizable` type checking error with `pyright` / VS Code #1114

Closed cswartzvi closed 3 weeks ago

cswartzvi commented 3 weeks ago

First off, I wanted to say "thanks", this project is a breath of fresh air and exactly what I was looking for. Please keep up the great work!

I was working along with Dynamic DAGs/Parallel Execution when I encountered a type checking error using Parallelizable as prescribed in the docs.

Current behavior

Annotating a generator function with Parallelizable (see example below) results in the following pyright type error in either VS Code (using the basic type checking mode) and/or the command line:

error: Return type of generator function must be compatible with "Generator[str, Any, Any]"
    "Generator[str, Unknown, Unknown]" is not assignable to "Parallelizable[str]" (reportReturnType)

Full disclosure: I did not encounter any errors with mypy, even in strict mode.

Screenshots

image

Note: The text of the error message is slightly different then the command line.

Steps to replicate behavior

  1. Run pyright (either via VS Code or on the command line) on the following code:
    
    from hamilton.htypes import Parallelizable, Collect

def url() -> Parallelizable[str]: for url in ["web1", "web2", "web3"]: yield url

def url_loaded(url: str) -> str: return url

def counts(url_loaded: str) -> int: return len(url_loaded.split(" "))

def total_words(counts: Collect[int]) -> int: return sum(counts)


## Library & System Information
Python versions: 3.10, 3.11, 3.12
Hamilton version: 1.75
OS: Windows 11

# Expected behavior
I expected the  `Parallelizable` type to be compatible with the generator function, but given that it is a nominal subtype of generator this is not possible.

# Additional context
After looking around the code to see how `Parallelizable` is consumed (namely in conjunction with `typing.get_origin`), wouldn't it be possible to switch to structural subtyping using `Protocol`? Something like the following?

```python
U = TypeVar("U", covariant=True)

class Parallelizable(Iterable[U], Protocol[U]): ...

def is_parallelizable_type(type_: Type) -> bool:
    return _get_origin(type_) == Parallelizable

I added these changes to a fork and ran the executor tests successfully. Note that I was having issues with other tests before I made the change - maybe because I am on Windows - so I cannot comment on the rest of the test suite. If you think this is a worthwhile addition to Hamilton I would be honored to submit a PR - let me know. Thanks!

skrawcz commented 3 weeks ago

Thanks @cswartzvi for the find! Would you mind throwing up a PR? That way we could get a clean sense of how the tests run to determine if your suggestion would require further work. Otherwise I'll defer to @elijahbenizzy here to comment on your proposed code.