apify / crawlee-python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev/python/
Apache License 2.0
4.65k stars 319 forks source link

fix: Fix BaseDatasetClient.iter_items type hints #680

Closed Pijukatel closed 2 weeks ago

Pijukatel commented 2 weeks ago

Missing await to return AsyncIterator instead of coroutine.

janbuchar commented 2 weeks ago

Awesome! Could you add a simple test case to make sure this doesn't happen again?

B4nan commented 2 weeks ago

@Pijukatel one note from me, please don't rebase once you get a review (or don't rebase at all except for rebasing changes from master to your branch), reviews are harder if not impossible because of that, since all we see is a single commit. We squash merge PRs so they end up as a single commit in master, so doing this inside the PR is not needed and only complicates the review.

(it's more important for future work, for small PRs like this its surely not a big deal)


Also - tests, tests, tests. If you ask me, you should start with that and only then write some code to fix the problem :] They can be really small, or just an extension of some existing tests, but we need at least something.

Pijukatel commented 2 weeks ago

@Pijukatel one note from me, please don't rebase once you get a review (or don't rebase at all except for rebasing changes from master to your branch), reviews are harder if not impossible because of that, since all we see is a single commit. We squash merge PRs so they end up as a single commit in master, so doing this inside the PR is not needed and only complicates the review.

(it's more important for future work, for small PRs like this its surely not a big deal)

Also - tests, tests, tests. If you ask me, you should start with that and only then write some code to fix the problem :] They can be really small, or just an extension of some existing tests, but we need at least something.

Yes, I understand. I have to get used to this workflow. I worked a lot in gerrit previously, where amending is the way as the tool itself will show all the individual amended patchsets independently. I am adding test in other repo as here I will be changing just type hints.

janbuchar commented 2 weeks ago

It seems wrong to me to define an abstract method as synchronous and then override it with an asynchronous implementation.

The base method is annotated with an AsyncIterator return type, so it's still pretty async to me... Obligatory "mypy complains if we do it any other way" :smile:

Of course, feel free to elaborate on this @Pijukatel.

Pijukatel commented 2 weeks ago

It seems wrong to me to define an abstract method as synchronous and then override it with an asynchronous implementation.

The base method is annotated with an AsyncIterator return type, so it's still pretty async to me... Obligatory "mypy complains if we do it any other way" 😄

Of course, feel free to elaborate on this @Pijukatel.

I took the solution as one of the workarounds discussed here: https://github.com/python/mypy/issues/5070

But looking to the mypy documentation they recommend more verbose and explicit approach for abstract async iterators: https://mypy.readthedocs.io/en/stable/more_types.html#asynchronous-iterators

class LauncherAlsoCorrect(Protocol):
    async def launch(self) -> AsyncIterator[int]:
        raise NotImplementedError
        if False:
            yield 0

I have no preference here, I think both approaches are ugly :D

vdusek commented 2 weeks ago

I see, thanks... I probably prefer

    raise NotImplementedError
    if False:
        yield 0

more (with appropriate comment), as the method declarations remain consistent.