apify / crawlee-python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev/python/
Apache License 2.0
3.71k stars 251 forks

pylance reportPrivateImportUsage #283

Closed · zigai closed this 1 month ago

zigai commented 1 month ago

When using VS Code, Pylance (v2024.7.1, latest) frequently reports reportPrivateImportUsage errors when importing classes.

[screenshot: Pylance reportPrivateImportUsage warnings on crawlee imports]

These errors can be fixed by defining an __all__ list in the module's __init__.py file, which explicitly declares the module's public names.

Here's an example for the http_crawler module:

from .http_crawler import HttpCrawler
from .types import HttpCrawlingContext, HttpCrawlingResult

__all__ = ["HttpCrawler", "HttpCrawlingContext", "HttpCrawlingResult"]
vdusek commented 1 month ago

@zigai Thanks for opening the issue. It's interesting because I also use Pylance v2024.7.1 and haven't encountered this problem.

Additionally, I believe the __all__ field is intended just for "star import" statements, such as:

from crawlee.http_crawler import *

am I correct?

Because what Pylance tells you isn't true: HttpCrawler is exported and can be imported directly:

from crawlee.http_crawler import HttpCrawler
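At runtime that is exactly how __all__ behaves: it restricts star imports, while direct imports of unlisted names keep working. A quick self-contained sketch (using a throwaway module with hypothetical names, not crawlee itself) showing both behaviors:

```python
import sys
import tempfile
import textwrap
from pathlib import Path

# Create a throwaway module that declares __all__ = ["public"].
pkg_dir = Path(tempfile.mkdtemp())
(pkg_dir / "demo_mod.py").write_text(textwrap.dedent("""\
    __all__ = ["public"]

    def public():
        return "public"

    def hidden():
        return "hidden"
"""))
sys.path.insert(0, str(pkg_dir))

# A star import only binds the names listed in __all__ ...
ns = {}
exec("from demo_mod import *", ns)
print("public" in ns, "hidden" in ns)  # True False

# ... but a direct import of an unlisted name still works at runtime.
from demo_mod import hidden
print(hidden())  # hidden
```

So on the Python level nothing is broken either way; the question is only what static tools like Pylance infer from the module.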
zigai commented 1 month ago

Pylance type checking mode has to be turned on to see these warnings. It's set to "off" by default, which may be why you don't see them.

The warnings Pylance shows aren't true: all imports work correctly at the Python level. But because Pylance has trouble seeing the actual structure of the module, the user experience of using crawlee inside VS Code is not as good as it could be:

  1. Most of the imports from crawlee get tagged with false reportPrivateImportUsage warnings.
  2. Users don't get code completion when writing import statements. Suggestions in the image below should be HttpCrawler, HttpCrawlingContext and HttpCrawlingResult.

[screenshot: import completion suggestions missing HttpCrawler, HttpCrawlingContext and HttpCrawlingResult]

It's true that the __all__ field is meant for star imports in Python, but other tools also use it for different purposes, like determining the public API of a package, which explains why adding it fixes both the warnings and the code completion.
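For example, a tool can derive a module's public API by preferring __all__ when it is declared and falling back to the non-underscore names otherwise. A minimal sketch of that pattern (public_api is a hypothetical helper, shown here against the stdlib json module, which defines __all__):

```python
import json  # stdlib example target; json declares an __all__ list

def public_api(module):
    """Return a module's public names: __all__ if declared, else
    every top-level name that doesn't start with an underscore."""
    declared = getattr(module, "__all__", None)
    if declared is not None:
        return sorted(declared)
    return sorted(n for n in dir(module) if not n.startswith("_"))

print("loads" in public_api(json))  # True
```

Pydoc and static checkers apply a similar rule, which is why declaring __all__ changes what they consider exported.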

If you are able to replicate these problems, keep in mind that this solution is more of a workaround; Pylance should work without __all__ being defined. This might actually be a problem with how the package is built.

visrut-at-incubyte commented 1 month ago

Yeah, tried __all__ and it seems to be working.