apify / crawlee-python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev/python/
Apache License 2.0

Explore what doc tooling we use in the SDK and how it handles dataclass docstrings #72

Closed (vdusek closed this 1 month ago)

vdusek commented 7 months ago

Let's consider the following example:

from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class MemorySnapshot:
    """A snapshot of memory usage.

    Args:
        total_bytes: Total memory available in the system.
        current_bytes: Memory usage of the current Python process and its children.
        max_memory_bytes: The maximum memory that can be used by `AutoscaledPool`.
        max_used_memory_ratio: The maximum acceptable ratio of `current_bytes` to `max_memory_bytes`.
        created_at: The time at which the measurement was taken.
    """

    total_bytes: int
    current_bytes: int
    max_memory_bytes: int
    max_used_memory_ratio: float
    created_at: datetime = field(default_factory=lambda: datetime.now(tz=timezone.utc))

    @property
    def is_overloaded(self) -> bool:
        """Returns whether the memory is considered as overloaded."""
        return (self.current_bytes / self.max_memory_bytes) > self.max_used_memory_ratio

Can our doc tooling (perhaps the one we use in the SDK) handle this properly?

Based on the discussion here: https://github.com/apify/crawlee-py/pull/20#discussion_r1521198126.
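
For context, here is a minimal sketch of what a documentation tool can introspect on such a dataclass: the class docstring and the signature of the generated `__init__`. The trimmed `Snapshot` class is a hypothetical stand-in for `MemorySnapshot`, not code from the repository.

```python
import inspect
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Snapshot:
    """A snapshot of memory usage (hypothetical, trimmed example).

    Args:
        current_bytes: Memory usage of the current process.
        max_memory_bytes: The maximum memory that can be used.
        created_at: The time at which the measurement was taken.
    """

    current_bytes: int
    max_memory_bytes: int
    created_at: datetime = field(default_factory=lambda: datetime.now(tz=timezone.utc))


# @dataclass generates __init__, so there is no hand-written constructor
# to carry a docstring; the `Args:` section lives in the class docstring.
init_params = list(inspect.signature(Snapshot).parameters)
class_doc = inspect.getdoc(Snapshot)

print(init_params)
print(class_doc)
```

The constructor parameters mirror the field order, which is why describing them under `Args:` in the class docstring is a natural fit. Whether a given tool renders that section well is a separate question.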

janbuchar commented 7 months ago

Paging @barjin - could you provide some details about how we generate docs for the Python SDK?

From https://stackoverflow.com/questions/51125415/how-do-i-document-a-constructor-for-a-class-using-python-dataclasses it seems that Sphinx can indeed handle an `Args:` section in the class docstring in a reasonable fashion.
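
Independently of any particular tool, one quick way to check that an `Args:` section stays in sync with the dataclass fields is sketched below. This is not how Sphinx or pydoc-markdown work internally. `parse_args_section` and the `Limits` class are hypothetical, and the parser assumes Google-style docstrings with single-line entries only.

```python
import inspect
import re
from dataclasses import dataclass, fields


def parse_args_section(docstring: str) -> dict[str, str]:
    """Hypothetical helper: map names in an `Args:` section to their descriptions.

    Simplification: each entry must fit on one line (no continuation lines).
    """
    described: dict[str, str] = {}
    in_args = False
    for line in docstring.splitlines():
        stripped = line.strip()
        if stripped == "Args:":
            in_args = True
            continue
        if in_args:
            match = re.match(r"^(\w+):\s*(.+)$", stripped)
            if not match:
                break  # a blank line or a new section ends the Args block
            described[match.group(1)] = match.group(2)
    return described


@dataclass
class Limits:
    """Resource limits (hypothetical example).

    Args:
        max_memory_bytes: The maximum memory that can be used.
        max_used_memory_ratio: The maximum acceptable usage ratio.
    """

    max_memory_bytes: int
    max_used_memory_ratio: float
    label: str = ""  # deliberately left out of the docstring


described = parse_args_section(inspect.getdoc(Limits) or "")
undocumented = [f.name for f in fields(Limits) if f.name not in described]
print(undocumented)
```

A check like this could run in CI to flag fields that were added without a matching docstring entry, whichever renderer ends up producing the API reference.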

barjin commented 7 months ago

Following the in-office discussion, I'm sharing this here, so we can refer to it:

The current API reference for Python projects is an amalgamation of pydoc-markdown, the existing tools we have for JS projects, and one very ugly script that converts a Python syntax tree to a JavaScript syntax tree (plus a pinch of bash scripts).

It's by no means a good solution - having something new, clean and cool in this project would be very nice (as we could then port it from here to all the other repos). Sorry to let you all down :(

vdusek commented 1 month ago

Closing this one. It has already been explored and discussed with @barjin and @janbuchar, and we have https://github.com/apify/crawlee-python/issues/324.