Scarvy / readwise-to-apple-notes

Export Readwise highlights to Apple Notes.
Apache License 2.0

`export` not exporting all highlights #4

Open Scarvy opened 1 week ago

Scarvy commented 1 week ago

I suspect that the export_highlights function is not returning all of my highlights.

I ran it recently and only received 82 total notes in Apple Notes. I have way more than that...

I think it has something to do with the generator or the API wrapper I'm using.

from typing import Generator, Optional


def export_highlights(
    updated_after: Optional[str] = None,
    book_ids: Optional[str] = None,
    token: Optional[str] = None,
) -> Generator[dict, None, None]:
    """Exports the highlights of books based on modification date and/or specific book IDs.

    This function iterates over pages of highlights fetched from the client service,
    filtering by update time and book IDs if provided, and yields each highlight.

    Parameters:
        updated_after (str, optional): The ISO 8601 date string to filter highlights
            that were updated after a certain date. Defaults to None.
        book_ids (str, optional): A comma-separated string of book IDs to filter
            highlights by specific books. Defaults to None.
        token (str, optional): A Readwise API token. Defaults to None.

    Yields:
        dict: A dictionary representing a single book and its highlights.
    """
    client = get_client(token)

    params = {}

    if updated_after:
        params["updatedAfter"] = updated_after
    if book_ids:
        params["ids"] = book_ids

    for data in client.get_pagination_limit_20("/export/", params=params):
        for book in data["results"]:
            yield book
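
A quick way to check whether the generator is truncating results is to tally what it yields and compare that with the totals shown in Readwise. This is a minimal sketch, assuming each yielded book dict carries its highlights in a "highlights" list; summarize_export and the sample data are hypothetical names used only for illustration.

```python
from typing import Iterable, Tuple


def summarize_export(books: Iterable[dict]) -> Tuple[int, int]:
    """Return (number of books, number of highlights) for exported books."""
    book_count = 0
    highlight_count = 0
    for book in books:
        book_count += 1
        # Assumption: each book dict holds its highlights under "highlights".
        highlight_count += len(book.get("highlights", []))
    return book_count, highlight_count


# Stand-in data; in practice this would be export_highlights(token=...).
sample_books = [
    {"title": "Book A", "highlights": [{"text": "a1"}, {"text": "a2"}]},
    {"title": "Book B", "highlights": [{"text": "b1"}]},
]
print(summarize_export(sample_books))  # (2, 3)
```

If the book count stops at a suspiciously round or small number, pagination is a likely culprit.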

Scarvy commented 1 week ago

The issue was in the API wrapper (pyreadwise): it was not requesting the next page when paginating /export/. Based on the API documentation, the /export/ endpoint paginates with the pageCursor parameter, while other endpoints such as /highlights/ use page.

pageCursor – (Optional) A string returned by a previous request to this endpoint. Use it to get the next page of books/highlights if there are too many for one request.

page – Specify the pagination counter.
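
The cursor-following loop this implies can be sketched roughly as follows. It is a minimal sketch, not pyreadwise's code: iter_export_pages, fetch_page and the in-memory fake pages are hypothetical stand-ins for the real HTTP call, and the only assumption about the response is that it carries a nextPageCursor field, as the fix in this thread relies on.

```python
from typing import Callable, Dict, Iterator, Optional


def iter_export_pages(
    fetch_page: Callable[[Dict[str, str]], dict],
    params: Optional[Dict[str, str]] = None,
) -> Iterator[dict]:
    """Yield each page of an /export/-style response by following nextPageCursor."""
    query = dict(params or {})  # copy so the caller's dict is never mutated
    while True:
        page = fetch_page(dict(query))
        yield page
        cursor = page.get("nextPageCursor")
        if not cursor:
            break  # no cursor means this was the last page
        query["pageCursor"] = str(cursor)


# Hypothetical in-memory stand-in for the HTTP request, keyed by cursor.
_FAKE_PAGES = {
    None: {"results": [{"title": "A"}], "nextPageCursor": "c1"},
    "c1": {"results": [{"title": "B"}], "nextPageCursor": "c2"},
    "c2": {"results": [{"title": "C"}], "nextPageCursor": None},
}


def fake_fetch(query: Dict[str, str]) -> dict:
    return _FAKE_PAGES[query.get("pageCursor")]


titles = [
    book["title"]
    for page in iter_export_pages(fake_fetch)
    for book in page["results"]
]
print(titles)  # ['A', 'B', 'C']
```

A loop that only ever sends page would keep receiving the first batch's cursor and stop after one request, which matches the truncated export described above.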

Scarvy commented 1 week ago

I made a quick fix like this that seems to work. I need to ensure it does not break the other pagination endpoint requests.

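    # Assumes `from typing import Optional` at module level, next to Generator and Literal.
    def _get_pagination(
        self,
        get_method: Literal['get', 'get_with_limit_20'],
        endpoint: str,
        params: Optional[dict] = None,
        page_size: int = 1000,
    ) -> Generator[dict, None, None]:
        '''
        Get a response from the Readwise API with pagination.

        Args:
            get_method: Method to use for making requests
            endpoint: API endpoint
            params: Query parameters
            page_size: Number of items per page
        Yields:
            dict: Response data
        '''
        # Copy so that neither a shared default nor the caller's dict is mutated.
        params = dict(params or {})
        if endpoint == "/export/":
            # /export/ paginates with a cursor rather than a page counter.
            pageCursor = None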
            while True:
                if pageCursor:
                    params.update({"pageCursor": pageCursor})
                logging.debug(f'Getting page with cursor "{pageCursor}"')
                try:
                    response = getattr(self, get_method)(endpoint, params=params)
                except ChunkedEncodingError:
                    logging.error(f'Error getting page with cursor "{pageCursor}"')
                    sleep(5)
                    continue
                data = response.json()
                yield data
                if (
                    isinstance(data, list)
                    or not data.get("nextPageCursor")
                    or data.get("nextPageCursor") == pageCursor
                ):
                    break
                pageCursor = data.get("nextPageCursor")
        else:
            page = 1
            while True:
                response = getattr(self, get_method)(
                    endpoint, params={"page": page, "page_size": page_size, **params}
                )
                data = response.json()
                yield data
                if isinstance(data, list) or not data.get("next"):
                    break
                page += 1
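
One way to check that the page-numbered endpoints are unaffected is to run the pagination logic against a stub client and inspect the query parameters it receives. The sketch below uses a condensed free-function copy of the two branches above (without the retry handling) and hypothetical FakeClient and FakeResponse classes; none of these names come from pyreadwise.

```python
from typing import Generator, List, Optional


class FakeResponse:
    def __init__(self, payload: dict) -> None:
        self._payload = payload

    def json(self) -> dict:
        return self._payload


class FakeClient:
    """Hypothetical stub: records query params and serves three pages per endpoint."""

    def __init__(self) -> None:
        self.calls: List[dict] = []

    def get(self, endpoint: str, params: dict) -> FakeResponse:
        self.calls.append(dict(params))
        if endpoint == "/export/":
            step = {None: "c1", "c1": "c2", "c2": None}
            cursor = step[params.get("pageCursor")]
            return FakeResponse({"results": [], "nextPageCursor": cursor})
        next_url = "more" if params["page"] < 3 else None
        return FakeResponse({"results": [], "next": next_url})


def paginate(
    client: FakeClient,
    endpoint: str,
    params: Optional[dict] = None,
    page_size: int = 1000,
) -> Generator[dict, None, None]:
    """Condensed copy of the cursor and page-counter branches."""
    params = dict(params or {})
    if endpoint == "/export/":
        cursor = None
        while True:
            if cursor:
                params["pageCursor"] = cursor
            data = client.get(endpoint, params=params).json()
            yield data
            cursor = data.get("nextPageCursor")
            if not cursor:
                break
    else:
        page = 1
        while True:
            query = {"page": page, "page_size": page_size, **params}
            data = client.get(endpoint, params=query).json()
            yield data
            if not data.get("next"):
                break
            page += 1


export_client = FakeClient()
export_pages = list(paginate(export_client, "/export/"))
highlights_client = FakeClient()
highlights_pages = list(paginate(highlights_client, "/highlights/"))

print([call.get("pageCursor") for call in export_client.calls])  # [None, 'c1', 'c2']
print([call["page"] for call in highlights_client.calls])  # [1, 2, 3]
```

The same stub could be pointed at the real method in a unit test, so both pagination styles are covered before a pull request is opened.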

Scarvy commented 1 week ago

I am deciding whether to create a pull request in the original API wrapper repo or write my own. I'm leaning toward making a pull request.