IBM / data-prep-kit

Open source project for data preparation for LLM application builders
https://ibm.github.io/data-prep-kit/
Apache License 2.0

[Bug] dpk-connector doesn't crawl https://thealliance.ai/ #777

Closed by sujee 1 week ago

sujee commented 2 weeks ago

Search before asking

Component

Other

What happened + What you expected to happen

I have given 'https://thealliance.ai/' as the base URL to download. The crawl doesn't download any pages.

However, it crawls 'https://thealliance.ai/our-work' successfully :-)

Reproduction script

https://github.com/sujee/data-prep-kit/blob/html-processing-1/examples/notebooks/html-processing/1_download_site.ipynb

Anything else

No response

OS

Ubuntu

Python

3.11.x

Are you willing to submit a PR?

sujee commented 2 weeks ago

CC: @Qiragg

hmtbr commented 2 weeks ago

@sujee Can you provide more details about what the problem is? I cannot understand what the bug is, how I can reproduce it, or what the expected behavior is from your description.

sujee commented 2 weeks ago

@hmtbr I have given you the reproduction script. You can see the code here: https://github.com/sujee/data-prep-kit/blob/html-processing-1/examples/notebooks/html-processing/1_download_site.ipynb

When I try to crawl 'https://thealliance.ai/', nothing is downloaded.

Can you crawl this site? Can you confirm if this is working for you?

hmtbr commented 2 weeks ago

@sujee I cannot reproduce the issue. I can crawl pages with the seed URL https://thealliance.ai/ using the following code:

from dpk_connector import crawl, shutdown

def main():
    """
    An example of running crawler.
    """

    def on_downloaded(url: str, body: bytes, headers: dict) -> None:
        print(f"url: {url}, headers: {headers}, body: {body[:64]}")

    user_agent = "Mozilla/5.0 (X11; Linux i686; rv:125.0) Gecko/20100101 Firefox/125.0"

    crawl(
        ["https://thealliance.ai/"],
        on_downloaded,
        user_agent=user_agent,
        depth_limit=1,
        subdomain_focus=True,
    )  # blocking call

    shutdown()

if __name__ == "__main__":
    main()

Qiragg commented 2 weeks ago

Hi @sujee, it is not a bug in the connector.

There was an error early on while trying to save https://thealliance.ai/:

Visited url: https://thealliance.ai/
input/
Error in on_downloaded callback: [Errno 21] Is a directory: 'input/'

This is the home page, and the function get_filename_from_url doesn't retrieve any filename from the URL https://thealliance.ai/ because there isn't one.

As a general practice, I recommend generating a hash or using the complete URL as the filename when crawling many pages. That gets around odd cases like this one, and it also prevents downloaded pages from being overwritten when other pages crawled earlier happen to share the same filename but differ in the complete URL.
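
For illustration, here is a minimal sketch of such a hash-based filename helper (url_to_filename is a hypothetical name, not the helper in my_utils):

import hashlib
from urllib.parse import urlparse

def url_to_filename(url: str, extension: str = ".html") -> str:
    # Hash the full URL so every distinct URL maps to a distinct, valid
    # filename, including the home page "https://thealliance.ai/" which
    # has no path component to derive a name from.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    host = urlparse(url).netloc.replace(":", "_")
    return f"{host}_{digest}{extension}"

print(url_to_filename("https://thealliance.ai/"))  # e.g. thealliance.ai_<digest>.html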

We are not using exception handling to catch the error, so the error does not get printed in this case.

I added exception handling to the cell where you run the crawl in your fork. Please try again using the code below:

from dpk_connector import crawl, shutdown
import nest_asyncio
import os
from my_utils import get_mime_type, get_filename_from_url
from dpk_connector.core.utils import validate_url

# Enable a nested event loop run for the crawler inside Jupyter Notebook
nest_asyncio.apply()

# Initialize counters
retrieved_pages = 0
saved_pages = 0

# Callback function to be executed at the retrieval of each page during a crawl
def on_downloaded(url: str, body: bytes, headers: dict) -> None:
    global retrieved_pages, saved_pages
    try:
        retrieved_pages += 1
        if saved_pages < MY_CONFIG.CRAWL_MAX_DOWNLOADS:
            print(f"Visited url: {url}")

        # Get mime_type of retrieved page
        mime_type = get_mime_type(body)

        # Save the page only if its MIME type matches the configured type
        if MY_CONFIG.CRAWL_MIME_TYPE in mime_type.lower():
            filename = get_filename_from_url(url)
            local_file_path = os.path.join(MY_CONFIG.INPUT_DIR, filename)
            print(local_file_path)

            with open(local_file_path, 'wb') as f:
                f.write(body)

            if saved_pages < MY_CONFIG.CRAWL_MAX_DOWNLOADS:
                print(f"Saved contents of url: {url}")
            saved_pages += 1
    except Exception as e:
        print(f"Error while saving downloaded content: {e}")

# Define a user agent
user_agent = "dpk-connector"

# Function to run the crawl
async def run_my_crawl():
    try:
        crawl(
            [MY_CONFIG.CRAWL_URL_BASE],
            on_downloaded,
            user_agent=user_agent,
            depth_limit=MY_CONFIG.CRAWL_MAX_DEPTH,
            path_focus=True,
            download_limit=MY_CONFIG.MAX_DOWNLOADS
        )
        return "Crawl is done"
    except Exception as e:
        return f"Error during crawl: {e}"

# Run the crawl
await run_my_crawl()

sujee commented 2 weeks ago

Hi @Qiragg, good catch. I will test with this. Thank you! :smile:

sujee commented 2 weeks ago

Another question @Qiragg:

Is saving the content to a local file left to the callback function?

Qiragg commented 2 weeks ago

Another question @Qiragg:

Is saving the content to a local file left to the callback function?

Yes.

The core crawler only retrieves the content. What to do with the crawled content is up to the user and can be decided in the callback function; we do not provide out-of-the-box support for that in the core library.
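
For example, a minimal callback that only collects results in memory could look like this (a sketch; the crawl parameters mirror the example earlier in this thread):

from dpk_connector import crawl, shutdown

pages: dict[str, bytes] = {}

def on_downloaded(url: str, body: bytes, headers: dict) -> None:
    # The core only hands over url, body, and headers; whether to save
    # to disk, store in memory, or discard is decided here.
    pages[url] = body

crawl(
    ["https://thealliance.ai/"],
    on_downloaded,
    user_agent="dpk-connector",
    depth_limit=1,
)  # blocking call
shutdown()
print(f"Retrieved {len(pages)} pages")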

Qiragg commented 2 weeks ago

@touma-I I would mark this issue as resolved and close it. cc: @sujee

sujee commented 2 weeks ago

@Qiragg quick question: can we make sure any exception from the callback function is logged, so we are aware of any errors? I think this should be done in the core. Thoughts?
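
To illustrate the request, the core could wrap the user callback so that exceptions are logged rather than silently lost (a sketch only, not existing dpk-connector code):

import logging

logger = logging.getLogger("dpk_connector")

def wrap_callback(on_downloaded):
    # Hypothetical wrapper the core could apply around the user-supplied
    # callback before invoking it for each downloaded page.
    def safe_on_downloaded(url: str, body: bytes, headers: dict) -> None:
        try:
            on_downloaded(url, body, headers)
        except Exception:
            # Log and continue so one failing page does not abort the crawl.
            logger.exception("Exception in on_downloaded callback for %s", url)
    return safe_on_downloaded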

Qiragg commented 2 weeks ago

@sujee the callback function is user-defined and not part of the core; I didn't understand your requirement.