CC: @Qiragg
@sujee Can you provide more details about what the problem is? I cannot understand what the bug is, how I can reproduce it, or what the expected behavior is from your description.
@hmtbr I have given you the reproduction script. You can see the code here https://github.com/sujee/data-prep-kit/blob/html-processing-1/examples/notebooks/html-processing/1_download_site.ipynb
When I try to crawl 'https://thealliance.ai/', nothing gets downloaded.
Can you crawl this site? Can you confirm whether it is working for you?
@sujee I cannot reproduce the issue. I can crawl pages with the seed URL https://thealliance.ai/ using the following code:
```python
from dpk_connector import crawl, shutdown


def main():
    """
    An example of running crawler.
    """

    def on_downloaded(url: str, body: bytes, headers: dict) -> None:
        print(f"url: {url}, headers: {headers}, body: {body[:64]}")

    user_agent = "Mozilla/5.0 (X11; Linux i686; rv:125.0) Gecko/20100101 Firefox/125.0"

    crawl(
        ["https://thealliance.ai/"],
        on_downloaded,
        user_agent=user_agent,
        depth_limit=1,
        subdomain_focus=True,
    )  # blocking call

    shutdown()


if __name__ == "__main__":
    main()
```
Hi @sujee, it is not a bug in the connector.
There was an error early on while trying to save https://thealliance.ai/:

```
Visited url: https://thealliance.ai/
input/
Error in on_downloaded callback: [Errno 21] Is a directory: 'input/'
```
This is the home page, and the function get_filename_from_url doesn't retrieve any filename from https://thealliance.ai/ because there is none in the URL.
As a general practice, I recommend using either a hash or the complete URL as the filename when crawling many pages. That gets around odd cases like this one, and it also prevents a downloaded page from being overwritten by another page crawled later that shares the same filename but differs in the complete URL.
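For example, a helper along these lines could produce a stable filename for any URL. This is just a sketch of the idea; filename_for_url is an illustrative name, not part of dpk_connector or the notebook's my_utils:

```python
import hashlib
from urllib.parse import urlparse


def filename_for_url(url: str, default_ext: str = ".html") -> str:
    """Derive a filesystem-safe filename from a URL.

    Falls back to a SHA-256 hash of the complete URL when the path has no
    usable last segment (e.g. the home page), so distinct URLs never
    collide on the same name.
    """
    name = urlparse(url).path.rsplit("/", 1)[-1]  # last path segment, may be empty
    if not name:
        # Home page or trailing slash: hash the complete URL instead
        name = hashlib.sha256(url.encode("utf-8")).hexdigest() + default_ext
    return name


print(filename_for_url("https://thealliance.ai/"))           # hash-based name
print(filename_for_url("https://thealliance.ai/paper.pdf"))  # "paper.pdf"
```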
Because there is no exception handling around the save, the error doesn't get printed in this case.
I added exception handling to the cell where you run the crawl in your fork. Please try again with the code below:
```python
from dpk_connector import crawl, shutdown
import nest_asyncio
import os

from my_utils import get_mime_type, get_filename_from_url
from dpk_connector.core.utils import validate_url

# Enable a nested event loop run for the crawler inside Jupyter Notebook
nest_asyncio.apply()

# Initialize counters
retrieved_pages = 0
saved_pages = 0


# Callback function to be executed at the retrieval of each page during a crawl
def on_downloaded(url: str, body: bytes, headers: dict) -> None:
    global retrieved_pages, saved_pages
    try:
        retrieved_pages += 1
        if saved_pages < MY_CONFIG.CRAWL_MAX_DOWNLOADS:
            print(f"Visited url: {url}")

        # Get mime_type of retrieved page
        mime_type = get_mime_type(body)

        # Save the page if it is a PDF to only download research papers
        if MY_CONFIG.CRAWL_MIME_TYPE in mime_type.lower():
            filename = get_filename_from_url(url)
            local_file_path = os.path.join(MY_CONFIG.INPUT_DIR, filename)
            print(local_file_path)
            with open(local_file_path, 'wb') as f:
                f.write(body)
            if saved_pages < MY_CONFIG.CRAWL_MAX_DOWNLOADS:
                print(f"Saved contents of url: {url}")
            saved_pages += 1
    except Exception as e:
        print(f"Error while saving downloaded content: {e}")


# Define a user agent
user_agent = "dpk-connector"


# Function to run the crawl
async def run_my_crawl():
    try:
        crawl(
            [MY_CONFIG.CRAWL_URL_BASE],
            on_downloaded,
            user_agent=user_agent,
            depth_limit=MY_CONFIG.CRAWL_MAX_DEPTH,
            path_focus=True,
            download_limit=MY_CONFIG.MAX_DOWNLOADS,
        )
        return "Crawl is done"
    except Exception as e:
        return f"Error during crawl: {e}"


# Run the crawl
await run_my_crawl()
```
Hi @Qiragg, good catch. I will test with this. Thank you! :smile:
Another question, @Qiragg:
Is saving the content to a local file left to the callback function?
Yes.
The core crawler only retrieves the content. What to do with the crawled content is up to the user and is decided in the callback function; we do not provide out-of-the-box support for that in the core library.
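To make that concrete, here is a minimal sketch of a callback that does nothing but collect the retrieved pages in memory; the list, the seed URL, and the crawl arguments are illustrative choices, not connector defaults:

```python
from dpk_connector import crawl, shutdown

# The connector only hands the callback the raw response; what happens to it
# (write to disk, push to object storage, keep in memory, ...) is user code.
pages: list[tuple[str, bytes]] = []


def on_downloaded(url: str, body: bytes, headers: dict) -> None:
    pages.append((url, body))


crawl(["https://thealliance.ai/"], on_downloaded, user_agent="dpk-connector", depth_limit=1)  # blocking call
shutdown()
print(f"Retrieved {len(pages)} pages")
```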
@touma-I I would mark this issue as resolved and close it. cc: @sujee
@Qiragg quick question: can we make sure that any exception raised by the callback function is logged, so we are aware of errors? I think this should be done in the core. Thoughts?
@sujee The callback function is user-defined and not part of the core, isn't it? I didn't understand your requirement.
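For reference, until (or unless) such logging lands in the core, the same effect can be had in user code by wrapping the callback before passing it to crawl. A minimal sketch; logged_callback and the logger name are illustrative, not existing dpk_connector APIs:

```python
import functools
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("dpk_connector.callback")


def logged_callback(callback):
    """Wrap a user callback so any exception it raises is logged rather than silently swallowed."""

    @functools.wraps(callback)
    def wrapper(url: str, body: bytes, headers: dict) -> None:
        try:
            callback(url, body, headers)
        except Exception:
            logger.exception("on_downloaded callback failed for %s", url)

    return wrapper


# Usage: crawl([...], logged_callback(on_downloaded), user_agent=user_agent)
```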
Search before asking
Component
Other
What happened + What you expected to happen
I have given 'https://thealliance.ai/' as the base URL to download. The crawl doesn't download any pages.
However, it crawls 'https://thealliance.ai/our-work' successfully :-)
Reproduction script
https://github.com/sujee/data-prep-kit/blob/html-processing-1/examples/notebooks/html-processing/1_download_site.ipynb
Anything else
No response
OS
Ubuntu
Python
3.11.x
Are you willing to submit a PR?