IBM / data-prep-kit

Open source project for data preparation of LLM application builders
https://ibm.github.io/data-prep-kit/
Apache License 2.0

[Feature] improve parameters for crawl function for DPK-Connector #779

Closed: sujee closed this issue 1 week ago

sujee commented 2 weeks ago


Component

Other

Feature

# import path assumed from the DPK Connector package
from dpk_connector import crawl

crawl(
    [MY_CONFIG.CRAWL_URL_BASE],                    # seed URL(s) from the notebook config
    on_downloaded,                                 # user-supplied callback invoked per downloaded page
    user_agent=user_agent,
    depth_limit=MY_CONFIG.CRAWL_MAX_DEPTH,
    path_focus=True,
    download_limit=MY_CONFIG.CRAWL_MAX_DOWNLOADS,
)

Sample code: https://github.com/sujee/data-prep-kit/blob/html-processing-1/examples/notebooks/html-processing/1_download_site.ipynb

Are you willing to submit a PR?

sujee commented 2 weeks ago

CC: @Qiragg

touma-I commented 2 weeks ago

@Qiragg @hmtbr Let's discuss this in the context of the other requirements we are aware of. I am not sure we want to do what @sujee is asking for, but I am open to suggestions on how we can reconcile the various requirements.

hmtbr commented 2 weeks ago

@touma-I

> crawl function should take download directory location

The data-prep-connector is intended to work without any persistence layer; introducing a download directory would go against that design.
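
For illustration, a caller that wants pages written to disk can handle that entirely inside the callback it supplies. A minimal sketch; the directory, filename scheme, and callback signature are assumptions for illustration, not the connector's documented API:

    import hashlib
    from pathlib import Path

    # User-chosen location, owned by the caller, not the connector.
    download_dir = Path("downloads")
    download_dir.mkdir(parents=True, exist_ok=True)

    def on_downloaded(url: str, body: bytes, headers: dict) -> None:
        # Name each file by a hash of its URL and write the payload to disk;
        # the connector never needs to know where (or whether) pages are stored.
        filename = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
        (download_dir / filename).write_bytes(body)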

> arguments like depth_limit, download_limit should be made available to callback function on_downloaded

I'm not sure why this is required. These static values can be embedded in the callback function when it is defined (as sketched below), so implementing this would only introduce redundancy into our library.
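
For example, the limits can be bound at the point where the callback is defined, via a closure (or functools.partial). A minimal sketch, reusing the MY_CONFIG names from the snippet above and assuming a (url, body, headers) callback signature:

    def make_on_downloaded(depth_limit: int, download_limit: int):
        count = 0

        def on_downloaded(url: str, body: bytes, headers: dict) -> None:
            nonlocal count
            count += 1
            # The limits are available here without the connector passing them in.
            print(f"page {count}/{download_limit} (max depth {depth_limit}): {url}")

        return on_downloaded

    on_downloaded = make_on_downloaded(MY_CONFIG.CRAWL_MAX_DEPTH, MY_CONFIG.CRAWL_MAX_DOWNLOADS)

functools.partial would achieve the same binding if a plain function is preferred; either way the values stay in the caller's code rather than in the connector.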

touma-I commented 1 week ago

@sujee @Bytes-Explorer I will be closing this issue with no action. We need to decide at some point whether we want to expose the crawl function to notebook users; I don't think it is a good idea for now.

touma-I commented 1 week ago

The crawl function is not one we intended to expose to notebook users. We will revisit this at some point in the future.