IBM / data-prep-kit

Open source project for data preparation of LLM application builders
https://ibm.github.io/data-prep-kit/
Apache License 2.0

[Feature] improve parameters for crawl function for DPK-Connector #779

Closed: sujee closed this issue 1 week ago

sujee commented 2 weeks ago


Component

Other

Feature

# import path assumed from the DPK Connector package
from dpk_connector import crawl

crawl(
    [MY_CONFIG.CRAWL_URL_BASE],                    # seed URL(s) from the notebook config
    on_downloaded,                                 # user-supplied callback invoked per downloaded page
    user_agent=user_agent,
    depth_limit=MY_CONFIG.CRAWL_MAX_DEPTH,
    path_focus=True,
    download_limit=MY_CONFIG.CRAWL_MAX_DOWNLOADS,
)

Sample code: https://github.com/sujee/data-prep-kit/blob/html-processing-1/examples/notebooks/html-processing/1_download_site.ipynb

Are you willing to submit a PR?

sujee commented 2 weeks ago

CC: @Qiragg

touma-I commented 2 weeks ago

@Qiragg @hmtbr Let's discuss this in the context of the other requirements we are aware of. I am not sure we want to do what @sujee is asking for, but I am open to suggestions on how we can reconcile the various requirements.

hmtbr commented 2 weeks ago

@touma-I

> crawl function should take download directory location

The data-prep-connector is intended to work without any persistence layer; introducing a download directory would go against that design.
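
For illustration, a caller that wants pages written to disk can handle that entirely inside the callback it supplies. A minimal sketch; the directory, filename scheme, and callback signature are assumptions for illustration, not the connector's documented API:

    import hashlib
    from pathlib import Path

    # User-chosen location, owned by the caller, not the connector.
    download_dir = Path("downloads")
    download_dir.mkdir(parents=True, exist_ok=True)

    def on_downloaded(url: str, body: bytes, headers: dict) -> None:
        # Name each file by a hash of its URL and write the payload to disk;
        # the connector never needs to know where (or whether) pages are stored.
        filename = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
        (download_dir / filename).write_bytes(body)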

> arguments like depth_limit, download_limit should be made available to callback function on_downloaded

I'm not sure why this is required. These static values can be embedded in the callback function when it is defined (as sketched below), so implementing this would only introduce redundancy into our library.
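
For example, the limits can be bound at the point where the callback is defined, via a closure (or functools.partial). A minimal sketch, reusing the MY_CONFIG names from the snippet above and assuming a (url, body, headers) callback signature:

    def make_on_downloaded(depth_limit: int, download_limit: int):
        count = 0

        def on_downloaded(url: str, body: bytes, headers: dict) -> None:
            nonlocal count
            count += 1
            # The limits are available here without the connector passing them in.
            print(f"page {count}/{download_limit} (max depth {depth_limit}): {url}")

        return on_downloaded

    on_downloaded = make_on_downloaded(MY_CONFIG.CRAWL_MAX_DEPTH, MY_CONFIG.CRAWL_MAX_DOWNLOADS)

functools.partial would achieve the same binding if a plain function is preferred; either way the values stay in the caller's code rather than in the connector.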

touma-I commented 1 week ago

@sujee @Bytes-Explorer I will be closing this issue with no action. We need to decide at some point whether we want to expose the crawl function to notebook users; I don't think it is a good idea for now.

touma-I commented 1 week ago

The crawl function is not one we intended to expose to notebook users. We will revisit this at some point in the future.