IBM / data-prep-kit

Open source project for data preparation of LLM application builders
https://ibm.github.io/data-prep-kit/
Apache License 2.0
307 stars 134 forks

[Feature] [Connector] Apply subdomain focus based on the seed url #724

Closed hmtbr closed 1 month ago

hmtbr commented 1 month ago

Component

Other

Feature

If the user provides https://research.example.com as a seed URL for the data-prep-connector, the connector should automatically apply subdomain focus, so that it does not crawl any subdomain of example.com other than research.

The user may also provide multiple seed URLs, such as corporate.example.com and research.example.com, in which case we would like to crawl only those two subdomains: corporate and research.
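
The requested behavior can be illustrated with a minimal sketch. This is not the data-prep-connector implementation; the helper names `allowed_hosts` and `in_subdomain_focus` are hypothetical, and the sketch assumes that "subdomain focus" means an exact hostname match against the seed URLs.

```python
from urllib.parse import urlsplit


def allowed_hosts(seed_urls):
    """Collect the hostnames of the seed URLs (hypothetical helper)."""
    hosts = set()
    for url in seed_urls:
        # A seed given without a scheme (e.g. "corporate.example.com")
        # would parse with an empty host, so assume https for it.
        if "://" not in url:
            url = "https://" + url
        host = urlsplit(url).hostname
        if host:
            hosts.add(host.lower())
    return hosts


def in_subdomain_focus(url, hosts):
    """Return True only if the URL's hostname is one of the seed hostnames."""
    host = urlsplit(url).hostname
    return host is not None and host.lower() in hosts


hosts = allowed_hosts(["corporate.example.com", "https://research.example.com"])
print(in_subdomain_focus("https://research.example.com/papers", hosts))
print(in_subdomain_focus("https://www.example.com/", hosts))
```

Under these assumptions, a link on research.example.com is kept, while a link on www.example.com is skipped because its hostname is not among the seeds.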

jo-in commented 1 month ago

@hmtbr Can you also confirm the behavior when a seed URL such as https://arxiv.org/year/astro-ph/2024 is specified? Will the connector be able to download papers from https://arxiv.org/list/astro-ph/2024-01 (note that the path starts with /list and not /year), or only from URLs that start with the exact seed URL?

hmtbr commented 1 month ago

@jo-in Yes, it can download URLs whose paths differ from the seed URL's path, unless you specify path_focus=True (it defaults to False).
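
To make the distinction concrete, here is a rough sketch of a path check. It is an illustration only, not the connector's code: the function `in_path_focus` is hypothetical, and it assumes that path focus means "the URL path must be the seed path or sit below it". Only the `path_focus` flag and its default of False come from this thread.

```python
from urllib.parse import urlsplit


def in_path_focus(url, seed_url, path_focus=False):
    """Hypothetical check: with path_focus off, every path is accepted."""
    if not path_focus:
        return True
    seed_path = urlsplit(seed_url).path.rstrip("/")
    path = urlsplit(url).path
    # Accept the seed path itself or anything nested under it.
    return path == seed_path or path.startswith(seed_path + "/")


seed = "https://arxiv.org/year/astro-ph/2024"
other = "https://arxiv.org/list/astro-ph/2024-01"
print(in_path_focus(other, seed))                   # default: path is not restricted
print(in_path_focus(other, seed, path_focus=True))  # restricted to the seed path
```

With the default setting the /list URL is accepted; with path_focus=True it is rejected under this sketch's assumptions, because its path does not start with /year/astro-ph/2024.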