Closed hmtbr closed 1 month ago
@hmtbr Can you also confirm the behavior when you specify a seed URL like https://arxiv.org/year/astro-ph/2024? Will it be able to download papers from https://arxiv.org/list/astro-ph/2024-01 (note that the path starts with /list and not /year), or only those URLs starting with the exact seed URL?
@jo-in Yes, it can download URLs with different paths like that one, unless you specify path_focus=True (defaults to False).
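To make the difference concrete, here is a minimal sketch of how a path-focus check could behave. It is not the connector's actual implementation: the function name in_scope is hypothetical, and it assumes that without path focus a URL only needs to share the seed's host.

```python
from urllib.parse import urlparse


def in_scope(url: str, seed_url: str, path_focus: bool = False) -> bool:
    """Hypothetical scope check; not the connector's real API."""
    seed = urlparse(seed_url)
    target = urlparse(url)

    # Assumption: a URL must at least be on the same host as the seed.
    if target.hostname != seed.hostname:
        return False

    # Without path focus, any path on that host is accepted.
    if not path_focus:
        return True

    # With path focus, the URL path must equal the seed path or sit below it.
    seed_path = seed.path.rstrip("/")
    return target.path == seed_path or target.path.startswith(seed_path + "/")


seed = "https://arxiv.org/year/astro-ph/2024"
other = "https://arxiv.org/list/astro-ph/2024-01"

print(in_scope(other, seed))                   # path focus off
print(in_scope(other, seed, path_focus=True))  # path focus on
```

In this sketch the /list URL is accepted when path focus is off and rejected when it is on, which matches the behavior described in the reply above.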
Search before asking
Component
Other
Feature
If the user provides https://research.example.com as a seed URL for the data-prep-connector, the connector should automatically apply subdomain focus, so that it does not crawl any subdomain of example.com other than research.
The user may also provide multiple seed URLs, such as corporate.example.com and research.example.com, in which case only those two subdomains (corporate and research) should be crawled.
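The requested behavior could be sketched roughly as follows. This is only an illustration of the idea, not a proposed API: the names seed_hosts and in_subdomain_scope are hypothetical, and it assumes that "subdomain focus" means an exact host match against the seeds (so a.research.example.com would not be crawled).

```python
from urllib.parse import urlparse


def seed_hosts(seed_urls):
    """Collect the hostnames of all seed URLs (hypothetical helper)."""
    hosts = set()
    for seed in seed_urls:
        # Seeds such as "corporate.example.com" have no scheme; add one so
        # that urlparse places the host in the netloc instead of the path.
        if "://" not in seed:
            seed = "https://" + seed
        host = urlparse(seed).hostname
        if host:
            hosts.add(host)
    return hosts


def in_subdomain_scope(url, seed_urls):
    """Return True only if the URL's host is one of the seed hosts."""
    host = urlparse(url).hostname
    return host is not None and host in seed_hosts(seed_urls)


seeds = ["corporate.example.com", "https://research.example.com"]

print(in_subdomain_scope("https://research.example.com/papers", seeds))
print(in_subdomain_scope("https://corporate.example.com/about", seeds))
print(in_subdomain_scope("https://www.example.com/", seeds))
```

With these seeds, the research and corporate URLs are in scope and the www URL is not. Whether deeper subdomains of a seed host should also be allowed is a design choice this sketch leaves open.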
Are you willing to submit a PR?