Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem and sending it to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

List of URLs for Redirects #954

Closed. salinausd305 closed this issue 3 days ago.

salinausd305 commented 2 months ago

I'm trying to use the web crawler to get a list of URLs for our websites. We are moving to a new platform and I'm hoping to get a list of URLs for our redirects.

I have the web crawler running; I used the Config Starter page and tested it, but I'm not sure how to get the data into a CSV file.

I looked at the CSVFileCommitter documentation, but I'm still not sure how to make it work.

The only data I want is a single-column list of URLs in a CSV file that I can review.

Is there a way to set it up in the config file?

ohtwadi commented 2 months ago

Try using URLStatusCrawlerEventListener instead.

Since you only want the URLs and do not care about site content, pair this with an IDocumentFilter. The simplest might be SegmentCountURLFilter with count set to 0 or 1 and onMatch set to exclude, as sketched below.

(not tested)
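If it helps, here is a rough, untested sketch of how those two pieces could be wired into the crawler section generated by the Config Starter. The element and attribute names follow the v3 documentation as I recall them, so double-check them against your version; the crawler id, output directory, and file prefix are just placeholders.

```xml
<crawler id="my-crawler">
  <!-- ... startURLs and other settings from the Config Starter ... -->

  <!-- Record every crawled URL and its HTTP status to a report file. -->
  <eventListeners>
    <listener class="URLStatusCrawlerEventListener">
      <!-- Capture all statuses; narrow this (e.g. to 301,302) if you
           only want URLs that actually redirect. -->
      <statusCodes>100-599</statusCodes>
      <outputDir>./reports</outputDir>
      <fileNamePrefix>url-status</fileNamePrefix>
    </listener>
  </eventListeners>

  <!-- Reject every document so no content is imported or committed;
       the event listener above still sees each crawled URL. -->
  <documentFilters>
    <filter class="SegmentCountURLFilter" onMatch="exclude" count="0" />
  </documentFilters>
</crawler>
```

The listener should write one row per crawled URL along with its status code, so the single column of URLs you are after can be cut from that report afterwards, and you should not need a committer section at all.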

stale[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.