ScaleUnlimited / flink-crawler

Continuous scalable web crawler built on top of Flink and crawler-commons
Apache License 2.0

Use one FetchFunction and one ParseFunction for all types of URLs #34

Open kkrugler opened 7 years ago

kkrugler commented 7 years ago

Currently we have different fetchers and (effectively) different parsers for robots.txt, sitemap, and regular URLs. This isn't very clean, and it duplicates code. An alternative approach is to have a single FetchFunction and a single ParseFunction that know how to handle the different types of URLs.
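As a rough illustration of the single-function idea, here is a minimal sketch in plain Java with no Flink dependency. Every name in it (`UnifiedFetchSketch`, `UrlType`, `TypedUrl`, the string-returning stand-in fetchers) is hypothetical and not taken from the flink-crawler code. It assumes each URL arrives already tagged with its type, so one `fetch` entry point can pick the right fetcher while any shared logic lives in a single place.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch; none of these names come from flink-crawler.
class UnifiedFetchSketch {

    // The kinds of URL mentioned in this issue.
    enum UrlType { ROBOTS, SITEMAP, REGULAR }

    // A URL tagged with its type by whatever upstream step emitted it (assumption).
    static final class TypedUrl {
        final String url;
        final UrlType type;

        TypedUrl(String url, UrlType type) {
            this.url = url;
            this.type = type;
        }
    }

    // Stand-ins for real fetchers: each maps a URL to a description of the fetch.
    private final Map<UrlType, Function<String, String>> fetchers =
            new EnumMap<>(UrlType.class);

    UnifiedFetchSketch() {
        fetchers.put(UrlType.ROBOTS, url -> "robots-fetch:" + url);
        fetchers.put(UrlType.SITEMAP, url -> "sitemap-fetch:" + url);
        fetchers.put(UrlType.REGULAR, url -> "page-fetch:" + url);
    }

    // Single entry point: only the fetcher varies by URL type.
    String fetch(TypedUrl input) {
        return fetchers.get(input.type).apply(input.url);
    }

    public static void main(String[] args) {
        UnifiedFetchSketch sketch = new UnifiedFetchSketch();
        System.out.println(sketch.fetch(
                new TypedUrl("http://example.com/robots.txt", UrlType.ROBOTS)));
        System.out.println(sketch.fetch(
                new TypedUrl("http://example.com/page.html", UrlType.REGULAR)));
    }
}
```

A single ParseFunction could follow the same pattern, with a parser per `UrlType` in place of a fetcher.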

kkrugler commented 6 years ago

We could also use this to improve handling of shortened URLs. Basically, flag the URL as a shortened URL, and in the FetchFunction use a different fetcher (no redirects) to resolve it. This would solve the issue of us currently (re)fetching the same shortened URL multiple times. In effect, we treat it as a special case of redirection, where we're anticipating the redirection and optimizing for it.
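One possible reading of the shortened-URL idea, sketched in plain Java. All names here (`ShortUrlSketch`, `targetFor`, `resolveOverHttp`) are hypothetical. The in-memory map that remembers resolved targets is an assumption about how repeat fetches might be avoided, not something this issue specifies. The no-redirect lookup uses the standard `HttpURLConnection` with redirect following turned off and reads the `Location` header.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch; none of these names come from flink-crawler.
class ShortUrlSketch {

    // Resolves a shortened URL to its redirect target, if there is one.
    private final Function<String, Optional<String>> resolver;

    // Assumption: remember resolved targets so each short URL is resolved once.
    private final Map<String, String> resolved = new HashMap<>();

    int resolverCalls = 0;

    ShortUrlSketch(Function<String, Optional<String>> resolver) {
        this.resolver = resolver;
    }

    // Returns the URL that should actually be fetched.
    String targetFor(String url, boolean isShortened) {
        if (!isShortened) {
            return url;
        }
        String cached = resolved.get(url);
        if (cached != null) {
            return cached;
        }
        resolverCalls++;
        // Fall back to the original URL if no redirect target was found.
        String target = resolver.apply(url).orElse(url);
        resolved.put(url, target);
        return target;
    }

    // A resolver that issues a HEAD request without following redirects.
    // Note: the Location header may be relative; this sketch does not handle that.
    static Optional<String> resolveOverHttp(String url) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setInstanceFollowRedirects(false);
            conn.setRequestMethod("HEAD");
            int status = conn.getResponseCode();
            String location = conn.getHeaderField("Location");
            conn.disconnect();
            boolean isRedirect = status >= 300 && status < 400;
            return isRedirect ? Optional.ofNullable(location) : Optional.empty();
        } catch (IOException | IllegalArgumentException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Fake resolver so the example runs without network access.
        Map<String, String> fake = Map.of("http://sho.rt/abc", "http://example.com/long");
        ShortUrlSketch sketch =
                new ShortUrlSketch(u -> Optional.ofNullable(fake.get(u)));
        System.out.println(sketch.targetFor("http://sho.rt/abc", true));
        System.out.println(sketch.targetFor("http://sho.rt/abc", true));
        System.out.println("resolver calls: " + sketch.resolverCalls);
    }
}
```

To resolve against a real server, `ShortUrlSketch::resolveOverHttp` can be passed as the resolver in place of the fake map lookup.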