This PR fixes and enhances the crawler, which currently loses quite a lot of URLs due to network errors. For example, the number of pages in the cppreference.com table increases from 4286 to 5341. All failed downloads are now collected and retried after a round of crawling. The process continues until either 1) there are no failed downloads left or 2) no download progress is seen for a number of rounds (self.max_failed_retries).
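The retry behavior described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the function name `crawl_with_retries` and the `download` callback are hypothetical, and resetting the stall counter whenever a round makes progress is an assumption about how "no download progress for a number of rounds" is counted.

```python
from typing import Callable, Iterable, Set


def crawl_with_retries(
    urls: Iterable[str],
    download: Callable[[str], bool],  # hypothetical: returns True on success
    max_failed_retries: int = 3,
) -> Set[str]:
    """Return the URLs that still failed once retrying stopped."""
    # First round: try everything and collect the failures.
    failed = {url for url in urls if not download(url)}

    stalled_rounds = 0
    while failed and stalled_rounds < max_failed_retries:
        # Retry only the URLs that failed in the previous round.
        still_failed = {url for url in failed if not download(url)}
        if len(still_failed) < len(failed):
            stalled_rounds = 0  # assumption: any progress resets the counter
        else:
            stalled_rounds += 1
        failed = still_failed
    return failed
```

With this shape, a URL that fails only transiently is picked up in a later round, while a permanently broken URL stops the loop after `max_failed_retries` rounds without progress.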
The PR also fixes the issue of multiple pages having the same title, such as the two std::move pages. In such cases, a part of the path is added to the title, resulting in std::move (algorithm) and std::move (utility). There are 126 such pages (across both cplusplus.com and cppreference.com).
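As a rough illustration of the title disambiguation, the sketch below appends a path segment to any title that occurs more than once. The function name `disambiguate_titles`, the path-to-title mapping, and the choice of the second-to-last path segment as the qualifier are all assumptions made for this example; the PR itself may pick the path part differently.

```python
from collections import Counter
from typing import Dict


def disambiguate_titles(pages: Dict[str, str]) -> Dict[str, str]:
    """Map each URL path to a title, qualifying titles that are duplicated."""
    counts = Counter(pages.values())
    result = {}
    for path, title in pages.items():
        if counts[title] > 1:
            parts = path.strip("/").split("/")
            # Assumption: the parent segment (e.g. "algorithm") is the qualifier.
            qualifier = parts[-2] if len(parts) >= 2 else parts[-1]
            result[path] = f"{title} ({qualifier})"
        else:
            result[path] = title
    return result


pages = {
    "/w/cpp/algorithm/move": "std::move",
    "/w/cpp/utility/move": "std::move",
    "/w/cpp/container/vector": "std::vector",
}
titles = disambiguate_titles(pages)
```

Titles that are already unique, such as std::vector here, are left unchanged.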