aitjcize / cppman

C++ 98/11/14 manual pages for Linux/MacOS
GNU General Public License v3.0
1.27k stars 79 forks

A more robust crawler #147

Closed glenvt18 closed 1 year ago

glenvt18 commented 1 year ago

This PR fixes and enhances the crawler, which currently loses quite a lot of URLs due to network errors. For example, the number of pages in the cppreference.com table increases from 4286 to 5341. Now all failed downloads are collected and retried after each round of crawling. The process continues until either 1) there are no failed downloads left or 2) no download progress is seen for a number of consecutive rounds (self.max_failed_retries).
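The retry logic described above can be sketched roughly as follows. This is a simplified illustration, not the code from the PR: the function name `crawl_with_retries` and the `fetch` callback are hypothetical, and the sketch assumes a fixed set of URLs, whereas a real crawler also discovers new links while it runs. Only the stopping rule (no failures left, or no progress for `max_failed_retries` rounds) is taken from the description.

```python
from typing import Callable, Iterable, Set


def crawl_with_retries(
    urls: Iterable[str],
    fetch: Callable[[str], bool],  # hypothetical: returns True on success
    max_failed_retries: int = 3,
) -> Set[str]:
    """Fetch every URL, retrying the failed ones round by round.

    Returns the set of URLs that still failed when the loop stopped.
    """
    pending = set(urls)
    stalled_rounds = 0  # consecutive rounds without a single success

    while pending and stalled_rounds < max_failed_retries:
        # One round: try everything that is still pending, collect failures.
        failed = {url for url in pending if not fetch(url)}

        if len(failed) < len(pending):
            stalled_rounds = 0  # at least one download succeeded
        else:
            stalled_rounds += 1  # no progress in this round

        pending = failed

    return pending
```

Counting only consecutive rounds without progress means that a slow but steadily shrinking failure list is never abandoned early, while a set of permanently dead URLs ends the loop after a bounded number of rounds.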

The PR also fixes the issue with multiple pages having the same title, such as the two different pages both titled std::move. In such cases a part of the path is added to the title, resulting in std::move (algorithm) and std::move (utility). There are 126 such pages (across both cplusplus.com and cppreference.com).
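One way to picture the title disambiguation is the sketch below. It is an assumption-laden illustration rather than the PR's implementation: the function name `disambiguate_titles` is hypothetical, and the choice of the second-to-last URL path segment as the suffix is a guess that happens to fit the std::move example; the PR only says that "a part of the path" is used.

```python
from collections import Counter
from typing import Dict, List, Tuple
from urllib.parse import urlparse


def disambiguate_titles(pages: List[Tuple[str, str]]) -> Dict[str, str]:
    """Map each URL to its title, adding a path-based suffix to duplicates.

    `pages` is a list of (title, url) pairs.
    """
    counts = Counter(title for title, _ in pages)
    result: Dict[str, str] = {}

    for title, url in pages:
        if counts[title] > 1:
            segments = [s for s in urlparse(url).path.split("/") if s]
            # Assumption: the parent path segment names the header or group.
            if len(segments) >= 2:
                title = f"{title} ({segments[-2]})"
        result[url] = title

    return result


pages = [
    ("std::move", "https://en.cppreference.com/w/cpp/algorithm/move"),
    ("std::move", "https://en.cppreference.com/w/cpp/utility/move"),
    ("std::swap", "https://en.cppreference.com/w/cpp/algorithm/swap"),
]
titles = disambiguate_titles(pages)
```

Titles that occur only once, like std::swap here, are left untouched, so existing lookups for unique names keep working.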