bencabrera / grawitas

Grawitas is a lightweight, fast parser for Wikipedia talk pages that takes the raw Wikipedia-syntax and outputs the structured content in various formats.
MIT License

Underscore ('_') in page titles for cli_crawler #18

Open lewishamulton opened 2 years ago

lewishamulton commented 2 years ago

I've noticed a small issue in the Grawitas cli_crawler: if you pass it a .txt file of page titles in which the words of the titles are separated by underscores, the crawler goes into an infinite loop.

The problem is that crawling.cpp keeps trying to download non-existent archives of the page. If the page does exist, for example the page Vince Staples, Grawitas correctly parses the page that was listed in the .txt file as Vince_Staples. It then goes into an infinite loop, as shown in the screenshot (for reference, Vince Staples has no archives).

For a page that does not exist at all, it goes straight into an infinite loop without parsing anything.

Screenshot 2022-01-07 at 15 57 51

I believe (although I'm not certain) this is because line 159 of crawling.cpp uses remove_if: page_progress.erase(std::remove_if(page_progress.begin(), page_progress.end(), [&result](const std::pair<std::string,int>& page)

Since remove_if operates on the range [first_element, last_element), if the last result to be processed from page_progress is missing, then due to this range it will never be removed. Hence the condition of the while loop on line 100 of crawling.cpp will never be satisfied, and the crawler will never stop downloading non-existent archives.
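The range theory can be checked in isolation. The sketch below is not the Grawitas code: PageProgress, drop_finished and last_entry_is_removed are hypothetical names, and the title-equality predicate is an assumption, since the real lambda body is not shown above. Because end() is a past-the-end iterator, the half-open range [begin, end) does include the final element, so the erase-remove idiom on its own should not skip the last entry.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the crawler's bookkeeping: (title, archive counter).
using PageProgress = std::vector<std::pair<std::string, int>>;

// Erase every entry whose title equals finished_title (assumed predicate).
void drop_finished(PageProgress& page_progress, const std::string& finished_title) {
    page_progress.erase(
        std::remove_if(page_progress.begin(), page_progress.end(),
                       [&finished_title](const std::pair<std::string, int>& page) {
                           return page.first == finished_title;
                       }),
        page_progress.end());
}

// Returns true when an entry sitting in the last position is erased as expected.
bool last_entry_is_removed() {
    PageProgress page_progress = {{"First page", 1}, {"Vince Staples", 1}};
    drop_finished(page_progress, "Vince Staples");
    return page_progress.size() == 1 && page_progress.front().first == "First page";
}
```

If last_entry_is_removed() returns true, the range itself is fine, and a stuck entry would more likely come from the predicate never matching.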

I've found a fix for it, although I'm not sure why specifically an underscore in the title sends it into the infinite loop. When testing with a file of incorrect page titles without underscores, it does not go into an infinite loop. This leads me to believe my theory about the remove_if(..) might not be entirely correct, as I would expect such an input file to cause an infinite loop as well, with the last missing page result never being removed from page_progress.
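One possible explanation, unconfirmed against crawling.cpp: MediaWiki treats underscores and spaces in titles as equivalent and reports titles with spaces, so an entry stored as Vince_Staples might never compare equal to the title that comes back, and would then never be erased. Under that assumption, normalizing titles as they are read from the input file would avoid the mismatch. normalize_title below is a hypothetical helper, not an existing Grawitas function, and it is not necessarily the fix mentioned above.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical helper (not part of Grawitas): replace underscores with spaces
// so a title from the input file matches the spaced form of the same title.
std::string normalize_title(std::string title) {
    std::replace(title.begin(), title.end(), '_', ' ');
    return title;
}
```

Applying this once at input time would keep the stored title and any later comparison in the same form, whatever the comparison actually looks like.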

Hope this makes sense.