approach0 / search-engine

A math-aware search engine.
http://approach0.xyz
MIT License

AoPS (Art of Problem Solving) forum crawler #23

Closed TheSil closed 6 years ago

TheSil commented 6 years ago

Made a simple AoPS crawler. It works by specifying either a single post or a time range (currently relative dates, not absolute ones). The AoPS forums do not work on the same principle as Math.SE, which is why I use time (number of days) rather than number of pages. I also had to use AJAX PHP requests, since that is how the data are fetched.

Anyway, a few examples to try:

-c 6 -p 1639347
This fetches https://artofproblemsolving.com/community/c6h1639347_sum_of_squares.

-c 7 -n 0 -o 7
This fetches posts from category 7 (https://artofproblemsolving.com/community/c7) that are up to 7 days old.

-c 7 -n 7 -o 14
This fetches posts from category 7 (https://artofproblemsolving.com/community/c7) that are 7-14 days old.
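To make the flag semantics in these examples concrete, here is a minimal Python sketch of how such a command line could be parsed and turned into a time window. It is not the crawler's actual code: the short flags come from the examples, but the long option names, the defaults, and the helpers `build_parser` and `time_window` are assumptions made for illustration.

```python
import argparse
from datetime import datetime, timedelta


def build_parser():
    # Hypothetical parser. Only the short flags (-c, -p, -n, -o) appear in
    # the examples; long names, defaults and help texts are assumptions.
    parser = argparse.ArgumentParser(description="AoPS crawler (sketch)")
    parser.add_argument("-c", "--category", type=int, required=True,
                        help="forum category id, e.g. 7 for College Math")
    parser.add_argument("-p", "--post", type=int, default=None,
                        help="fetch a single post (topic) by id")
    parser.add_argument("-n", "--newest", type=int, default=0,
                        help="newest end of the range, in days before now")
    parser.add_argument("-o", "--oldest", type=int, default=None,
                        help="oldest end of the range, in days before now")
    return parser


def time_window(newest_days, oldest_days, now):
    """Convert relative day offsets into an absolute (start, end) window."""
    start = now - timedelta(days=oldest_days)
    end = now - timedelta(days=newest_days)
    return start, end


if __name__ == "__main__":
    args = build_parser().parse_args(["-c", "7", "-n", "7", "-o", "14"])
    print(args.category, time_window(args.newest, args.oldest, datetime.now()))
```

Computing the window from an explicit `now` also shows how the relative API could later be replaced by absolute dates: only the two datetimes would need to come from the command line instead.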

A few considerations: The relative time API might not be the best choice; it could perhaps be improved by using absolute times, which I could do in the future. The biggest challenge I see is organizing the different forum categories, but it seems you can just fetch the following categories to get the vast majority of posts on AoPS:

-c 3 (Middle School Math, > 33k posts)
-c 4 (High School Math, > 71k posts)
-c 5 (Contests & Programs, > 17k posts)
-c 6 (High School Olympiads, > 214k posts)
-c 7 (College Math, > 78k posts)

As for the folder and file layout, I have added the prefix aos to files in order to distinguish between SE and AoPS posts; you might want to adjust that.

I am using a few more libs, though: slimit for JavaScript parsing, and certifi, since on my machine the SSL links did not work (maybe a Windows-only issue, I don't know). Along the way I also encountered a few issues with encoding and parsing, even in the Math.SE posts, so I fixed those. Feel free to omit or remove those fixes, but I believe they improved the parser.

Although it works fine, I imagine there might be some additional updates, such as moving the common crawler code into a shared file, but at this point I mainly cared about functionality, to get this going.

TheSil commented 6 years ago

I wonder why this Travis build failed, but it seems to be unrelated to the Python changes in this request (C compilation issues).

w32zhong commented 6 years ago

Thank you!

The Travis failure is not your issue; it is fixed in my research branch (24f9afa9b8e09f6b55b582ab261f27a95fc6d3a4).

TheSil commented 6 years ago

Good to hear, hope it works for you! How are the crawlers scheduled by the way? I mean how often do you fetch the data?

Also, a side note on the bbcode stuff etc.: it was mainly because the HTML previews looked shitty when I opened them in Chrome (it marked every instance of it red, which is quite a lot on some pages...). I also believe that not having certifi caused issues with Math.SE.

w32zhong commented 6 years ago

Yes, it seems to be working perfectly. Again, I am very thankful that you sent me this request. I was just trying to bring the patches from my research branch to master (there are some important updates to the crawler scripts; e.g., Math Stack Exchange has recently become very sensitive to crawling and very often blocks our scripts if we keep a constant request frequency, see: 6b9cf283b44b8f8cd090e611156a0716f2d76bbe). Sorry for the delay in getting back to you.

As for your question, it is a little embarrassing that I do not have a script for updating the index regularly, and the online version of Approach0 has not been updated for months! (At my lab I am running a crawler and basically have an up-to-date corpus of Math Stack Exchange, but I have not had a chance to update the indices of the online demo.) In fact, I have recently been focusing on improving the model and have achieved significantly better accuracy (state-of-the-art, I assume... I am hoping I can publish it in a paper soon). I feel I do not have enough time to write an "index update scheduler" because that is not my current focus; A0 is more of a demo than a serious search engine right now.

But since you have now made it possible to index a new site, I will take some time to expand the index of approach0.xyz to let AoPS in, and see how it looks.

w32zhong commented 6 years ago

I also have a few questions I would like to ask you. @TheSil

  1. Is there any example post you recorded that shows this function is helpful? If such cases do appear frequently in AoPS, I will consider adding a replace function back. Also, you said that without bbcode your Chrome shows a lot of red instances, so I am curious how to reproduce this and why bbcode helps in this case.

  2. How do I specify the arguments so that the crawler fetches all the posts (before the current time) from a given category?

  3. The updated MSE crawler uses a simple increase-delay-and-try-again scheme to avoid a hard-coded sleep time (https://github.com/approach0/search-engine/commit/8a7a2bc043ff8c9c6510a576eab741d59dc82b03#diff-7ae68dbd7b22a3d8e6ad1f75eb526ed2R171). Where or how should I insert code to add a similar scheme to your AoPS crawler?
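For reference, an increase-delay-and-try-again scheme of the kind mentioned in item 3 can be sketched as follows. This is a minimal illustration and not the code from the linked commit: the function name `fetch_with_backoff`, its parameters, and the doubling factor are all assumptions, and `fetch` stands in for whatever function the crawler uses to request a page.

```python
import time


def fetch_with_backoff(fetch, url, initial_delay=1.0, factor=2.0,
                       max_delay=60.0, max_tries=5, sleep=time.sleep):
    """Call fetch(url); on failure wait, increase the delay, and try again.

    All names and default values here are hypothetical. `sleep` is
    injectable so the scheme can be exercised without actually waiting.
    """
    delay = initial_delay
    for attempt in range(1, max_tries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_tries:
                raise  # give up after the last attempt
            sleep(delay)
            # Grow the delay for the next round, capped at max_delay.
            delay = min(delay * factor, max_delay)
```

In a crawler, a wrapper like this would typically go around the single place where a page or AJAX response is requested, so that every request benefits from the same retry policy.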

I guess I have to sleep now, I may not reply from here very promptly.

BTW, later when I get the chance to re-index Approach0 to include AoPS, I will let you know (ping you here in this thread); I think it would be a good way to say thank you.

TheSil commented 6 years ago

Yes, there is never enough time :) Even though this is just a demo, it definitely helps a lot, so keep up the good work on the project!

As to your questions:

Ad 1. Sure, let me find a few examples. The bbcode was mostly for readability; the real issue (at least from Chrome's parsing point of view) was the \bold, \equal, \plus etc. tags. I have found a couple of extreme examples.

Also quite often there is "\/" which should really be "/", for example: