droyed / stackoverflow_tag_cloud

Generates tag clouds for Stack Overflow user profiles based on their Q&A activity. The intended application is to present a pictorial description of that activity.
MIT License

Rewrite most of the scraping logic for the revamped profile design #5

Open adeak opened 2 years ago

adeak commented 2 years ago

The profile pages on SO/SE have been completely rewritten (see announcement from December 7, 2021), which means much of this library has to be rewritten.

Since the profile pages are now an opaque mess of nested divs (starting to look a lot like Twitter's HTML), the easiest approach I could find was to look for divs with titles like this:

<div class="p12 bb bc-black-075" title="0 non-wiki questions (0 score). 70 non-wiki answers (898 score).">

Each tag on the tag page gets one of these divs, and the title attribute already gives us the tag score. Inside the div there's an element with the tag's name as its text. I didn't want to rely on those random-looking strings in the class attribute.
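A minimal sketch of this title-based matching, using only the standard library rather than whatever parser the PR itself uses. The names `TITLE_RE`, `TagScoreParser` and `parse_tag_scores` are hypothetical, and so are two assumptions about the markup: that the first non-empty text inside a matching div is the tag's name, and that counts may contain thousands separators. The sample HTML is made up around the div shown above.

```python
import re
from html.parser import HTMLParser

# Matches titles such as:
# "0 non-wiki questions (0 score). 70 non-wiki answers (898 score)."
# Assumption: numbers may contain commas and scores may be negative.
TITLE_RE = re.compile(
    r"(?P<q_count>[\d,]+) non-wiki questions? \((?P<q_score>-?[\d,]+) score\)\. "
    r"(?P<a_count>[\d,]+) non-wiki answers? \((?P<a_score>-?[\d,]+) score\)\."
)


def _to_int(text):
    """Convert '1,234' or '-5' to an int."""
    return int(text.replace(",", ""))


class TagScoreParser(HTMLParser):
    """Collect per-tag counts and scores from divs identified by their title."""

    def __init__(self):
        super().__init__()
        self.scores = {}
        self._pending = None  # parsed title still waiting for its tag name

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        title = dict(attrs).get("title") or ""
        match = TITLE_RE.search(title)
        if match:
            # Deliberately ignore the class attribute; only the title is used.
            self._pending = {k: _to_int(v) for k, v in match.groupdict().items()}

    def handle_data(self, data):
        name = data.strip()
        # Assumption: the first non-empty text after a matching div is the tag name.
        if self._pending is not None and name:
            self.scores[name] = self._pending
            self._pending = None


def parse_tag_scores(html):
    parser = TagScoreParser()
    parser.feed(html)
    return parser.scores


# Made-up sample built around the div quoted above.
sample = (
    '<div class="p12 bb bc-black-075" '
    'title="0 non-wiki questions (0 score). 70 non-wiki answers (898 score).">'
    '<div><a href="#">python</a></div></div>'
)
scores = parse_tag_scores(sample)
print(scores)
```

Matching on the title text keeps the scraper independent of the class names, at the cost of breaking if the wording of the title changes.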

I've also changed a handful of things (some of them stylistic):

No doubt the company will add arbitrary small changes in a few weeks just to break scrapers like this. Until then this should work (even if slow due to the throttling/pushbacks).

rayryeng commented 3 months ago

As of this date, I had to downgrade pillow to pillow==9.5.0. I also had to pin numpy to numpy==1.24.4 so that it works on macOS with M1/M2 chips, and I had to install lxml as well: lxml==5.2.2. Please consider updating the requirements.txt file in your PR accordingly.