fterh / rsg-retrivr

This Reddit bot is all about the "too lazy; didn't click" life
https://reddit.com/u/rsg-retrivr

Article being double posted #13

Open changhuapeng opened 6 years ago

changhuapeng commented 6 years ago

As per the title, the bot is double-posting certain articles from straitstimes.com.

The Mercury web parser seems to mistake <div class="row">, an outer div element that encapsulates the whole article, for <p class="row">. The bot then mistakenly posts the content of this <p class="row">, assuming it to be the first paragraph, before moving on to post the actual scraped paragraphs.

changhuapeng commented 6 years ago

One method is to replace the soup.select("p") at line 43 of mercury.py with soup.find_all(lambda tag: tag.name == "p" and not tag.attrs), on the assumption that all legitimate paragraphs are wrapped in <p> tags without any attributes (class, id, etc.). The mistaken <p class="row"> will then be omitted by the bot because of its class attribute.
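A minimal sketch of the difference between the two selectors, using BeautifulSoup. The HTML string is a hypothetical, simplified stand-in for the parser output described above (an outer <p class="row"> wrapping the real paragraphs), not the actual Mercury response, and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the parsed article content:
# the outer wrapper has become <p class="row"> around the real paragraphs.
html = (
    '<p class="row">'
    "<p>First paragraph.</p>"
    "<p>Second paragraph.</p>"
    "</p>"
)

soup = BeautifulSoup(html, "html.parser")

# Current behaviour: every <p> is selected, including the outer wrapper,
# whose text duplicates the text of the paragraphs nested inside it.
all_paragraphs = soup.select("p")

# Proposed behaviour: keep only <p> tags that carry no attributes at all.
plain_paragraphs = soup.find_all(lambda tag: tag.name == "p" and not tag.attrs)

plain_texts = [p.get_text() for p in plain_paragraphs]

for text in plain_texts:
    print(text)
```

With this input, the wrapper is still present in `all_paragraphs` but is filtered out of `plain_paragraphs`, so each paragraph would be posted only once.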

I have tested this solution against multiple past articles posted by the bot and found it to be generally reliable, except for this Yahoo article and this scmp.com article, where certain paragraphs are omitted by the bot because of their <p class="xxxx"> tag.
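The omission can be reproduced with a small sketch. The markup below is made up for illustration (it is not taken from the Yahoo or scmp.com pages), with "xxxx" standing in for whatever class those sites use.

```python
from bs4 import BeautifulSoup

# Made-up markup: the middle paragraph is legitimate article text,
# but it carries a class attribute ("xxxx" is a placeholder).
html = (
    "<p>Intro.</p>"
    '<p class="xxxx">Styled but legitimate paragraph.</p>'
    "<p>Outro.</p>"
)

soup = BeautifulSoup(html, "html.parser")

# The attribute-free filter drops any <p> that has a class, id, etc.
kept = [
    p.get_text()
    for p in soup.find_all(lambda tag: tag.name == "p" and not tag.attrs)
]

print(kept)
```

Here the middle paragraph is dropped even though it is real content, which is the trade-off of assuming that legitimate paragraphs never carry attributes.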