Article being double posted

fterh / rsg-retrivr

This Reddit bot is all about the "too lazy; didn't click" life

One method is to replace soup.select("p"), at line 43 of mercury.py, with soup.find_all(lambda tag: tag.name == "p" and not tag.attrs), on the assumption that all legitimate paragraphs are wrapped in <p> tags with no attributes (class, id, etc.). The mistaken <p class="row"> would then be omitted by the bot because of its class attribute.

I have tested this solution on multiple past articles posted by the bot and found it generally reliable, except for this Yahoo article and this scmp.com article, where certain paragraphs are omitted by the bot because of their <p class="xxxx"> tags.
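
A minimal sketch of the proposed filter and its limitation, assuming BeautifulSoup 4 with the built-in html.parser. The sample markup, the class name "canvas-text", and the helper name is_plain_paragraph are hypothetical and are not taken from mercury.py or from any real article.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from any real article.
SAMPLE_HTML = """
<div>
  <p>First paragraph of the article.</p>
  <p class="row">Repeated teaser text that should be skipped.</p>
  <p>Second paragraph of the article.</p>
  <p class="canvas-text">A legitimate paragraph that carries a class.</p>
</div>
"""


def is_plain_paragraph(tag):
    """Match <p> tags that have no attributes at all (no class, id, etc.)."""
    return tag.name == "p" and not tag.attrs


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Current behaviour: every <p> is collected, including <p class="row">.
all_paragraphs = [p.get_text() for p in soup.select("p")]

# Proposed behaviour: only attribute-free <p> tags are collected.
plain_paragraphs = [p.get_text() for p in soup.find_all(is_plain_paragraph)]

print(len(all_paragraphs), "paragraphs with select('p')")
print(len(plain_paragraphs), "paragraphs with the attribute-free filter")
```

In this sample the filter drops the <p class="row"> teaser as intended, but it also drops the last paragraph, which is the same kind of omission seen with the two articles mentioned above.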

Article being double posted #13