fake-news-detector / api

API for saving news flagging by the users
https://fake-news-detector-api.herokuapp.com/

Add page scrapping #14

rogeriochaves closed this 6 years ago

rogeriochaves commented 6 years ago

This partially solves #2. Only partially, because it saves data just for new links: it does not update the old ones (we might need a job for that), nor does it use the text for predictions yet.

I wanted to get the article text out of a page, and only the article text, without the HTML or unimportant content such as menus and ads. This is not easy at all, so I searched for existing libraries.

First, I searched for something in Rust, but couldn't find anything. Then I started searching in general, and a lot of links pointed to Python.

I found python-goose, but it seems outdated. Then I found newspaper, which seems awesome! Very popular and actively updated, but... it didn't work very well on the links I tested. I spent quite a while trying to make it work well, but it does not have many options.

Then I went back to the general search, found some material on Node.js scraping, and ended up bumping into unfluff, which worked very well!

Some links I've tested with:

unfluff extracted the text from them pretty well! The exception was the last 2, where it returned only a smaller part of the text.

For Facebook links it didn't work, and it's important that it does, since we can flag Facebook page posts. So for Facebook I manually wrote a very simple scraper in Rust, to use with links like this: https://www.facebook.com/VerdadeSemManipulacao/videos/479313152193503/
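To illustrate the split described above (Facebook links go to the hand-written scraper, everything else goes to unfluff), here is a minimal sketch of how the choice could be made from the URL's host. This is not the code from this PR: the `Scraper` enum and the `host_of` and `choose_scraper` functions are hypothetical names made up for the example, and the host parsing is deliberately simplistic (standard library only, no IPv6 support).

```rust
/// Hypothetical: which extraction strategy to use for a flagged link.
#[derive(Debug, PartialEq)]
enum Scraper {
    /// The simple hand-written scraper for Facebook posts.
    Facebook,
    /// Generic article extraction (unfluff) for everything else.
    Unfluff,
}

/// Hypothetical helper: returns the lowercased host of an http(s) URL,
/// without userinfo or port. Returns None for other schemes or empty hosts.
fn host_of(url: &str) -> Option<String> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    // The authority ends at the first '/', '?' or '#'.
    let authority = rest.split(|c: char| c == '/' || c == '?' || c == '#').next()?;
    // Drop an optional "user:pass@" prefix, then an optional ":port" suffix.
    let host_port = authority.rsplit('@').next()?;
    let host = host_port.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host.to_lowercase())
    }
}

/// Hypothetical dispatcher: Facebook hosts get the Facebook scraper,
/// any other link (including unparseable ones) falls back to unfluff.
fn choose_scraper(url: &str) -> Scraper {
    match host_of(url) {
        Some(h) if h == "facebook.com" || h.ends_with(".facebook.com") => Scraper::Facebook,
        _ => Scraper::Unfluff,
    }
}
```

With this sketch, `choose_scraper("https://www.facebook.com/VerdadeSemManipulacao/videos/479313152193503/")` yields `Scraper::Facebook`, while a regular news article URL yields `Scraper::Unfluff`. Matching on the exact host or a `.facebook.com` suffix avoids misrouting look-alike domains such as `notfacebook.com`.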