Coffeeboys / VanCoffee

An exploration of roasts and roasteries in Metro Vancouver

Web scraping roastery websites #1

Open emreerhan opened 6 years ago

emreerhan commented 6 years ago

Building the dataset will be the first, and maybe the most difficult, step.

To-do:

landalex commented 6 years ago

Scrapy has a nicer website so it's obviously superior.

emreerhan commented 6 years ago

Scrapy's documentation looks nicer too.

emreerhan commented 6 years ago

https://hexfox.com/p/scrapy-vs-beautifulsoup/

So the difference between the two is actually quite large: Scrapy is a tool specifically created for downloading, cleaning and saving data from the web and will help you end-to-end; whereas BeautifulSoup is a smaller package which will only help you get information out of webpages.

Afterwards the entire article is about why you should use Scrapy lol

landalex commented 6 years ago

Looking at Scrapy, the syntax is kind of cumbersome. You need to select the elements of the page you want to extract data from, but we can't feasibly write a different scraper for each site. So either we use some basic approach to finding/selecting elements, or we grab all the text and use a more language-based approach to find coffee descriptions/information in it.

The language-based approach seems both easier and more reliable since web pages vary so wildly. In that case, BeautifulSoup is nice because it has a method to just grab all the text from a page. Maybe using both, Scrapy for the scraper to traverse sites and BeautifulSoup to grab the text?
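A minimal sketch of the "grab all the text" half of that idea, assuming `beautifulsoup4` is installed. The sample HTML and its contents are made up for illustration; a crawler (Scrapy or otherwise) would supply real pages:

```python
from bs4 import BeautifulSoup

def page_to_text(html: str) -> str:
    """Flatten a page to its visible text for a language-based pass."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop script/style blocks so their contents don't pollute the text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)

sample = """
<html><body>
  <script>var x = 1;</script>
  <h1>Our Coffees</h1>
  <p>A light roast with notes of citrus.</p>
</body></html>
"""
print(page_to_text(sample))
```

`get_text(separator=" ", strip=True)` is the BeautifulSoup method mentioned above; the separator keeps text from adjacent tags from running together.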

landalex commented 6 years ago

Now I'm in an NLP rabbit hole: spaCy and Prodigy look interesting, specifically Prodigy for allowing faster annotation of data.

emreerhan commented 6 years ago

BeautifulSoup is nice because it has a method to just grab all the text from a page

That's not a bad idea. Right now my idea is to search for "hits" to a whitelist of roasteries from cafe websites. I don't know if we need any NLP for this. It might be as simple as a regex. Although I'm definitely not opposed to it if you can think of something clever.
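The regex idea could be as simple as the sketch below. The whitelist here only holds the two roasteries named in this thread; a real list would be longer:

```python
import re

# Hypothetical whitelist; the real one would cover all local roasteries.
ROASTERIES = ["Matchstick", "Agro"]

def find_roastery_hits(text: str) -> set:
    """Return the whitelisted roastery names mentioned in the text."""
    hits = set()
    for name in ROASTERIES:
        # Word-boundary-ish match so "Agro" doesn't hit "agronomy".
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, text, re.IGNORECASE):
            hits.add(name)
    return hits

print(find_roastery_hits("We proudly serve Matchstick espresso."))
```

Running this against the flattened text of each cafe page would give a cafe-to-roastery mapping with no NLP at all.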

EChisholm commented 6 years ago

I decided to take a crack at writing a general scraper with the Requests library to fetch the HTML and Beautiful Soup to crawl the site for the rest of the pages. It's still very much a work in progress, so I won't be pushing my work so far for a bit. My baseline has been working with Matchstick's site and Agro's site, but Agro's site is unfortunately not very deep... they don't even list details of their coffees. I haven't looked deeply into the other roasters' sites, but it'll be pretty disappointing to find out that most of them don't even have relevant roast details available...
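For anyone following along, a rough sketch of the Requests + Beautiful Soup crawl described above: fetch a page, pull its same-domain links, repeat breadth-first. This is a stand-in with placeholder names, not the actual work-in-progress code:

```python
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup

def extract_links(html, base_url):
    """Return absolute same-domain links found in the page."""
    soup = BeautifulSoup(html, "html.parser")
    base_host = urlparse(base_url).netloc
    links = set()
    for a in soup.find_all("a", href=True):
        url = urljoin(base_url, a["href"])
        if urlparse(url).netloc == base_host:
            links.add(url.split("#")[0])  # drop fragment anchors
    return links

def crawl(start_url, max_pages=20):
    """Breadth-first crawl collecting page HTML, capped at max_pages."""
    seen, queue, pages = set(), [start_url], {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        resp = requests.get(url, timeout=10)
        pages[url] = resp.text
        queue.extend(extract_links(resp.text, url) - seen)
    return pages
```

Restricting to same-domain links keeps the crawler on one roastery's site instead of wandering off to Instagram or suppliers.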

emreerhan commented 6 years ago

@EChisholm I think a good first step is just mapping which independent cafes serve which local roasts. I agree, let's avoid tasting notes and other roast details for now.