lumodon commented 8 years ago

Description

"Web scraping is a technique in data extraction where you pull information from websites." *1 Create a web scraper that gathers information from the web. The tutorial listed below walks you through setting up your own crawler, step by step. Modify the tutorial to fit a subject that interests you.

Get with your team and brainstorm what types of information or data would be a good fit for scraping. Information that is presented directly on a page, such as images in a list of Pokémon, is best suited. Data that is fetched from a single database, such as stocks from the stock market, is not; in that case you would be better off hoping the host has an API, such as the Google Books API.
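To illustrate why uniformly presented content is a good fit, here is a minimal sketch that pulls image data out of a small page fragment. The HTML, the `extractImages` name, and the file paths are all made up for illustration. The regular expression only works because every entry follows the same pattern; a real scraper would more likely hand the markup to an HTML parsing library.

```javascript
// Hypothetical page fragment: every entry follows the same <li><img ...></li> pattern.
const html = `
<ul class="pokemon">
  <li><img src="/img/001.png" alt="Bulbasaur"></li>
  <li><img src="/img/004.png" alt="Charmander"></li>
  <li><img src="/img/007.png" alt="Squirtle"></li>
</ul>`;

// Collect the src and alt attributes of each <img> tag.
// This relies on the attributes always appearing in the same order,
// which is exactly the kind of consistency that makes a page scrapeable.
function extractImages(markup) {
  const pattern = /<img\s+src="([^"]*)"\s+alt="([^"]*)"/g;
  const images = [];
  for (const match of markup.matchAll(pattern)) {
    images.push({ src: match[1], name: match[2] });
  }
  return images;
}

console.log(extractImages(html));
```

If the markup varied from entry to entry, this approach would break quickly, which is one reason inconsistent pages make poor scraping targets.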

Context

Scraping has many practical uses, and many companies employ the technique today. What types of data are a good fit for scraping? Why some types and not others? This project also gives you a practical example of why efficient code matters, since your crawler will push the limits of the process it runs in.

Specifications

Use https://scotch.io/tutorials/scraping-the-web-with-node-js as a resource. You may use the same libraries as the tutorial, but preferably explore alternative libraries that accomplish the same end result.

[ ] Complete all the steps in the tutorial.
[ ] Spec two.
[ ] Spec three.
Required
[ ] The artifact produced is properly licensed, preferably with the MIT license.
Quality Rubric
Quality rubric one: point value
Quality rubric two: point value
Quality rubric three: point value

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

*1: Kukic, Ado. "Scraping the Web With Node.js." Scotch. The Scotch Family, 13 Mar. 2014. Web. https://scotch.io/tutorials/scraping-the-web-with-node-js. Accessed 26 Oct. 2016.

TODO

I will list review and edit ideas for this project here over the weekend:

[ ] Talk about distributed scraping, legal gray areas, server overloading, the potential for being (mistakenly) identified as a DDoS attack when doing innocuous scraping, and using sleep functions to minimize 'damage' (better word for damage?)
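As a rough sketch of the sleep idea in the TODO item above, the snippet below pauses between requests so the target host never sees a burst of traffic. `sleep`, `crawlPolitely`, and the `fetchPage` parameter are hypothetical names; `fetchPage` stands in for whatever function your scraper uses to download a page, and the right delay depends on the site.

```javascript
// Resolve after `ms` milliseconds; this is the "sleep" between requests.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Visit each URL in turn, waiting `delayMs` between requests.
// `fetchPage` is supplied by the caller (an assumption of this sketch):
// it takes a URL and returns the page body or a promise of it.
async function crawlPolitely(urls, fetchPage, delayMs) {
  const pages = [];
  for (let i = 0; i < urls.length; i++) {
    pages.push(await fetchPage(urls[i]));
    if (i < urls.length - 1) {
      await sleep(delayMs); // wait before the next request
    }
  }
  return pages;
}
```

Requests are made one at a time on purpose. Firing them all in parallel would be faster, but that is the pattern most likely to overload a server or look like an attack.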

deonna commented 8 years ago

I like the gist of this goal, and think it would be incredibly fun to work on. It'd also be a great (and perhaps frustrating) exercise in learning how to debug your code line-by-line.

Here are some questions that came up for me:

What are good candidates for "scrapeable" websites?

I'm probably not going to bother writing a scraper for a website without a large amount of useful data that is consistently formatted. Can you suggest some candidates for sites/web apps that don't have an API at the ready, but still have lots of data?

What do I do with all this data now that I've written a scraper? I'd love to see a stretch goal around this, in case folks end up finding the scraping part at the lower end of their ZPD:

Am I going to be scraping to seed a DB to do some data analysis?
Am I serving it up in a nicer front end? (e.g., lots of people hate site X's interface, but go to it because it has a lot of useful information: can I scrape it and use my own design skills to display the data with a lovely UI and pleasant user experience?)
Am I really into music, but annoyed by checking a bunch of different websites for venues with their own event calendars, and would prefer to run a cron job to scrape each of them for the latest event information and put it all in one place?

I think tweaking your spec to answer some of these questions would make it an incredibly fun and valuable project to work on.

I'm kind of inspired to write an API-related goal after reading yours...looking forward to seeing folks work on it. :D

deonna commented 8 years ago

@lumodon - I found an egghead.io resource for web scraping with X-ray that looks pretty good (and it's relatively short). Problem is, most videos require a pro account -- I can log into mine if you or anyone else wants to watch them. Lemme know!

alfonsotech commented 7 years ago

I'm totally interested in working on this goal, and I would propose scraping the Stanford Encyclopedia of Philosophy (lots of data (entries) formatted in the same way, no public API). I built this as a way to get at the entries because I didn't know how to build a scraper, but scraping would be ideal for this project.

alfonsotech commented 7 years ago

@qweenwasabi and @alfonsotech were doing this goal using the linked tutorial, and we found that one of the challenges is figuring out how to get the data that is sitting in a server-side JS file to the front end so it can be rendered. The tutorial does not cover rendering. We found there are two ways to do this: 1) make an AJAX call from your front end to your server/backend to get the data; or 2) use a templating view engine like Pug.
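A minimal sketch of the first option, using only Node's built-in `http` module: the server exposes the scraped data as JSON so the front end can request it. The `/articles` route, the sample data, and the `handleRequest` name are assumptions for illustration, not part of the tutorial.

```javascript
const http = require('http');

// Hypothetical scraped data that lives on the server.
const articles = [{ title: 'Example entry', url: 'https://example.com/entry' }];

// Answer GET /articles with the scraped data as JSON; anything else gets a 404.
function handleRequest(req, res) {
  if (req.method === 'GET' && req.url === '/articles') {
    res.writeHead(200, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify(articles));
  } else {
    res.writeHead(404, { 'Content-Type': 'text/plain' });
    res.end('Not found');
  }
}

// Create the server; call server.listen(<port>) to start accepting requests.
const server = http.createServer(handleRequest);
```

Once the server is listening, front-end code could request `/articles` (for example with `fetch` or `XMLHttpRequest`) and render the returned array. The templating option does the rendering on the server instead, so no extra request is needed.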

alfonsotech commented 7 years ago

One more thing we learned was how to handle asynchronous JavaScript calls. There are two ways to do this: 1) use promises; or 2) use a callback structure like this:

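function getArticles(callback) {
  var articles = []; // filled in by the scraping code
  // executable code
  callback(articles); // hand the results to the caller once they are ready
}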
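For comparison, the promise option might look like the sketch below. The article values and the `setTimeout` delay are placeholders standing in for a real scraping step, and `main` is a hypothetical name.

```javascript
// Hypothetical stand-in for the scraping step: resolves with the articles
// asynchronously, the way a network request would.
function getArticles() {
  return new Promise((resolve) => {
    const articles = ['first article', 'second article'];
    setTimeout(() => resolve(articles), 10);
  });
}

// Consume the result with .then() ...
getArticles().then((articles) => {
  console.log(articles.length + ' articles scraped');
});

// ... or with async/await, which reads more like synchronous code.
async function main() {
  const articles = await getArticles();
  return articles;
}
```

Either way, the caller only touches `articles` after the asynchronous work has finished, which is the same guarantee the callback version gives.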

GuildCrafts / web-development-js

Build a Node.js Web Crawler / Web Scraper #102
