GuildCrafts / web-development-js

Craft repository for Web Development with JavaScript
http://jsdev.learnersguild.org/

Build a Node.js Web Crawler / Web Scraper #102

Open lumodon opened 7 years ago

lumodon commented 7 years ago

Description

"Web scraping is a technique in data extraction where you pull information from websites." *1 Create a web scraper that gathers information from the web. The tutorial listed below walks you step by step through setting up your own crawler. Modify the tutorial to fit a subject that interests you.

Get with your team and brainstorm which types of information or data would be a good fit for scraping. Information presented directly on the page in a repeated structure, such as the images in a list of Pokémon, is best suited. Data fetched from a single database, such as stocks from the stock market, is not; in that case you are better off hoping the host has an API, such as the Google Books API.

Context

Scraping has many common, practical uses, and many companies employ the technique these days. What types of data are a good fit for scraping? Why some types and not others? This project also gives you a practical example of why efficient code matters, since a running crawler pushes the limits of its process.

Specifications

Use https://scotch.io/tutorials/scraping-the-web-with-node-js as a resource. You may use the same libraries, but preferably explore alternative libraries that still let you accomplish the same end result.
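As a rough illustration of the kind of extraction a scraper performs, here is a minimal sketch that pulls image data out of a repeated list structure. The HTML fragment, the `extractImages` name, and the image paths are all made up for this example, and the regular expression is only a stand-in: a real scraper would download the page and use a proper HTML parser.

```javascript
// Hypothetical page fragment; a real scraper would download this HTML.
const html = `
<ul class="pokemon">
  <li><img src="/img/bulbasaur.png" alt="Bulbasaur"></li>
  <li><img src="/img/charmander.png" alt="Charmander"></li>
  <li><img src="/img/squirtle.png" alt="Squirtle"></li>
</ul>`;

// Pull the src and alt attributes out of every <img> tag in the markup.
// Assumption: attributes appear in src-then-alt order, as in the sample.
function extractImages(markup) {
  const pattern = /<img\s+src="([^"]+)"\s+alt="([^"]+)"/g;
  const results = [];
  let match;
  while ((match = pattern.exec(markup)) !== null) {
    results.push({ src: match[1], name: match[2] });
  }
  return results;
}

console.log(extractImages(html));
```

The repeated `<li>` structure is what makes this data a good fit: one pattern matches every item.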


Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

*1: Kukic, Ado. "Scraping the Web With Node.js." Scotch. The Scotch Family, 13 Mar. 2014. Web. https://scotch.io/tutorials/scraping-the-web-with-node-js. Accessed 26 Oct. 2016.

TODO

Will list review and edit ideas to perform on this project over the weekend here:

deonna commented 7 years ago

I like the gist of this goal, and think it would be incredibly fun to work on. It'd also be a great (and perhaps frustrating) exercise in learning how to debug your code line-by-line.

Here are some questions that came up for me:

What are good candidates for "scrapeable" websites?

What do I do with all this data now that I've written a scraper? I'd love to see a stretch goal around this, in case folks end up finding the scraping part at the lower end of their ZPD:

I think tweaking your spec to answer some of these questions would make it an incredibly fun and valuable project to work on.

I'm kind of inspired to write an API-related goal after reading yours...looking forward to seeing folks work on it. :D

deonna commented 7 years ago

@lumodon - I found an egghead.io resource for web scraping with X-ray that looks pretty good (and it's relatively short). Problem is, most videos require a pro account -- I can log into mine if you or anyone else wants to watch them. Lemme know!

alfonsotech commented 7 years ago

I'm totally interested in working on this goal, and I would propose scraping the Stanford Encyclopedia of Philosophy (lots of data (entries) formatted in the same way, no public API). I built this as a way to get at the entries because I didn't know how to build a scraper, but scraping would be ideal for this project.

alfonsotech commented 7 years ago

@qweenwasabi and @alfonsotech were doing this goal using the linked tutorial, and we found that one of the challenges is figuring out how to get the data sitting in a server-side JS file to the front end so it can be rendered. The tutorial does not cover rendering. We found two ways to do this: 1) make an AJAX call from your front end to your server/backend to get the data; or 2) use a templating view engine like pug.
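Both options can be sketched with Node's built-in `http` module. This is only an illustration under assumptions: the `articles` data, the `/api/articles` route, and the function names are hypothetical, and the template literal in `renderPage` stands in for a view engine such as pug.

```javascript
const http = require('http');

// Hypothetical scraped data held in memory on the server.
const articles = [
  { title: 'Example entry', url: 'https://example.com/entry' },
];

// Option 2 stand-in: build the HTML on the server. A view engine would
// replace this template literal in a real project. No HTML escaping is
// done here, which is only acceptable for trusted sample data.
function renderPage(items) {
  const rows = items
    .map((a) => `<li><a href="${a.url}">${a.title}</a></li>`)
    .join('');
  return `<ul>${rows}</ul>`;
}

// Route requests: /api/articles returns JSON for an AJAX call (option 1),
// and / returns server-rendered HTML (option 2).
function handleRequest(req, res) {
  if (req.url === '/api/articles') {
    res.writeHead(200, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify(articles));
  } else if (req.url === '/') {
    res.writeHead(200, { 'Content-Type': 'text/html' });
    res.end(renderPage(articles));
  } else {
    res.writeHead(404, { 'Content-Type': 'text/plain' });
    res.end('Not found');
  }
}

// Call startServer(3000) to listen; the port is an arbitrary choice.
function startServer(port) {
  return http.createServer(handleRequest).listen(port);
}
```

With the server running, front-end code could request the JSON with something like `fetch('/api/articles').then((r) => r.json())` and render the result in the browser.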

alfonsotech commented 7 years ago

One more thing we learned was how to handle asynchronous JavaScript calls. There are two ways to do this: 1) use promises; or 2) use a callback structure like this:

function getArticles(callback) {
  // executable code: asynchronous work that builds `articles`
  callback(articles);
}
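For comparison, here is a minimal sketch of the promise approach wrapped around a callback-style function. The `setTimeout` call and the sample data are stand-ins for real scraping work, and `getArticlesPromise` is a hypothetical name. Note one deliberate difference from the snippet above: this version passes an error as the first callback argument, which is the usual Node.js convention.

```javascript
// Hypothetical stand-in for a scraper's asynchronous work.
function getArticles(callback) {
  setTimeout(() => {
    const articles = [{ title: 'Example entry' }];
    // Error-first convention: null means no error occurred.
    callback(null, articles);
  }, 10);
}

// Wrap the callback version so callers can use .then() or await.
function getArticlesPromise() {
  return new Promise((resolve, reject) => {
    getArticles((err, articles) => {
      if (err) {
        reject(err);
      } else {
        resolve(articles);
      }
    });
  });
}

getArticlesPromise()
  .then((articles) => console.log(articles))
  .catch((err) => console.error(err));
```

Either style works; promises tend to read more easily once several asynchronous steps have to run in sequence.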