Hey Farhad, so I looked into it a bit, and this service might be what we need: https://proxymesh.com/. It rotates your requests among several proxy servers. I'm signing up for a free trial now; I'll try it out and see what kind of results I get.
@matseng @autumnfjeld @neaumusic @FarhadG
Okay, sorry, I completely missed the point of your post. Disregard the above. I vote that we tweak our scraper to work on any page and explore the links to fully map out the site. Of course, we'll have to implement some backchecking. I'll get working on it.
Still, it would be kinda fun to use ProxyMesh somehow, heheh...
Hi Team,
Thanks for the updates!
Based on Farhad's email, it's simply not feasible to crawl all of Wikipedia. I vote for a hybrid approach to acquiring data: (1) downloading dump files from http://dumps.wikimedia.org/enwiki/20140203/ (I'm going to check out the Wiki page-to-page link records and http://stackoverflow.com/questions/12672008/wikipedia-page-to-page-links-by-pageid), and (2) using a scraper for small subsets of interesting categories while limiting the depth of the crawler.
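For reference, here's a rough, untested sketch of what option (1) could look like in TypeScript/Node: stream the gzipped pagelinks SQL dump and pull the link tuples out of the INSERT rows, rather than loading the whole file or importing it into MySQL first. The (pl_from, pl_namespace, 'pl_title') row layout and the dump filename are assumptions to double-check against the actual files on dumps.wikimedia.org.

```typescript
// Sketch: stream the gzipped pagelinks SQL dump and extract link tuples line by line.
import { createReadStream } from "fs";
import { createGunzip } from "zlib";
import * as readline from "readline";

async function extractPageLinks(dumpPath: string): Promise<void> {
  const rl = readline.createInterface({
    input: createReadStream(dumpPath).pipe(createGunzip()),
    crlfDelay: Infinity,
  });

  // One (pl_from, pl_namespace, 'pl_title') tuple per match -- verify against the real dump.
  const row = /\((\d+),(\d+),'((?:[^'\\]|\\.)*)'\)/g;

  for await (const line of rl) {
    if (!line.startsWith("INSERT INTO")) continue; // skip schema and comment lines
    let m: RegExpExecArray | null;
    while ((m = row.exec(line)) !== null) {
      const [, fromPageId, namespace, title] = m;
      if (namespace !== "0") continue;             // keep main-namespace targets only
      console.log(fromPageId, "->", title);        // TODO: persist to our DB instead
    }
  }
}

extractPageLinks("enwiki-20140203-pagelinks.sql.gz").catch(console.error);
```

The same pattern should also work on the page SQL dump if we need to map page IDs back to titles.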
Tomorrow I'm going to work on a flow chart for our minimum viable product (MVP), download some of the Wiki SQL files to an external hard drive, and work on some Angular. The first thing I'll do is email the outline for the MVP and ask for everyone's feedback.
Best, Michael
Looking forward to that, Michael. I've downloaded the Wikipedia XML, and after extracting it, it's about 50GB of information (excluding images). Next up is figuring out how to open these files on a Mac in a timely fashion (from the little research I've done so far, it could take a day or more to render them?).
I'll keep you guys updated on this.
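One thought on the opening-50GB-on-a-Mac problem: rather than opening the extracted XML at all, it could be streamed and handled one page at a time. Here's a rough, untested sketch using the npm sax package; the title/text element names follow the standard dump layout, and the filename is a guess.

```typescript
// Sketch: stream the extracted enwiki XML dump page by page instead of opening it whole.
import { createReadStream } from "fs";
import * as sax from "sax";

const parser = sax.createStream(true); // strict XML parsing
let currentTag = "";
let pageTitle = "";
let pageText = "";

parser.on("opentag", (node) => { currentTag = node.name; });
parser.on("text", (chunk) => {
  if (currentTag === "title") pageTitle += chunk;
  if (currentTag === "text") pageText += chunk;
});
parser.on("closetag", (name) => {
  currentTag = "";
  if (name !== "page") return;
  // Count [[wikilinks]] in the raw wikitext as a cheap stand-in for real link extraction.
  const links = pageText.match(/\[\[[^\]|#]+/g) || [];
  console.log(pageTitle, "->", links.length, "outbound links");
  pageTitle = "";
  pageText = "";
});

createReadStream("enwiki-20140203-pages-articles.xml").pipe(parser);
```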
But the sub-section would be a great idea and perhaps more focused. That said, down the road, we could do all of Wikipedia and submit it to Hacker News, since the project is much larger than I had expected (not many have done so).
Anyway, I think Joseph's idea of having a small-to-medium site fully scraped by our demo day would be AWESOME. So, once we get this sub-category section done (whatever the MVP ends up being), it would be cool if we could also submit a smaller site and have the entire thing mapped out into our database. Just some food for thought (perhaps more wishful thinking).
Sub-categories are listed on Wikipedia (e.g. http://en.wikipedia.org/wiki/Portal:Technology), so we could pick a sub-section of Wikipedia that would interest the class and have that mapped out.
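If we go the sub-section route, one alternative to scraping the Portal page's HTML is the public MediaWiki API, which can list the pages in a category directly. A rough, untested sketch (assumes a runtime with a global fetch; "Technology" is just an example category name):

```typescript
// Sketch: pull every page title in one Wikipedia category via the MediaWiki API,
// following cmcontinue until the list is exhausted.
const API = "https://en.wikipedia.org/w/api.php";

async function categoryMembers(category: string): Promise<string[]> {
  const titles: string[] = [];
  let cmcontinue: string | undefined;

  do {
    const params = new URLSearchParams({
      action: "query",
      list: "categorymembers",
      cmtitle: `Category:${category}`,
      cmlimit: "500",
      format: "json",
    });
    if (cmcontinue) params.set("cmcontinue", cmcontinue);

    const data: any = await fetch(`${API}?${params}`).then((r) => r.json());
    for (const member of data.query.categorymembers) titles.push(member.title);
    cmcontinue = data.continue?.cmcontinue; // set while more results remain
  } while (cmcontinue);

  return titles;
}

categoryMembers("Technology").then((t) => console.log(t.length, "pages in the category"));
```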
Looking forward to seeing this LIVE :)
Hi Team,
Please check out the MVP outline here: https://github.com/WikiMapper/WikiViz/issues/24
Feel free to make revisions or post comments / questions. Looking forward to everyone's feedback!
Best, M
Hi Cam,
Are you available this morning, e.g. 11am, to meet for a code review? Or suggest another time? Our group name is WikiViz.
Thanks!! Mike
On Wed, Feb 26, 2014 at 1:27 PM, Cameron Boehmer cameron@hackreactor.com wrote:
O Capable Committers of Code,
You will submit your projects to me for code reviews over the coming days and weeks.
Before you do so, consult the Project Checklist (https://github.com/hackreactor/curriculum/wiki/Projects:-The-Checklist) and the Code Review Checklist (https://github.com/hackreactor/curriculum/wiki/Projects:-Every-Code-Review-Ever).
Failure to do so will result in your project sinking into a swamp on Dagobah.
All, that is.
Cameron
The first step is to send me your repo so I can leave comments via issues with line references. I'd be happy to follow that with a face-to-face :)
So, I tested out the scraper today, and even though we have multiple instances, it's going to need a CRAP ton of time to fetch the HTML from all of the links for later parsing.
I started looking into downloading Wikipedia in HTML form so we can still scrape the pages without our GET requests being throttled. Those HTML dumps can no longer be obtained officially, however, so I'm in the process of downloading them via "other" means. Even downloading and extracting the insanely large file may take up to a day (according to the sources online).
So, we can either try this out once I finish downloading and extracting everything, by scraping the links within each page (cool to try); pick out a sub-section of Wikipedia for us to map out; or test our scraper on a smaller site by recursively going through all of the links referenced within the site.
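For the smaller-site option, here's a rough, untested sketch of a depth-limited, same-host crawl (assumes a runtime with a global fetch; the naive href regex and the depth/delay numbers are placeholders, not what our scraper actually does):

```typescript
// Sketch: breadth-first, depth-limited crawl that stays on one host and
// records page -> outbound-link edges.
const MAX_DEPTH = 2;   // placeholder
const DELAY_MS = 500;  // small pause so we don't hammer the site

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function crawl(seed: string): Promise<Map<string, string[]>> {
  const host = new URL(seed).host;
  const graph = new Map<string, string[]>(); // page URL -> outbound links
  const seen = new Set<string>([seed]);
  let frontier = [seed];

  for (let depth = 0; depth <= MAX_DEPTH && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      await sleep(DELAY_MS);
      const html = await fetch(url).then((r) => r.text()).catch(() => "");
      const links: string[] = [];
      const href = /href="(https?:\/\/[^"]+)"/g; // naive stand-in for a real HTML parser
      let m: RegExpExecArray | null;
      while ((m = href.exec(html)) !== null) {
        if (new URL(m[1]).host === host) links.push(m[1]);
      }
      graph.set(url, links);
      for (const link of links) {
        if (!seen.has(link)) {
          seen.add(link);
          next.push(link);
        }
      }
    }
    frontier = next;
  }
  return graph;
}

crawl("http://example.com/").then((g) => console.log(g.size, "pages mapped"));
```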
I'm sure there are other options that I haven't considered, so I'd love to get your guys' input.
@matseng @autumnfjeld @neaumusic @redwoodfavorite