TyJK / EchoBurst

A browser extension that uses sentiment analysis to find and highlight constructive comments on social media platforms that oppose the user's worldview, encouraging users to break out of the echo chambers the internet has allowed us to construct.

Web Scraping #9

TyJK closed this issue 5 years ago

TyJK commented 7 years ago

Web Scraping

This issue exists to keep any web scraping efforts organized. If you are going to try to scrape a URL, mention which one it is so others don't duplicate the effort.

Instructions

Sign up for Portia, a free, visual web scraping tool. Portia lets you set up simple rules for how the spider (aka web crawler) will navigate the site, and then lets you visually mark the content you want to scrape. That pattern is then applied to other pages, and you can provide multiple patterns to ensure proper scraping across multiple page formats. There are likely more efficient and clever methods of scraping, but this is the most feasible one I've found for people without specialized knowledge. If you do have that specialized knowledge, please feel free to speak up and make suggestions.
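For contributors who do have that kind of experience, here is a rough, code-based alternative. It is not part of the official instructions: it uses Scrapy (the framework Portia is built on), and the start URL and CSS selectors are placeholders that would need to be adapted to whatever site you are scraping.

```python
# A minimal Scrapy sketch (an alternative to Portia, not the project's official
# workflow). The start URL and the CSS selectors are placeholders and must be
# adapted to the target site.
import scrapy


class CommentSpider(scrapy.Spider):
    name = "comments"
    start_urls = ["https://example.com/forum"]  # placeholder URL

    def parse(self, response):
        # Collect text and only text, mirroring the Important Note below.
        for post in response.css("div.post"):  # placeholder selector
            text = " ".join(post.css("::text").getall()).strip()
            if text:
                yield {"field1": text}  # same field name as the Portia samples

        # Follow pagination links so the crawl covers more than one page.
        next_page = response.css("a.next::attr(href)").get()  # placeholder selector
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Saved as comments.py, this can be run with `scrapy runspider comments.py -o comments.jl`, which writes JSON Lines output similar to Portia's JSONL export.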

Tutorial

Tutorial Video
Portia Documentation

Important Note

Make SURE that when you have the text highlighted, it's scraping text and only text. That way you won't have to worry about it picking up images or other undesirable content.

Also, if you are able to get all your data with only one sample (you can add to the sample by clicking the little four square icon near the minus sign), do that and name it field1. This provides a standard and makes cleaning easier. If this isn't possible though, no worries.

Running the Scraper

It's hard to say how long the process will run. Scraping a single site can take several hours, depending on its size, so keep that in mind when deciding how many sites you'll scrape. Once the scraper is running, check the log as soon as you can to make sure that, in general, it's doing what you want it to.

Uploading data

One thing that wasn't mentioned in the tutorial (whoops) was how to upload. Once the run is complete, click on the item count, then go to the Export button in the top right. Select "JSONL" and download the file, then upload it to the Data folder.
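If you want to sanity-check the export before uploading it, a small Python sketch like the one below can help. The filename is a placeholder, and the field1 check only matters if you followed the naming convention above.

```python
# Optional sanity check for a Portia JSONL export before uploading it to the
# Data folder. The filename below is a placeholder.
import json

records = []
with open("scraped_site.jsonl", encoding="utf-8") as f:
    for line_number, line in enumerate(f, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            print(f"Malformed JSON on line {line_number}")

print(f"{len(records)} records loaded")

# If the single-sample convention was followed, most records should have field1.
with_field1 = sum(1 for record in records if "field1" in record)
print(f"{with_field1} records contain 'field1'")
```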

Thank you so much for your contribution!