BIDS-projects / scraper

Collects data from websites of data science institutions

Improve text collection #7

Closed don-han closed 8 years ago

don-han commented 8 years ago

Currently the scraper collects all the text within a webpage, including whitespace and other non-essential information. We need to find a pattern for what we can remove from webpages.

alvinwan commented 8 years ago

We can use BeautifulSoup to easily select data we want.
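As a rough illustration of selecting only the text we want, here is a minimal sketch. It deliberately uses the standard library's `html.parser` instead of BeautifulSoup so it runs without extra dependencies; BeautifulSoup would make the selection more convenient. The class and function names are made up for this example and are not part of the scraper.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    # Assumption: these two tags are the main source of non-content text.
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that is outside <script>/<style>.
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_visible_text(html):
    """Return the list of stripped, non-empty text chunks in an HTML string."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Calling `extract_visible_text(page_source)` on a fetched page would return a list of strings with the stylesheet and script contents already left out.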

louienicholaslee commented 8 years ago

We don't necessarily have to collect all the data for LDA, just enough to get a representative sample. With a good random sample we can get a good enough approximation, at least for initial results.
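A minimal sketch of the sampling idea, assuming we already have a list of page URLs for a site. The function name, the `k` parameter, and the optional `seed` are hypothetical and only here for illustration.

```python
import random


def sample_pages(urls, k, seed=None):
    """Return up to k URLs chosen uniformly at random, without replacement."""
    rng = random.Random(seed)  # a seed makes the sample reproducible
    urls = list(urls)
    # Never ask for more pages than the site actually has.
    return rng.sample(urls, min(k, len(urls)))
```

Passing a fixed `seed` keeps the sample stable between runs, which helps when comparing initial LDA results.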

On Friday, January 8, 2016, Alvin Wan <notifications@github.com> wrote:

> We can use BeautifulSoup to easily select data we want.

don-han commented 8 years ago

Sorry in advance for the wall of text, but below is the current text collection from dlab.berkeley.edu (a single webpage) in the form of a Python list.

You can see we have a lot of unnecessary data, but with enough cleaning we should be left with texts like "Supporting research instruction wherever it occurs", "Researchers learn about new data, software and techniques in classrooms and lecture halls, but they also learn in online courses, through webinars, at personalized workshops, during seminars and brownbags, and through one-on-one consultations and discussions. D-Lab seeks to support those learning interactions, wherever and however they take place" or "Mondays 1-3pm, and by appointment", "I love to teach and consult for novice programmers, especially in Python: I can help if you are just getting started with programming. I could also be useful if your data currently lives in something like Excel and you want to analyze and visualize your data in a reproducible, efficient way", which should tell us that D-Lab focuses on "consulting" and "workshops".

I will run my own data cleaning and report to @louienicholaslee to see what else needs to be done.

u'\n\n', u'\n ', u'\n', u'\n', u'\n', u'\n ', u'Home | D-Lab', u'\n ', u'\n@import url("http://dlab.berkeley.edu/modules/system/system.base.css?nxlpn8");\n@import url("http://dlab.berkeley.edu/modules/system/system.menus.css?nxlpn8");\n@import url("http://dlab.berkeley.edu/modules/system/system.messages.css?nxlpn8");\n@import url("http://dlab.berkeley.edu/modules/system/system.theme.css?nxlpn8");\n', u'\n', u'\n@import url("http://dlab.berkeley.edu/misc/ui/jquery.ui.core.css?nxlpn8");\n@import url("http://dlab.berkeley.edu/misc/ui/jquery.ui.theme.css?nxlpn8");\n@import url("http://dlab.berkeley.edu/misc/ui/jquery.ui.button.css?nxlpn8");\n@import url("http://dlab.berkeley.edu/misc/ui/jquery.ui.resizable.css?nxlpn8");\n@import url("http://dlab.berkeley.edu/misc/ui/jquery.ui.dialog.css?nxlpn8");\n', u'\n', u'\n@import url("http://dlab.berkeley.edu/sites/all/modules/calendar/css/calendar_multiday.css?nxlpn8");\n@import url("http://dlab.berkeley.edu/modules/comment/comment.css?nxlpn8");\n@import url("http://dlab.berkeley.edu/sites/all/modules/date/date_api/date.css?nxlpn8");\n@import url("http://dlab.berkeley.edu/sites/all/modules/date/date_popup/themes/datepicker.1.7.css?nxlpn8");\n@import url("http://dlab.berkeley.edu/sites/all/modules/date/date_repeat_field/date_repeat_field.css?nxlpn8");\n@import url("http://dlab.berkeley.edu/modules/field/theme/field.css?nxlpn8");\n@import url("http://dlab.berkeley.edu/sites/all/modules/logintoboggan/logintoboggan.css?nxlpn8");\n@import url("http://dlab.berkeley.edu/modules/node/node.css?nxlpn8");\n@import url("http://dlab.berkeley.edu/modules/search/search.css?nxlpn8");\n@import url("http://dlab.berkeley.edu/modules/user/user.css?nxlpn8");\n@import url("http://dlab.berkeley.edu/sites/all/modules/views/css/views.css?nxlpn8");\n', u'\n', u'\n@import url("http://dlab.berkeley.edu/sites/all/modules/colorbox/styles/default/colorbox_default_style.css?nxlpn8");\n@import 
url("http://dlab.berkeley.edu/sites/all/modules/ctools/css/ctools.css?nxlpn8");\n@import url("http://dlab.berkeley.edu/sites/all/modules/panels/css/panels.css?nxlpn8");\n@import url("http://dlab.berkeley.edu/sites/all/libraries/syntaxhighlighter_3.0.83/styles/shCore.css?nxlpn8");\n@import url("http://dlab.berkeley.edu/sites/all/libraries/syntaxhighlighter_3.0.83/styles/shThemeDefault.css?nxlpn8");\n@import url("http://dlab.berkeley.edu/sites/all/modules/views_slideshow/views_slideshow.css?nxlpn8");\n@import url("http://dlab.berkeley.edu/sites/all/modules/panels/plugins/layouts/threecol_33_34_33/threecol_33_34_33.css?nxlpn8");\n@import url("http://dlab.berkeley.edu/sites/all/modules/syntaxhighlighter_insert/syntaxhighlighter_insert_wysiwyg/syntaxhighlighter_insert_wysiwyg.css?nxlpn8");\n@import url("http://dlab.berkeley.edu/sites/all/modules/webform/css/webform.css?nxlpn8");\n@import url("http://dlab.berkeley.edu/sites/all/modules/views_slideshow/views_slideshow_controls_text.css?nxlpn8");\n@import url("http://dlab.berkeley.edu/sites/all/modules/views_slideshow/contrib/views_slideshow_cycle/views_slideshow_cycle.css?nxlpn8");\n@import url("http://dlab.berkeley.edu/sites/all/modules/panels/plugins/layouts/flexible/flexible.css?nxlpn8");\n@import url("http://dlab.berkeley.edu/sites/default/files/ctools/css/5163dc43e6c44826d81441d2686c43f9.css?nxlpn8");\n', u'\n', u'\n@import url("http://dlab.berkeley.edu/sites/all/themes/planetta/css/bootstrap.min.css?nxlpn8");\n@import url("http://dlab.berkeley.edu/sites/all/themes/planetta/css/main.css?nxlpn8");\n@import url("http://dlab.berkeley.edu/sites/all/themes/planetta/css/custom.css?nxlpn8");\n@import url("http://dlab.berkeley.edu/sites/all/themes/planetta/css/lightbox.css?nxlpn8");\n', u'\n', u'\n@import url("http://dlab.berkeley.edu/sites/default/files/fontyourface/wysiwyg.css?nxlpn8");\n@import url("http://dlab.berkeley.edu/sites/default/files/fontyourface/font.css?nxlpn8");\n', u'\n', u'\n', u'\n@import 
url("http://dlab.berkeley.edu/sites/default/files/fontyourface/local_fonts/Scount_Light-normal-normal/stylesheet.css?nxlpn8");\n', u'\n ', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n<![CDATA[//><!]]>\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n<![CDATA[//><!]]>\n', u'\n', u'\n', u'\n ', u'\n ', u'Skip to main content', u'\n ', u'\n \r\n\r\n', u'\r\n ', u'\r\n ', u'\r\n ', u'\r\n ', u'\r\n ', u'\r\n ', u'\r\n ', u'\r\n \r\n ', u'Intelligent research design for data intensive social science', u'\r\n \r\n ', u'\r\n ', u'\r\n ', u'\r\n\r\n ', u'About', u'About D-Lab', u'Staff', u'Dav Clark', u'FAQ', u'Contact Us', u'Join Us', u'Donate', u'Services', u'Training', u'Past Trainings', u'Consulting', u'Working Groups', u'Space', u'Resources', u'Data Resources', u'Campus Resources', u'Course List', u'Blog & Events', u'Blog', u'Campus Events', u'Calendar', u' ', u'\r\n ', u'\r\n', u'\r\n\r\n\r\n ', u'\r\n ', u'\r\n ', u'\r\n ', u'Intelligent research design for the age of data intensive social science.', u' ', u'\r\n ', u'\r\n ', u'\r\n\r\n\r\n ', u'\r\n ', u'\r\n ', u'\r\n ', u'\r\n ', u'\r\n ', u' \r\n ', u'\r\n ', u'\r\n ', u'\n ', u'\n\n \n ', u'\n ', u'\n \n \n \n ', u'\n \n', u'\n \n ', u'\n ', u'\n ', u'\n \n ', u' ', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u' ', u' \n ', u' ', u'Helping social scientists collect, process, and visualize data', u"Are you starting research or working on a project that uses data? Are you a data visualization expert looking for access to new data sets? D-Lab's collaborative environment caters to many types of data needs. 
The tools, methods, and techniques that D-Lab provides offer social scientists the ability to engage with complex research questions and produce answers that benefit academic colleagues, policymakers, and the public.\xa0", u'\n', u' ', u'\n', u'\n', u'\n ', u'\n \n ', u' ', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u' ', u' \n ', u' ', u'A flexible infrastructure of hardware, software, and above all human talent', u'D-Lab is a new lab that aims to provide services, support, and a venue for research design and experimentation in data-intensive social sciences.', u'\n', u' ', u'\n', u'\n', u'\n ', u'\n \n ', u' ', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u' ', u' \n ', u' ', u'Supporting research instruction wherever it occurs ', u'Researchers learn about new data, software and techniques in classrooms and lecture halls, but they also learn in online courses, through webinars, at personalized workshops, during seminars and brownbags, and through one-on-one consultations and discussions. 
D-Lab seeks to support those learning interactions, wherever and however they take place.', u'\n', u' ', u'\n', u'\n', u'\n', u'\n ', u'\n ', u'\n ', u'Previous', u'\n ', u'Pause', u'\n ', u'Next', u'\n', u'\n ', u'\n ', u'\n ', u'\n \n \n \n \n \n \n', u' ', u'\n', u'\n ', u'\n ', u'\r\n ', u'\r\n ', u'\r\n ', u'\r\n ', u'\r\n\r\n\r\n\r\n', u'\r\n ', u'\r\n ', u'\r\n\r\n \r\n\r\n ', u'\r\n \r\n ', u' \r\n ', u'\n ', u'\n\n \n ', u'\n ', u'\n', u'\n', u'\n ', u'\n ', u'\n', u'\n', u'\n ', u'\n', u'\n ', u'\n ', u'\n', u'\n', u'\n ', u'\n ', u'\n', u'\n', u'\n ', u'\n ', u'\n', u'\n ', u'\n', u'\n', u'\n', u'\n ', u'\n', u'\n\n', u'\n ', u'\n ', u'\n \n ', u'Upcoming Trainings', u'\n \n \n ', u'\n ', u'\n \n \n \n ', u'\n ', u'\n \n ', u' ', u' ', u' \n ', u' ', u'11-Jan-16 10:00am', u' ', u' \n ', u' ', u'INTENSIVE: QDA Day 1 - Qualitative Data Analysis with NVivo', u' ', u' ', u'\n ', u'\n \n ', u' ', u' ', u' \n ', u' ', u'12-Jan-16 10:00am', u' ', u' \n ', u' ', u'INTENSIVE: QDA Day 2 - Intro to Qualitative Data Analysis', u' ', u' ', u'\n ', u'\n \n ', u' ', u' ', u' \n ', u' ', u'12-Jan-16 1:00pm', u' ', u' \n ', u' ', u'INTENSIVE: R for Data Science Day 1 (basics of R)', u' ', u' ', u'\n ', u'\n \n \n \n \n', u'\n ', u'\n See more ', u'\n', u'\n \n \n \n', u' ', u'\n\n \n ', u'\n', u'\n \n ', u'Sign Up for Our Mailing List', u'\n \n \n ', u'\n ', u'\n ', u'Keep up to date about the latest events, trainings, and news from the D-Lab!', u'\n\n', u'\n', u'\n ', u'Email ', u'*', u'\n ', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u' ', u'\n\n \n ', u'\n', u'\n ', u'\n\n ', u'\n ', u'\n \n ', u'Happy Holidays!', u'\n \n \n ', u'\n ', u'The D-Lab will be closed December 21 thru January 1.', u'\n', u' See our ', u'calendar', u' for details of upcoming workshops.', u'\n ', u'\n\n \n ', u'\n', u'\n \n ', u'Blog', u'\n \n \n ', u'\n ', u'\n \n \n \n ', u'\n ', u'\n \n ', u' ', u' ', u' \n ', u' ', u'The Season for Sharing Data: Working with the newly 
released Census 2010-2014 ACS 5 year data in R', u' ', u' \n ', u' ', u'On December 3, 2015 the U.S. Census Bureau released the 2010-2014 5 year ACS (American Community Survey) data.', u' ', u' ', u'\n ', u'\n \n ', u' ', u' ', u' \n ', u' ', u'Are you an R Hero? Join our Team!', u' ', u' \n ', u' ', u'The D-Lab is hiring\xa0R\xa0instructors\xa0for the Spring semester to teach beginning and intermediate classes in data visualization and analysis!', u' ', u' ', u'\n ', u'\n \n \n \n \n', u'\n ', u'\n See more ', u'\n', u'\n \n \n \n', u' ', u'\n\n \n ', u'\n', u'\n ', u'\n\n ', u'\n ', u'\n \n ', u'Featured Working Group', u'\n \n \n ', u'\n ', u'\n \n \n \n ', u'\n ', u'\n \n ', u' ', u' ', u' \n ', u' ', u' ', u' \n ', u' ', u' Research in Practice Working Group', u' ', u' \n ', u' ', u"So you\u2019ve got some of your graduate classes under your belt, and it's time to begin an original research project.\xa0 But how exactly are you going to go about surveying voters in Tanzania, interviewi", u' ', u' \n ', u' ', u'TBD for Spring 2016, Kickoff meeting 12/3/15 at 4pm', u' ', u' ', u'\n ', u'\n \n \n \n \n', u'\n ', u'\n See more ', u'\n', u'\n \n \n \n', u' ', u'\n\n \n ', u'\n', u'\n \n ', u'Featured Consultant', u'\n \n \n ', u'\n ', u'\n \n \n \n ', u'\n ', u'\n \n ', u' ', u' ', u' \n ', u' ', u'Kunal Marwaha', u' ', u' \n ', u' ', u'Python (Beginning, Data Cleaning/Preparation, Workflow Automation)', u' ', u' \n ', u' ', u' Mondays 1-3pm, and by appointment', u'I love to teach and consult for novice programmers, especially in Python: I can help if you are just getting started with programming. 
I could also be useful if your data currently lives in something like Excel and you want to analyze and visualize your data in a reproducible, efficient way.', u'\n', u' ', u' ', u'\n ', u'\n \n \n \n \n', u'\n ', u'\n See more ', u'\n', u'\n \n \n \n', u' ', u'\n\n \n ', u'\n', u'\n ', u'\n', u'\n ', u'\n ', u'\r\n ', u'\r\n \r\n ', u'\r\n\r\n\r\n ', u'\r\n ', u'\r\n', u'\r\n\r\n\r\n ', u'\r\n ', u'\r\n ', u'\r\n ', u'\r\n ', u'\n ', u'\n\n \n ', u'\n ', u'About', u'\xa0|\xa0', u'Contact', u'\xa0|\xa0', u'FAQ', u'\xa0|\xa0', u'Location', u'\xa0|\xa0', u'Work for Us', u'\n ', u'\n', u'\n', u'\n\n \n ', u'\n ', u'D-Lab | University of California, Berkeley | 350 Barrows Hall Berkeley, CA 94720-3030 | ', u'dlab@berkeley.edu', u'\n ', u'\n', u'\n ', u'\n\r\n ', u'\r\n \r\n ', u'\r\n ', u'\n ', u'\n\n \n ', u'\n ', u'\xa0\xa0', u'\n ', u'\n', u'\n ', u'\n ', u'\r\n \r\n ', u'\r\n ', u'\n ', u'\n\n ', u'Connect with us', u'\n \n ', u'\n ', u'\xa0', u'Facebook', u'\n', u'\xa0', u'Twitter', u'\n', u'\xa0', u'RSS', u'\n ', u'\n', u'\n ', u'\n ', u'\r\n \r\n ', u'\r\n ', u'\r\n ', u'\r\n\r\n\r\n ', u'\n', u'\n']

don-han commented 8 years ago

@louienicholaslee and @alvinwan:

The list of strings above will be condensed to the following list of strings.

[u'Skip to main content', u'Intelligent research design for data intensive social science', u'About', u'About D-Lab', u'Staff', u'Dav Clark', u'FAQ', u'Contact Us', u'Join Us', u'Donate', u'Services', u'Training', u'Past Trainings', u'Consulting', u'Working Groups', u'Space', u'Resources', u'Data Resources', u'Campus Resources', u'Course List', u'Blog & Events', u'Blog', u'Campus Events', u'Calendar', u'Intelligent research design for the age of data intensive social science.', u'Helping social scientists collect, process, and visualize data', u"Are you starting research or working on a project that uses data? Are you a data visualization expert looking for access to new data sets? D-Lab's collaborative environment caters to many types of data needs. The tools, methods, and techniques that D-Lab provides offer social scientists the ability to engage with complex research questions and produce answers that benefit academic colleagues, policymakers, and the public.", u'A flexible infrastructure of hardware, software, and above all human talent', u'D-Lab is a new lab that aims to provide services, support, and a venue for research design and experimentation in data-intensive social sciences.', u'Supporting research instruction wherever it occurs', u'Researchers learn about new data, software and techniques in classrooms and lecture halls, but they also learn in online courses, through webinars, at personalized workshops, during seminars and brownbags, and through one-on-one consultations and discussions. 
D-Lab seeks to support those learning interactions, wherever and however they take place.', u'Previous', u'Pause', u'Next', u'Upcoming Trainings', u'14-Jan-16 10:00am', u'INTENSIVE: QDA Day 4 - From Coding Qualitative Data to Analyzing It', u'14-Jan-16 12:30pm', u'INTENSIVE: Stata', u'14-Jan-16 1:00pm', u'INTENSIVE: R for Data Science Day 3 (analyzing data)', u'See more', u'Sign Up for Our Mailing List', u'Keep up to date about the latest events, trainings, and news from the D-Lab!', u'Email', u'*', u'Happy Holidays!', u'The D-Lab will be closed December 21 thru January 1.', u'See our', u'calendar', u'for details of upcoming workshops.', u'Blog', u'The Season for Sharing Data: Working with the newly released Census 2010-2014 ACS 5 year data in R', u'On December 3, 2015 the U.S. Census Bureau released the 2010-2014 5 year ACS (American Community Survey) data.', u'Are you an R Hero? Join our Team!', u'The D-Lab is hiring\xa0R\xa0instructors\xa0for the Spring semester to teach beginning and intermediate classes in data visualization and analysis!', u'See more', u'Featured Working Group', u'Research in Practice Working Group', u"So you\u2019ve got some of your graduate classes under your belt, and it's time to begin an original research project.\xa0 But how exactly are you going to go about surveying voters in Tanzania, interviewi", u'TBD for Spring 2016, Kickoff meeting 12/3/15 at 4pm', u'See more', u'Featured Consultant', u'Kunal Marwaha', u'Python (Beginning, Data Cleaning/Preparation, Workflow Automation)', u'Mondays 1-3pm, and by appointment', u'I love to teach and consult for novice programmers, especially in Python: I can help if you are just getting started with programming. 
I could also be useful if your data currently lives in something like Excel and you want to analyze and visualize your data in a reproducible, efficient way.', u'See more', u'About', u'|', u'Contact', u'|', u'FAQ', u'|', u'Location', u'|', u'Work for Us', u'D-Lab | University of California, Berkeley | 350 Barrows Hall Berkeley, CA 94720-3030 |', u'dlab@berkeley.edu', u'Connect with us', u'Facebook', u'Twitter', u'RSS']
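One way to get from a raw list like the earlier dump to a condensed list like this one could look like the sketch below. It is not necessarily the cleaning that was actually run: the noise patterns are assumptions based on the `@import url(...)` and `<![CDATA[` entries visible in the raw dump, and the function name is made up.

```python
import re

# Assumed markers for non-content strings, taken from the raw dump.
NOISE_PATTERNS = (
    re.compile(r"@import\s+url\("),  # inline stylesheet imports
    re.compile(r"<!\[CDATA\["),      # CDATA wrappers left over from scripts
)


def clean_strings(raw_strings):
    """Strip whitespace, then drop empty strings and known noise."""
    cleaned = []
    for s in raw_strings:
        # str.strip() also removes leading/trailing non-breaking spaces.
        text = s.strip()
        if not text:
            continue
        if any(p.search(text) for p in NOISE_PATTERNS):
            continue
        cleaned.append(text)
    return cleaned
```

Note that this sketch would keep the page title (`Home | D-Lab`), which does not appear in the condensed list, so the actual cleaning evidently excludes a bit more than these two patterns.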

If the body of text is easier to process than a list, I can join the strings into the following:

u"Skip to main content Intelligent research design for data intensive social science About About D-Lab Staff Dav Clark FAQ Contact Us Join Us Donate Services Training Past Trainings Consulting Working Groups Space Resources Data Resources Campus Resources Course List Blog & Events Blog Campus Events Calendar Intelligent research design for the age of data intensive social science. Helping social scientists collect, process, and visualize data Are you starting research or working on a project that uses data? Are you a data visualization expert looking for access to new data sets? D-Lab's collaborative environment caters to many types of data needs. The tools, methods, and techniques that D-Lab provides offer social scientists the ability to engage with complex research questions and produce answers that benefit academic colleagues, policymakers, and the public. A flexible infrastructure of hardware, software, and above all human talent D-Lab is a new lab that aims to provide services, support, and a venue for research design and experimentation in data-intensive social sciences. Supporting research instruction wherever it occurs Researchers learn about new data, software and techniques in classrooms and lecture halls, but they also learn in online courses, through webinars, at personalized workshops, during seminars and brownbags, and through one-on-one consultations and discussions. D-Lab seeks to support those learning interactions, wherever and however they take place. Previous Pause Next Upcoming Trainings 14-Jan-16 10:00am INTENSIVE: QDA Day 4 - From Coding Qualitative Data to Analyzing It 14-Jan-16 12:30pm INTENSIVE: Stata 14-Jan-16 1:00pm INTENSIVE: R for Data Science Day 3 (analyzing data) See more Sign Up for Our Mailing List Keep up to date about the latest events, trainings, and news from the D-Lab! Email * Happy Holidays! The D-Lab will be closed December 21 thru January 1. See our calendar for details of upcoming workshops. 
Blog The Season for Sharing Data: Working with the newly released Census 2010-2014 ACS 5 year data in R On December 3, 2015 the U.S. Census Bureau released the 2010-2014 5 year ACS (American Community Survey) data. Are you an R Hero? Join our Team! The D-Lab is hiring\xa0R\xa0instructors\xa0for the Spring semester to teach beginning and intermediate classes in data visualization and analysis! See more Featured Working Group Research in Practice Working Group So you\u2019ve got some of your graduate classes under your belt, and it's time to begin an original research project.\xa0 But how exactly are you going to go about surveying voters in Tanzania, interviewi TBD for Spring 2016, Kickoff meeting 12/3/15 at 4pm See more Featured Consultant Kunal Marwaha Python (Beginning, Data Cleaning/Preparation, Workflow Automation) Mondays 1-3pm, and by appointment I love to teach and consult for novice programmers, especially in Python: I can help if you are just getting started with programming. I could also be useful if your data currently lives in something like Excel and you want to analyze and visualize your data in a reproducible, efficient way. See more About | Contact | FAQ | Location | Work for Us D-Lab | University of California, Berkeley | 350 Barrows Hall Berkeley, CA 94720-3030 | dlab@berkeley.edu Connect with us Facebook Twitter RSS"
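Joining the cleaned strings is a one-liner. If the text is headed for LDA, it will eventually be turned into word counts, so the sketch below also shows a very simple tokenizer. Both helper names and the token pattern (lowercase letters, with hyphenated words such as "D-Lab" kept together) are assumptions for illustration only.

```python
import re
from collections import Counter


def join_strings(strings):
    """Join the cleaned strings into one body of text, separated by spaces."""
    return " ".join(strings)


def word_counts(text):
    """Count lowercase word tokens; hyphenated words stay in one piece."""
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    return Counter(tokens)
```

For example, `word_counts(join_strings(cleaned))` gives a bag-of-words view of a page, where terms such as "consulting" can be compared across institutions.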

Let me know if you think I am missing some data from the above post, or if there are more strings I should remove.

alvinwan commented 8 years ago

@don-han Nice, LGTM.