SvenAG / SNLP-Final-Project

SNLP Final Project

HTML Extraction #7

Open rob-nyu opened 10 years ago

rob-nyu commented 10 years ago

Extract text from HTML tags in the raw data.

rob-nyu commented 10 years ago

First pass through gave us the text from the following tags:

title
h1
h2
h3
strong
b
a
img - gives "alt" and "title" of an image
meta_description - gives description as written by webpage author
meta_keywords - gives keywords as given by webpage author
boilerplate - from the training data
summary - using boilerpipe package on all page content to get a summary
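
For reference, a minimal sketch of how the tag-level fields above could be pulled out, assuming Python and only the standard-library `html.parser` module. This is not necessarily how the project does it; the names `TagTextExtractor` and `extract_fields` are hypothetical, and the `boilerplate` and `summary` fields are not covered here.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags whose inner text is collected (taken from the list above).
TEXT_TAGS = {"title", "h1", "h2", "h3", "strong", "b", "a"}


class TagTextExtractor(HTMLParser):
    """Hypothetical helper: collects text per tag, img alt/title, and meta fields."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fields = defaultdict(list)
        self._open = []  # stack of currently open tags from TEXT_TAGS

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in TEXT_TAGS:
            self._open.append(tag)
        elif tag == "img":
            # Keep the "alt" and "title" attributes of each image.
            for key in ("alt", "title"):
                if attrs.get(key):
                    self.fields["img"].append(attrs[key])
        elif tag == "meta":
            # Keep <meta name="description"> and <meta name="keywords">.
            name = (attrs.get("name") or "").lower()
            if name in ("description", "keywords") and attrs.get("content"):
                self.fields["meta_" + name].append(attrs["content"])

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        if tag in self._open:
            while self._open.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Attribute the text to every enclosing tag of interest.
            for tag in set(self._open):
                self.fields[tag].append(text)


def extract_fields(html):
    """Return a dict mapping field name to the list of strings found."""
    parser = TagTextExtractor()
    parser.feed(html)
    parser.close()
    return dict(parser.fields)
```

Text nested inside several tags of interest (for example a `strong` inside an `a`) is recorded under each of them, which may or may not be the behaviour wanted for features.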

rob-nyu commented 10 years ago

Will look into extracting all paragraph content.
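
One possible starting point for the paragraph extraction, again only a sketch that assumes Python's standard-library `html.parser`; `ParagraphExtractor` and `extract_paragraphs` are hypothetical names. It treats a new `<p>` as implicitly closing the previous one, since closing `</p>` tags are optional in HTML.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Hypothetical helper: collects the text of each <p> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._buf = None  # None means "not currently inside a <p>"

    def _flush(self):
        # Store the buffered paragraph with whitespace collapsed.
        if self._buf is not None:
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
        self._buf = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # a new <p> ends any unclosed previous one
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()  # keep a trailing paragraph that was never closed


def extract_paragraphs(html):
    """Return the list of paragraph texts found in the HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

Text from inline tags inside a paragraph (such as `b` or `a`) is kept as part of that paragraph; text outside any `p` is ignored.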