Open rob-nyu opened 10 years ago
First pass through gave us the text from the following tags:
- title
- h1
- h2
- h3
- strong
- b
- a
- img: gives the "alt" and "title" of an image
- meta_description: gives the description as written by the webpage author
- meta_keywords: gives the keywords as given by the webpage author
- boilerplate: from the training data
- summary: uses the boilerpipe package on all page content to get a summary
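
For reference, here is a minimal sketch of how the per-tag text in the list above could be pulled out of a raw page. It is not the code used for the first pass: it uses only Python's standard-library `html.parser`, and the names `TagTextExtractor`, `extract_tag_text`, and `TEXT_TAGS` are hypothetical. The `boilerplate` and `summary` fields are left out because they come from the training data and from boilerpipe, not from tag parsing.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags whose inner text is collected (assumed from the list above).
TEXT_TAGS = {"title", "h1", "h2", "h3", "strong", "b", "a"}


class TagTextExtractor(HTMLParser):
    """Collects text per tag, plus img alt/title and meta description/keywords."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fields = defaultdict(list)
        self._open = []  # stack of currently open tags from TEXT_TAGS

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in TEXT_TAGS:
            self._open.append(tag)
        elif tag == "img":
            # Keep the "alt" and "title" attributes of each image.
            for key in ("alt", "title"):
                if attrs.get(key):
                    self.fields["img"].append(attrs[key].strip())
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("description", "keywords") and attrs.get("content"):
                self.fields["meta_" + name].append(attrs["content"].strip())

    def handle_endtag(self, tag):
        if tag in self._open:
            # Pop back to (and including) the most recent matching open tag.
            while self._open and self._open.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Nested tags (e.g. <h1><a>..</a></h1>) each receive the text.
            for tag in set(self._open):
                self.fields[tag].append(text)


def extract_tag_text(html):
    """Return a dict mapping field name to its space-joined text."""
    parser = TagTextExtractor()
    parser.feed(html)
    parser.close()
    return {name: " ".join(parts) for name, parts in parser.fields.items()}
```

Calling `extract_tag_text(raw_html)` on one page returns a dict with one entry per field that was found. Paragraph (`p`) content is deliberately not collected here.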
Will look into extracting all paragraph content.
Extract text from HTML tags in the raw data.