ContentMine / ami

Apache License 2.0

AMI wordFrequencies doesn't recognize CSS #56

Open larsgw opened 8 years ago

larsgw commented 8 years ago

When running the standard AMI command from the tutorial (ami2-word --project PROJECTNAME -i scholarly.html --w.words wordFrequencies --w.stopwords stopwords.txt), with stopwords.txt copied from ami2-0.1-SNAPSHOT.jar, the output contains fragments of the CSS found in the style tag of scholarly.html.

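Example of the output: [image: header28] https://cloud.githubusercontent.com/assets/14018963/15632566/f80289f0-2596-11e6-98b2-b677ff9d5abe.png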

Input was the scholarly.html created with norma from PMC4350396.
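
To illustrate the expected behaviour, here is a minimal Python sketch of a word-frequency count that ignores the contents of style (and script) elements. It is not AMI's actual code (AMI is written in Java), and the names VisibleTextParser and word_frequencies are hypothetical, chosen only for this example. It assumes words are plain runs of ASCII letters, which is simpler than whatever tokenisation AMI really uses.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping anything inside <style> or <script>."""

    SKIPPED_TAGS = {"style", "script"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []      # text found outside skipped elements

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # CSS inside <style> arrives here as ordinary data, so it must be
        # filtered explicitly or it ends up in the word counts.
        if self._skip_depth == 0:
            self.chunks.append(data)


def word_frequencies(html, stopwords=frozenset()):
    """Return a Counter of lower-cased words, excluding stopwords."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    # Assumption: a "word" is a run of ASCII letters.
    words = re.findall(r"[a-z]+", " ".join(parser.chunks).lower())
    return Counter(w for w in words if w not in stopwords)
```

Calling word_frequencies(open("scholarly.html").read(), stopwords) with a set of stopwords loaded from stopwords.txt should then yield counts that contain no CSS selectors or property names from the style tag.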

petermr commented 8 years ago

Agreed this is a bug. It may be fixed in the dev branch...

Peter Murray-Rust Reader in Molecular Informatics Unilever Centre, Dep. Of Chemistry University of Cambridge CB2 1EW, UK +44-1223-763069