CottageLabs / OpenArticleGauge

Software for the OpenArticleGauge service
http://www.howopenisit.org

be more careful with the HTML stripping #79

Open emanuil-tolev opened 10 years ago

emanuil-tolev commented 10 years ago

I got an awful lot of “free-to-read” results back where a license should have been found. Part of the problem is that the free-to-read statements are often badges without any visible text, and I suspect bleach is reducing them to a single whitespace character.

Could add a check and bail if all that remains after string normalisation and then bleach is '' or ' '.
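A minimal sketch of that check, under some assumptions: the names `strip_statement` and `is_blank_after_stripping` are hypothetical and not taken from the OpenArticleGauge code, and the standard library's `html.parser` stands in for bleach so the example runs without extra dependencies. It also collapses whitespace after stripping rather than before, so '' and ' ' both come out as ''.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes and drops every tag (a stand-in for bleach)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Only text between tags arrives here; attribute values such as
        # title="..." are not treated as text.
        self.parts.append(data)


def strip_statement(statement):
    """Strip all tags, then collapse runs of whitespace to single spaces."""
    parser = _TextOnly()
    parser.feed(statement)
    parser.close()
    return " ".join("".join(parser.parts).split())


def is_blank_after_stripping(statement):
    """True when nothing visible is left, i.e. the matcher should bail."""
    return strip_statement(statement) == ""
```

A caller would skip the stripped comparison for any license statement where `is_blank_after_stripping` returns True, instead of letting an empty string match every page.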

This should, theoretically, be enough for statements which contain only HTML. It won't help much when the statements contain a couple of letters or a single word, though. So, additionally, a length check (10 chars?) and an only-1-or-2-words-is-unacceptable check could be added (only when HTML is stripped, which is only when an exact match fails).
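The extra guards could look something like the sketch below. The function name `is_too_short_to_match` is hypothetical, and the thresholds are assumptions: the 10 characters floated above is read as "fewer than 10 is unacceptable", and 3 words is the minimum because 1 or 2 words are to be rejected. It expects text that has already had its HTML stripped.

```python
def is_too_short_to_match(stripped, min_chars=10, min_words=3):
    """Decide whether an HTML-stripped statement is too short to trust.

    Both thresholds are assumptions for illustration, not values taken
    from the OpenArticleGauge code.
    """
    # Collapse whitespace so '' and ' ' are handled the same way.
    text = " ".join(stripped.split())
    if not text:
        return True
    if len(text) < min_chars:
        return True
    # After normalisation, words are separated by exactly one space.
    return len(text.split(" ")) < min_words
```

This would only be called on the fallback path, after the exact match has failed and the HTML has been stripped, so exact matches on short statements are unaffected.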

cameronneylon commented 10 years ago

Here's an example of one potentially problematic case:

<a href="http://pubs.acs.org/page/policy/authorchoice/index.html" title="Learn more about ACS AuthorChoice">

cameronneylon commented 10 years ago

And here's an example of something going wrong: http://oag.cottagelabs.com/lookup/10.1210/en.2012-1913

10.1210/en.2012-1913 ++ Free to Read (free-to-read)

License decided by scraping the resource at http://press.endocrine.org/doi/abs/10.1210/en.2012-1913 and looking for the following license statement: "".

BY: null. NC: null. SA: null. ND: null. OKD compliant? undefined. OSI compliant? undefined

Learn more about this license at undefined

We retrieved this information from http://press.endocrine.org/doi/abs/10.1210/en.2012-1913.

Last checked on 2014-04-13T21:01:53Z.

License detected by generic_string_matcher 0.1 plugin

cameronneylon commented 10 years ago

In the short term I've just deleted all of these from the flat file of licenses along with all the free-to-read statements. That seems to clear up most of the issues.