ContentMine / journal-scrapers

Journal scraper definitions for the ContentMine framework

ACS, AAAS, Springer, Taylor & Francis, and Wiley scrapers #23

Closed: pbulsink closed this 9 years ago

pbulsink commented 9 years ago

Not all of these work yet: the ones that fail require quickscrape to follow relative links (expected in v1.0).
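To illustrate what "following relative links" involves, here is a minimal Python sketch that resolves an href found on an article page against that page's URL. It is not how quickscrape implements it; the function name `resolve_link` and the example URLs are hypothetical and only there to show the idea.

```python
from urllib.parse import urljoin


def resolve_link(page_url: str, href: str) -> str:
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


# Hypothetical article page and hrefs, for illustration only.
page = "https://journal.example.org/doi/full/10.1000/abc123"

# Root-relative link: keeps the host, replaces the whole path.
print(resolve_link(page, "/doi/pdf/10.1000/abc123"))
# Path-relative link: resolved against the directory of the page.
print(resolve_link(page, "suppl/data.zip"))
# Already-absolute link: returned unchanged.
print(resolve_link(page, "https://cdn.example.org/fig1.png"))
```

A scraper definition that captures a relative `href` for a PDF or supplementary file only yields a usable download once it has been resolved this way.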

coveralls commented 9 years ago

Coverage Status

Changes Unknown when pulling d99cece8adb272b2f0dc7282c68057cf547c9b48 on pbulsink:ready_to_pull into ContentMine:master.

blahah commented 9 years ago

You are awesome :)

blahah commented 9 years ago

How come you removed the Nature one, btw?

pbulsink commented 9 years ago

Nature seems to crash the scraper when you run it without access. I'm not sure why. The behavior varies: sometimes it just hangs, and sometimes it throws an error. Running with loglevel verbose doesn't explain it either.

I'm not sure whether it's a scraper.json issue or a quickscrape issue, so I removed it before the pull request to keep the tests passing, but I've documented what I tried.
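One way to tell the two failure modes apart (a hang versus an error) is to run the scrape command under a timeout and classify the result. The sketch below is a generic Python wrapper, not part of quickscrape; `classify_run` is a hypothetical helper, and the demo commands are stand-ins that only simulate a hang and a failure. In practice you would pass whatever scraper command line you normally use.

```python
import subprocess
import sys


def classify_run(cmd, timeout_s=60.0):
    """Run cmd and return ('ok' | 'error' | 'hang', captured stderr)."""
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # The child was killed after timeout_s seconds without exiting.
        return "hang", ""
    if proc.returncode != 0:
        return "error", proc.stderr
    return "ok", proc.stderr


# Stand-in commands (assumptions for the demo, not real scraper invocations).
demo_hang = [sys.executable, "-c", "import time; time.sleep(30)"]
demo_error = [sys.executable, "-c", "import sys; sys.exit('boom')"]

if __name__ == "__main__":
    print(classify_run(demo_hang, timeout_s=1)[0])
    print(classify_run(demo_error)[0])
```

Recording the status and stderr for each run makes it easier to see whether the hang and the error correlate with anything, such as a particular URL or time of day.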

pbulsink commented 9 years ago

Taylor and Francis sometimes fails tests when the site complains that it cannot set cookies. Instead of rendering the article page, it returns an error page:

    ...
    <h1>An Error Occurred Setting Your User Cookie</h1>
    <p>This site uses cookies to improve performance. If your browser does not accept cookies, you cannot view this site.</p>
    ...

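
A test harness could recognize this error page and report it separately instead of counting it as a scraper failure. The following is a minimal Python sketch of that check; the helper name `is_cookie_error_page` is hypothetical, and the marker string is taken from the error page quoted above.

```python
# Heading text copied from the error page quoted in this thread.
COOKIE_ERROR_MARKER = "An Error Occurred Setting Your User Cookie"


def is_cookie_error_page(html: str) -> bool:
    """Return True if the fetched HTML looks like the cookie error page."""
    return COOKIE_ERROR_MARKER.lower() in html.lower()


# Sample bodies for illustration; the second one is made up.
error_html = "<h1>An Error Occurred Setting Your User Cookie</h1>"
article_html = "<h1>Some Article Title</h1><p>Abstract text</p>"

print(is_cookie_error_page(error_html))    # True
print(is_cookie_error_page(article_html))  # False
```

A check like this only detects the problem; it does not make the site accept the session, so the underlying cookie handling still needs to be solved on the fetching side.
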
blahah commented 9 years ago

OK, looks like T&F will have to be headless. The upcoming version of scraperJSON will allow setting headless on/off.

blahah commented 9 years ago

Could you make a separate PR with the nature scraper so I can debug it cleanly?

pbulsink commented 9 years ago

Nature is in pull request 24: https://github.com/ContentMine/journal-scrapers/pull/24

coveralls commented 9 years ago

Coverage Status

Changes Unknown when pulling b6ba671d2f4e666f35095f31207df5701635a6ba on pbulsink:ready_to_pull into ContentMine:master.

coveralls commented 9 years ago

Coverage Status

Changes Unknown when pulling 159ec314ef69a24bcf50a48be98e44153c935831 on pbulsink:ready_to_pull into ContentMine:master.

coveralls commented 9 years ago

Coverage Status

Changes Unknown when pulling e6c738d6b942be85b02f5758c8aa3e65c9d59800 on pbulsink:ready_to_pull into ContentMine:master.