ContentMine / journal-scrapers

Journal scraper definitions for the ContentMine framework

elife scraper no longer extracts any information #51

Open rossmounce opened 7 years ago

rossmounce commented 7 years ago

It's not clear to me what's causing the problem here. eLife landing pages still appear to have the same meta tags. Perhaps HTTPS is the issue? (A total guess.)
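One way to check the meta-tag part of that guess independently of quickscrape is to list the `<meta>` tags in a saved copy of the landing page. The sketch below is only an illustration using Python's standard library: the `citation_` prefix is an assumption about which tags the scraper relies on, and `SAMPLE_HTML` is a hypothetical stand-in with placeholder values, not real eLife markup.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect (name, content) pairs from <meta> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attr = dict(attrs)
        # Some pages use property="..." (e.g. Open Graph) instead of name="...".
        name = attr.get("name") or attr.get("property")
        if name:
            self.tags.append((name, attr.get("content") or ""))


def list_meta_tags(html, prefix=""):
    """Return the (name, content) pairs whose name starts with `prefix`."""
    collector = MetaTagCollector()
    collector.feed(html)
    return [(n, c) for n, c in collector.tags if n.startswith(prefix)]


# Hypothetical sample standing in for a saved landing page (placeholder values).
SAMPLE_HTML = """
<html><head>
  <meta name="citation_title" content="Example title">
  <meta name="citation_doi" content="10.0000/example">
  <meta name="viewport" content="width=device-width">
</head><body></body></html>
"""

if __name__ == "__main__":
    # Assumption: the tags of interest share the "citation_" prefix.
    for name, content in list_meta_tags(SAMPLE_HTML, prefix="citation_"):
        print(f"{name}: {content}")
```

If the tags show up in a saved copy of the page but quickscrape still captures nothing, that would point away from the page markup and towards how the page is fetched or how the selectors are evaluated.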

$ quickscrape -V
0.4.7 #tarrow's version
$ quickscrape --url https://elifesciences.org/content/5/e16800 --scraper journal-scrapers/scrapers/elife.json --output elife
info: quickscrape 0.4.7 launched with...
info: - URL: https://elifesciences.org/content/5/e16800
info: - Scraper: /home/ross/Downloads/pica/journal-scrapers/scrapers/elife.json
info: - Rate limit: 3 per minute
info: - Log level: info
info: urls to scrape: 1
info: processing URL: https://elifesciences.org/content/5/e16800
info: URL processed: captured 0/19 elements (19 captures failed)
info: all tasks completed
$ cat elife/https_elifesciences.org_content_5_e16800/results.json
{
  "publisher": {
    "value": []
  },
  "journal": {
    "value": []
  },
  "title": {
    "value": []
  },
  "authors": {
    "value": []
  },
  "date": {
    "value": []
  },
  "doi": {
    "value": []
  },
  "volume": {
    "value": []
  },
  "issue": {
    "value": []
  },
  "firstpage": {
    "value": []
  },
  "description": {
    "value": []
  },
  "abstract": {
    "value": []
  },
  "fulltext_html": {
    "value": []
  },
  "fulltext_pdf": {
    "value": []
  },
  "fulltext_xml": {
    "value": []
  },
  "supplementary_material": {
    "value": []
  },
  "figure": {
    "value": []
  },
  "figure_caption": {
    "value": []
  },
  "license": {
    "value": []
  },
  "copyright": {
    "value": []
  }