ContentMine / journal-scrapers

Journal scraper definitions for the ContentMine framework
taylorfrancis.json no longer scrapes the PDF: readcube pop-up related? #54

Open rossmounce opened 7 years ago

I suspect this is down to T&F's recent adoption of ReadCube. The PDF download button now opens a pop-up where you choose between the real PDF and the ReadCube version, so the scraper no longer finds a direct PDF link (fulltext_pdf fails in the log below). I tried to fix this myself but did not succeed.

$ quickscrape --url http://dx.doi.org/10.1017/s1477201903001093 --scraper journal-scrapers/scrapers/taylorfrancis.json --output tandf -l verbose
info: quickscrape 0.4.7 launched with...
info: - URL: http://dx.doi.org/10.1017/s1477201903001093
info: - Scraper: /home/ross/Downloads/pica/journal-scrapers/scrapers/taylorfrancis.json
info: - Rate limit: 3 per minute
info: - Log level: verbose
info: urls to scrape: 1
info: processing URL: http://dx.doi.org/10.1017/s1477201903001093
debug: info [scraper]. URL rendered. http://www.tandfonline.com/doi/abs/10.1017/S1477201903001093.
debug: data [scraper]. element captured. publisher.  Taylor & Francis Group .
debug: debug [scraper]. element results. publisher.  Taylor & Francis Group .
debug: data [scraper]. element captured. journal_name. Journal of Systematic Palaeontology.
debug: debug [scraper]. element results. journal_name. Journal of Systematic Palaeontology.
debug: data [scraper]. element capture failed. volume.
debug: debug [scraper]. selector had no results. //*[@id='unit2']/div[1]/div/div/table/tbody/tr/td[1]/h3/a[1]. volume.
debug: debug [scraper]. element results. volume. .
debug: data [scraper]. element capture failed. issue.
debug: debug [scraper]. selector had no results. //*[@id='unit2']/div[1]/div/div/table/tbody/tr/td[1]/h3/a[2]. issue.
debug: debug [scraper]. element results. issue. .
debug: data [scraper]. element captured. title. Osteology and systematic position of the eocene primobucconidae (aves, coraciiformes sensu stricto), with first records from Europe.
debug: debug [scraper]. element results. title. Osteology and systematic position of the eocene primobucconidae (aves, coraciiformes sensu stricto), with first records from Europe.
debug: data [scraper]. element captured. keywords.
debug: debug [scraper]. element results. keywords. .
debug: data [scraper]. element captured. author_name.  Gerald   Mayr .
debug: data [scraper]. element captured. author_name.  Cecile   Mourer‐Chauviré .
debug: data [scraper]. element captured. author_name.  Ilka   Weidig .
debug: debug [scraper]. element results. author_name.  Gerald   Mayr , Cecile   Mourer‐Chauviré , Ilka   Weidig .
debug: data [scraper]. element captured. date_published.
debug: debug [scraper]. element results. date_published. .
debug: data [scraper]. element captured. doi. 9512127.
debug: data [scraper]. element captured. doi. 10.1017/S1477201903001093.
debug: data [scraper]. element captured. doi. Journal of Systematic Palaeontology, Vol. 2, No. 1, 2004, pp. 1-12.
debug: debug [scraper]. element results. doi. 9512127,10.1017/S1477201903001093,Journal of Systematic Palaeontology, Vol. 2, No. 1, 2004, pp. 1-12.
debug: data [scraper]. element capture failed. csv1.
debug: debug [scraper]. selector had no results. //a[@id='CSVdownloadButton'][1]. csv1.
debug: debug [scraper]. element results. csv1. .
debug: data [scraper]. element capture failed. csv2.
debug: debug [scraper]. selector had no results. //a[@id='CSVdownloadButton'][2]. csv2.
debug: debug [scraper]. element results. csv2. .
debug: data [scraper]. element capture failed. csv3.
debug: debug [scraper]. selector had no results. //a[@id='CSVdownloadButton'][3]. csv3.
debug: debug [scraper]. element results. csv3. .
debug: data [scraper]. element capture failed. csv4.
debug: debug [scraper]. selector had no results. //a[@id='CSVdownloadButton'][4]. csv4.
debug: debug [scraper]. element results. csv4. .
debug: data [scraper]. element capture failed. csv5.
debug: debug [scraper]. selector had no results. //a[@id='CSVdownloadButton'][5]. csv5.
debug: debug [scraper]. element results. csv5. .
debug: data [scraper]. element capture failed. csv6.
debug: debug [scraper]. selector had no results. //a[@id='CSVdownloadButton'][6]. csv6.
debug: debug [scraper]. element results. csv6. .
debug: data [scraper]. element captured. fulltext_html. http://dx.doi.org/10.1017/S1477201903001093.
debug: debug [scraper]. element results. fulltext_html. http://dx.doi.org/10.1017/S1477201903001093.
debug: data [scraper]. element capture failed. fulltext_pdf.
debug: debug [scraper]. selector had no results. //a[text()='PDF']. fulltext_pdf.
debug: debug [scraper]. element results. fulltext_pdf. .
debug: info [scraper]. download started. fulltext.html.
info: URL processed: captured 8/17 elements (9 captures failed)
debug: writing results to file: results.json
debug: changing back to top-level directory
info: all tasks completed
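
To make it easier to see which selectors need attention, here is a minimal Python sketch that pulls the failing selectors out of a verbose log like the one above. It assumes the line format shown in this quickscrape 0.4.7 output ("selector had no results. <xpath>. <element>."); other versions may log differently. The names failed_selectors, summary_counts and sample_log are illustrative and not part of quickscrape.

```python
import re

# Matches the "selector had no results" lines as they appear in the log above:
#   debug: debug [scraper]. selector had no results. <xpath>. <element>.
# Assumption: the element name is a single word and ends the line.
NO_RESULT = re.compile(
    r"^debug: debug \[scraper\]\. selector had no results\. "
    r"(?P<selector>.+)\. (?P<element>\w+)\.$"
)

# Matches the final tally line, e.g. "captured 8/17 elements (9 captures failed)".
SUMMARY = re.compile(r"captured (\d+)/(\d+) elements \((\d+) captures failed\)")


def failed_selectors(log_text):
    """Return {element_name: selector} for every selector that matched nothing."""
    failures = {}
    for line in log_text.splitlines():
        match = NO_RESULT.match(line.strip())
        if match:
            failures[match.group("element")] = match.group("selector")
    return failures


def summary_counts(log_text):
    """Return (captured, total, failed) from the tally line, or None if absent."""
    match = SUMMARY.search(log_text)
    return tuple(int(g) for g in match.groups()) if match else None


# A few lines copied from the log above, used as sample input.
sample_log = """\
debug: data [scraper]. element capture failed. volume.
debug: debug [scraper]. selector had no results. //*[@id='unit2']/div[1]/div/div/table/tbody/tr/td[1]/h3/a[1]. volume.
debug: data [scraper]. element capture failed. fulltext_pdf.
debug: debug [scraper]. selector had no results. //a[text()='PDF']. fulltext_pdf.
info: URL processed: captured 8/17 elements (9 captures failed)
"""

if __name__ == "__main__":
    for element, selector in failed_selectors(sample_log).items():
        print(f"{element}: {selector}")
    print(summary_counts(sample_log))
```

Run against the full log, this should list volume, issue, csv1 to csv6 and fulltext_pdf, which accounts for the 9 failed captures reported. Note that it would not flag the doi element, which "succeeds" but captures three values, only one of which is a DOI.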