ckreibich / scholar.py

A parser for Google Scholar, written in Python

Added extraction of url_pdf from right hand side [PDF] link. #95

Open pmdscully opened 7 years ago

pmdscully commented 7 years ago

This change extracts the [PDF] href value from the right-hand side of a Google Scholar article entry. It records the URL as url_pdf only if the article's url_pdf has not already been filled and Google Scholar labels the link as a PDF (i.e. the element's text is [PDF]).
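The rule described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: it uses the standard library's `html.parser` rather than scholar.py's own parsing, and `PdfLinkFinder`, `fill_url_pdf`, the dict-based article, and the sample markup are all hypothetical. It also assumes the anchor text may carry a trailing site name, so it checks that the label starts with [PDF].

```python
from html.parser import HTMLParser


class PdfLinkFinder(HTMLParser):
    """Records the href of the first anchor whose text starts with [PDF]."""

    def __init__(self):
        super().__init__()
        self.url_pdf = None
        self._href = None  # href of the anchor currently being read
        self._text = []    # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = "".join(self._text).strip()
            # Assumption: the link is a PDF only if its label starts with [PDF].
            if self.url_pdf is None and label.startswith("[PDF]"):
                self.url_pdf = self._href
            self._href = None


def fill_url_pdf(article, entry_html):
    """Set article['url_pdf'] from a [PDF] link unless it is already filled."""
    if article.get("url_pdf"):
        return article
    finder = PdfLinkFinder()
    finder.feed(entry_html)
    if finder.url_pdf:
        article["url_pdf"] = finder.url_pdf
    return article


# Hypothetical markup; Google Scholar's real HTML is not reproduced here.
sample = (
    '<div><a href="https://example.org/paper.pdf">'
    '<span>[PDF]</span> example.org</a></div>'
    '<h3><a href="https://example.org/abstract">Some title</a></h3>'
)
print(fill_url_pdf({"title": "Some title", "url_pdf": None}, sample)["url_pdf"])
```

An existing url_pdf value is left untouched, which matches the "hasn't already been filled" condition in the description.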

Test: scholar.py -c 10 --txt --author "einstein" --phrase "quantum"

Pre-change: 0/4 PDF links extracted

Post-change: 4/4 PDF links extracted

As far as I am aware, Google Scholar's [PDF] label is the best easily available indicator of whether the (optional) right-hand side anchor refers to a PDF file.