ckreibich / scholar.py

A parser for Google Scholar, written in Python

Running the bibtex example with python3 from the README.md yields no result #101

Open jessebrennan opened 6 years ago

jessebrennan commented 6 years ago

After adding debug flags, we see:

$ python3 scholar.py -c 1 --author "albert einstein" --phrase "quantum theory" --citation bt -ddd
[ INFO]  using log level 3
[ INFO]  requesting http://scholar.google.com/scholar_settings?sciifh=1&hl=en&as_sdt=0,5
[ INFO]  parsing settings failed: no form
[ INFO]  requesting http://scholar.google.com/scholar?as_q=&as_epq=quantum theory&as_oq=&as_eq=&as_occt=any&start=&as_sauthors=albert einstein&as_publication=&as_ylo=&as_yhi=&as_vis=0&btnG=&hl=en&num=1&as_sdt=0,5

$

The first example seems to work fine. AFAIK, this error seems to be due to a layout change on the Google Scholar website.

brittAnderson commented 6 years ago

I can confirm that this fails for all the citation types and breaks the ability to use the command to export citations. It must be something in the parser, but I don't have the skills to know where to start puzzling it out. If you give me some hints, I could investigate further.

portalgun commented 6 years ago

The settings page on Google Scholar has changed. Line 985 needs to be changed to:
tag = soup.find(name='form', attrs={'id': 'gs_bdy_frm'})

After inserting this, scholar settings are successfully saved. However, it then returns:

Traceback (most recent call last):
  File "/home/dambam/bin/papers/scholar.py", line 1311, in <module>
    sys.exit(main())
  File "/home/dambam/bin/papers/scholar.py", line 1301, in main
    citation_export(querier)
  File "/home/dambam/bin/papers/scholar.py", line 1146, in citation_export
    print(art.as_citation() + '\n')

This can be fixed by changing line 1145 to:
print(art.as_citation() + "\n".encode('ascii'))

scholar.py --citation=bt -a "einstein" will then return: b'@article{einstein1935can,\n title={Can quantum-mechanical description of physical reality be considered complete?},\n author={Einstein, Albert and Podolsky, Boris and Rosen, Nathan},\n journal={Physical review},\n volume={47},\n number={10},\n pages={777},\n year={1935},\n publisher={APS}\n}\n\n'
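
For anyone wondering why the original line fails under Python 3 and why the patched line prints a b'...' literal, here is a minimal sketch. It assumes that as_citation() returns bytes, which the output above suggests; the sample value is a shortened stand-in, not real scholar.py output.

```python
# Stand-in for what art.as_citation() appears to return under Python 3 (assumption).
citation = b"@article{einstein1935can,\n  year={1935}\n}\n"

# Python 3 refuses to concatenate bytes and str.
try:
    citation + "\n"
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Encoding the newline makes both operands bytes, so the concatenation works...
patched = citation + "\n".encode("ascii")

# ...but print() on a bytes object shows its repr, with the escapes left in.
shown = repr(patched)
print(patched)
```

So the patch removes the exception, but the newlines stay as literal \n sequences in the printed output.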

portalgun commented 6 years ago

Also, if you have done too many searches too quickly, Google Scholar will ask for a CAPTCHA. scholar.py won't explicitly indicate that anything is wrong in this situation, other than printing nothing. If you are getting at least blank lines, then you probably have the problem described in my last comment.
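
A rough sketch of how a script could report this case instead of silently printing nothing. The function names and the marker strings are hypothetical, chosen for illustration; they are not part of scholar.py and not a documented Google Scholar contract.

```python
def looks_blocked(html: str) -> bool:
    """Heuristic guess at whether a response is a CAPTCHA page.

    The marker strings below are assumptions for illustration only.
    """
    lowered = html.lower()
    markers = ("captcha", "unusual traffic")  # hypothetical markers
    return any(marker in lowered for marker in markers)


def check_response(html: str) -> str:
    """Return a human-readable status for a fetched page."""
    if not html.strip():
        return "empty response"
    if looks_blocked(html):
        return "possible CAPTCHA page; wait before retrying"
    return "ok"


print(check_response("<html>Please solve the CAPTCHA</html>"))
```

A check like this would only be a hint, since the real page content may differ from the assumed markers.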

jessebrennan commented 6 years ago

@portalgun Do you want to make a PR for these changes? Even though no one merges them it could be helpful for other people who run into the problem to just download the patch. If not, I can gladly make the PR.

orangewords commented 6 years ago

Thanks to those of you who have suggested solutions. I've tried the two edits you suggested but am facing all of these traceback errors.

---------------------------------------------------------------------------
BadOptionError                            Traceback (most recent call last)
//anaconda/lib/python3.6/optparse.py in parse_args(self, args, values)
   1386         try:
-> 1387             stop = self._process_args(largs, rargs, values)
   1388         except (BadOptionError, OptionValueError) as err:

//anaconda/lib/python3.6/optparse.py in _process_args(self, largs, rargs, values)
   1430                 # value(s) for the last one only)
-> 1431                 self._process_short_opts(rargs, values)
   1432             elif self.allow_interspersed_args:

//anaconda/lib/python3.6/optparse.py in _process_short_opts(self, rargs, values)
   1512             if not option:
-> 1513                 raise BadOptionError(opt)
   1514             if option.takes_value():

BadOptionError: no such option: -f

During handling of the above exception, another exception occurred:

SystemExit                                Traceback (most recent call last)

in ()
   1269
   1270 if __name__ == "__main__":
-> 1271     sys.exit(main())

in main()
   1180     parser.add_option_group(group)
   1181
-> 1182     options, _ = parser.parse_args()
   1183
   1184     # Show help if we have neither keyword search nor author name

//anaconda/lib/python3.6/optparse.py in parse_args(self, args, values)
   1387             stop = self._process_args(largs, rargs, values)
   1388         except (BadOptionError, OptionValueError) as err:
-> 1389             self.error(str(err))
   1390
   1391         args = largs + rargs

//anaconda/lib/python3.6/optparse.py in error(self, msg)
   1567         """
   1568         self.print_usage(sys.stderr)
-> 1569         self.exit(2, "%s: error: %s\n" % (self.get_prog_name(), msg))
   1570
   1571     def get_usage(self):

//anaconda/lib/python3.6/optparse.py in exit(self, status, msg)
   1557         if msg:
   1558             sys.stderr.write(msg)
-> 1559         sys.exit(status)
   1560
   1561     def error(self, msg):

SystemExit: 2

yskale commented 6 years ago

optparse is having a problem; reinstall it.

orangewords commented 6 years ago

Thanks. That solved the problem, but now I am back to getting a syntax error when I try to run a test with the sample query. scholar.py --citation=bt -a "einstein" returns:

  File "<ipython-input-7-46a87b6d443b>", line 1
    scholar.py --citation=bt -a "einstein"
    ^
SyntaxError: invalid syntax

orangewords commented 6 years ago

Think I've got it working! Thanks again

hugues-talbot commented 6 years ago

This is not working again. I've applied the suggested changes, and the URL is malformed, e.g.:

[ INFO] requesting citation data failed: HTTP Error 404: Not Found
[ INFO] retrieving citation export data
[ INFO] requesting http://scholar.google.com/https://scholar.googleusercontent.com/scholar.bib?q=info:kpSD9apcVf8J:scholar.google.com/&output=citation&scisig=AAGBfm0AAAAAWhmheR7o8h0e2pro0VQ3wwSoU5_DiIIu&scisf=4&ct=citation&cd=8&hl=en

Clearly the leading "http://scholar.google.com/" is superfluous. Copying and pasting the https://scholar.googleusercontent.com... part on its own works, though, so we are not far off.

jessebrennan commented 6 years ago

@hugues-talbot To fix this issue, change https://github.com/ckreibich/scholar.py/blob/master/scholar.py#L515 to

if path.startswith('http://') or path.startswith('https://'):
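
As a standalone illustration of what that check is meant to achieve, here is a small sketch. The names resolve_url and BASE_URL are hypothetical and not taken from scholar.py; only the startswith test mirrors the suggested line, and the base value is the prefix seen in the log output above.

```python
BASE_URL = "http://scholar.google.com"  # prefix seen in the log output above


def resolve_url(path: str, base: str = BASE_URL) -> str:
    """Return path unchanged if it is already absolute, else prefix the base."""
    if path.startswith("http://") or path.startswith("https://"):
        return path
    # Relative path: join with exactly one slash between base and path.
    return base.rstrip("/") + "/" + path.lstrip("/")


absolute = resolve_url("https://scholar.googleusercontent.com/scholar.bib?q=x")
relative = resolve_url("/scholar?q=x")
print(absolute)
print(relative)
```

With the check in place, an https:// link is passed through untouched instead of being glued onto the Scholar host.
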
SvennoNito commented 4 years ago

To keep the newline (\n) working, I had to change line 1145 to use

art.as_citation().decode("utf-8") + "\n"

and NOT to

print(art.as_citation() + "\n".encode('ascii'))

My system runs under Windows 10.
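
A small sketch of the decode approach, for illustration. The helper name citation_text is hypothetical, and the sample bytes are a shortened stand-in for what as_citation() appears to return; UTF-8 is assumed as the encoding, as in the line above.

```python
def citation_text(raw) -> str:
    """Return a citation as str, decoding it first if it arrived as bytes."""
    if isinstance(raw, bytes):
        return raw.decode("utf-8")  # assumes the export is UTF-8 encoded
    return raw


sample = b"@article{einstein1935can,\n  year={1935}\n}\n"
output = citation_text(sample) + "\n"
print(output)
```

Because the result is a str, print() emits real line breaks rather than a b'...' literal.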