JabRef / jabref

Graphical Java application for managing BibTeX and biblatex (.bib) databases
https://devdocs.jabref.org
MIT License

Google search issue in version 3.6 [Fixed in DevBuilds] #1886

Closed swamptromp closed 7 years ago

swamptromp commented 8 years ago

Hello,

Although JabRef 3.6 resolved issues with Google Search, I am still having problems. Google search worked for me for a few hours (I am trying to set up my entire library, so I have been using the search function a lot), but suddenly stopped fetching any search results. In the meantime, Springer and other searches continue to work - it sounds just like the 3.5 issues described on the forum. The first time this happened, I reinstalled JabRef 3.6, and Google search started working again. The problem came up again, though, and this time reinstalling isn't solving it. Is anyone else still having Google search issues? Any suggestions?

Thanks for your help!

// Edit by @matthiasgeiger: The problem with version 3.6 is fixed in the current development builds which are available at https://builds.jabref.org/master (for details see discussion below)

Siedlerchr commented 8 years ago

Hello @swamptromp, thanks for your report. I guess you ran into the Google limit: Google blocks your IP for a while if you send too many automated requests (spam/bot protection), as noted here: https://github.com/JabRef/jabref/issues/1694#issuecomment-238435978

A solution would be to use the browser add-on JabFox to import the entries.

There is currently nothing we can do and I am not sure if we could display a more specific dialog/error. However, I will create a new issue for that.

swamptromp commented 8 years ago

Ah, thanks so much for this @Siedlerchr! Great to know, glad it isn't a longer-term issue.

swamptromp commented 8 years ago

Hi, I have one more question. I verified my identity via the Google Scholar website, and although I can now get results through Google Scholar, JabRef still won't return google search requests, even after reinstalling it. Does this problem just go away after a set amount of time?

Siedlerchr commented 8 years ago

@swamptromp As it works for me, I can only try to give an explanation. I still think that your JabRef is blocked (it uses a specific user agent). Your Google Scholar settings in the browser (e.g. the fact that you are successfully authorized) are stored in a cookie. However, JabRef currently does not support any form of authentication and is therefore not able to store your account info.

Maybe we can add this in the future, which would mitigate the problem somewhat. In the meantime, I would suggest using other fetchers or importing the entries manually. If you have a DOI or ISBN for a paper/book/..., try the DOI to BibTeX/ISBN fetchers, as they directly resolve the number to a BibTeX entry.
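For the DOI route, one widely used mechanism is DOI content negotiation: requesting https://doi.org/ followed by the DOI with an `Accept: application/x-bibtex` header. The sketch below only builds such a request and does not send it. Whether JabRef's own DOI fetcher works exactly this way is an assumption, and the class and method names are hypothetical. It needs Java 11 or later for `java.net.http`.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper for illustration; not part of JabRef.
final class DoiBibtexRequest {

    /** Builds (but does not send) a GET request asking doi.org for a BibTeX record. */
    static HttpRequest forDoi(String doi) {
        // Assumes the DOI contains only characters that are legal in a URI path;
        // DOIs with unusual characters would need percent-encoding first.
        return HttpRequest.newBuilder(URI.create("https://doi.org/" + doi))
                .header("Accept", "application/x-bibtex")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = forDoi("10.1000/182");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().map());
    }
}
```

To actually fetch the entry, the request would be passed to an `HttpClient` configured to follow redirects, since doi.org answers with a redirect to the registration agency.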

tobiasdiez commented 8 years ago

@JabRef/developers did we try to get clearance/permission from Google for the user agent = JabRef? This might be worth a try.

mlep commented 8 years ago

@JabRef/developers : The help about Google Scholar states that

To unblock your IP, do a Google scholar search in your browser.
You will be asked to show that you are not a robot (a CAPTCHA challenge).

Is this trick currently valid?

Siedlerchr commented 8 years ago

@tobiasdiez I remember that the User Agent was previously set to JabRef, but that led to the problem with the non-UTF-8 response. Maybe we should contact Google?

oscargus commented 8 years ago

@mlep Yes, I think so. It was added quite recently.

oscargus commented 8 years ago

@Siedlerchr That was solved by explicitly asking for UTF-8, see #1785

Tercus commented 7 years ago

I am running into the same problem. When I visit the google scholar page in my browser I have no problems (I get redirected to scholar.google.de though). When I copy the URL from the error log I DO get the message to prove that I am human. Even weirder is that my search term does not show up anywhere in the URL.

Error log (newly opened, searched for "test"):

java.io.IOException: Server returned HTTP response code: 503 for URL: https://ipv4.google.com/sorry/IndexRedirect?continue=https://scholar.google.com/scholar%3Fhl%3Den%26oe%3DASCII%26num%3D20%26as_sdt%3D2006&hl=en&q=CGMSBFhDZhcYrbCMvwUiGQDxp4NLfnqIFz9y9s9bYeUiRVwogv02XcI
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1839) ~[?:1.8.0_60]
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1440) ~[?:1.8.0_60]
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:254) ~[?:1.8.0_60]
    at net.sf.jabref.logic.net.URLDownload.downloadToString(URLDownload.java:123) ~[JabRef-3.6.jar:?]
    at net.sf.jabref.gui.importer.fetcher.GoogleScholarFetcher.runConfig(GoogleScholarFetcher.java:166) ~[JabRef-3.6.jar:?]
    at net.sf.jabref.gui.importer.fetcher.GoogleScholarFetcher.processQueryGetPreview(GoogleScholarFetcher.java:82) ~[JabRef-3.6.jar:?]
    at net.sf.jabref.gui.importer.fetcher.GeneralFetcher.lambda$actionPerformed$4(GeneralFetcher.java:191) ~[JabRef-3.6.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_60]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_60]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_60]

  01:30:53.124 [JabRef CachedThreadPool] WARN  net.sf.jabref.gui.importer.fetcher.GoogleScholarFetcher - Error fetching from Google Scholar

stefan-kolb commented 7 years ago

That part should be your search term &q=CGMSBFhDZhcYrbCMvwUiGQDxp4NLfnqIFz9y9s9bYeUiRVwogv02XcI. Dunno why it is encoded in some way.
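Percent-decoding the URL from the log may help here. In the minimal sketch below (class and method names are hypothetical, not JabRef code; Java 10 or later is needed for `URLDecoder.decode(String, Charset)`), the `continue` parameter holds the request that was originally made. For the logged URL it decodes to https://scholar.google.com/scholar?hl=en&oe=ASCII&num=20&as_sdt=2006, which carries no search term at all; that fits the stack trace, where the failing call is runConfig() rather than the search itself. The `q` value on the /sorry/ page is presumably an opaque token issued by Google rather than an encoding of the search term, but that is an assumption.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of JabRef.
final class SorryUrlInspector {

    /** Splits the raw query string and percent-decodes each value once. */
    static Map<String, String> queryParams(String url) {
        Map<String, String> params = new LinkedHashMap<>();
        String rawQuery = URI.create(url).getRawQuery();
        if (rawQuery == null) {
            return params;
        }
        for (String pair : rawQuery.split("&")) {
            String[] kv = pair.split("=", 2);
            String value = kv.length > 1 ? kv[1] : "";
            params.put(kv[0], URLDecoder.decode(value, StandardCharsets.UTF_8));
        }
        return params;
    }

    /** Heuristic (an assumption): treat any path under /sorry/ as Google's block page. */
    static boolean looksLikeBlockPage(String url) {
        String path = URI.create(url).getPath();
        return path != null && path.startsWith("/sorry/");
    }

    public static void main(String[] args) {
        String url = "https://ipv4.google.com/sorry/IndexRedirect"
                + "?continue=https://scholar.google.com/scholar%3Fhl%3Den%26oe%3DASCII%26num%3D20%26as_sdt%3D2006"
                + "&hl=en&q=CGMSBFhDZhcYrbCMvwUiGQDxp4NLfnqIFz9y9s9bYeUiRVwogv02XcI";
        System.out.println(looksLikeBlockPage(url));
        System.out.println(queryParams(url).get("continue"));
    }
}
```

A check like `looksLikeBlockPage` could also be a starting point for showing a more specific error message when the 503 comes from the block page.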

matthiasgeiger commented 7 years ago

At the moment I don't have that much time, but I have started to look at it. The GoogleScholarFetcher for fetching entries seems to be broken at the moment.

The runConfig() method produces the error, and without the configuration the results won't have the expected format (BibTeX is not the default citation format; instead, some content is loaded via JS).

Fixing the configuration part looks complicated to me, but I'm not a JS expert and have not had the time to investigate in depth what happens when submitting https://scholar.google.com/scholar_settings and how to emulate this in JabRef.

Another approach would be to determine the "ID" of each shown article and then to call https://scholar.google.de/scholar?q=info:**ID_HERE**:scholar.google.com/&output=cite&scirp=0&hl=de to get the links for the BibTeX format; e.g.: https://scholar.google.de/scholar?q=info:RExzBa3OlkQJ:scholar.google.com/&output=cite&scirp=0&hl=de
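The second approach amounts to filling an article ID into a fixed URL pattern. A minimal sketch, assuming the pattern stays exactly as quoted above (the class and method names are hypothetical, and extracting the ID from the result page is not covered here):

```java
// Hypothetical helper for illustration; not part of JabRef.
final class ScholarCiteUrl {

    /**
     * Builds the "cite" URL for one search result.
     *
     * @param host Scholar host to query, e.g. "scholar.google.de"
     * @param id   article ID taken from the result page, e.g. "RExzBa3OlkQJ"
     * @param lang interface language for the hl parameter, e.g. "de"
     */
    static String forId(String host, String id, String lang) {
        return "https://" + host + "/scholar?q=info:" + id
                + ":scholar.google.com/&output=cite&scirp=0&hl=" + lang;
    }

    public static void main(String[] args) {
        System.out.println(forId("scholar.google.de", "RExzBa3OlkQJ", "de"));
    }
}
```

Called with the values from the example, this reproduces the example URL; the page behind it would then still have to be parsed for the BibTeX link.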

~~Someone wants to investigate this? Perhaps @JabRef/stupro? 😜 ~~

I'm working on it.

lenhard commented 7 years ago

A very controversial suggestion: The google scholar fetcher is a pain in the ass. If they change their format / interface every two weeks, it is not something you can reasonably program against. Maybe we should drop support for it.

matthiasgeiger commented 7 years ago

Okay... I got it working again. However, there is now a rather strict limit on the Google side, so it is only possible to show and import the first 10 search results... (See #2082)

Fixed version is available at http://builds.jabref.org/fix-googlescholar/

koppor commented 7 years ago

Screen scraping is still the most popular method of doing EAI. The most common reason is the lack of an API, as in our case. As long as we find someone to update the code, we should keep it. (Also refs #1833)

Siedlerchr commented 7 years ago

I would also say that Google Scholar is an important fetcher, which is used by many users.

mlep commented 7 years ago

I concur, and would stress that Google Scholar is (unfortunately for JabRef developers) a primary source of data: accessible for free and covering all scientific fields.

lenhard commented 7 years ago

I had expected no other reply :-)

We will continue to have to do weird hacks in the fetcher and manage constant questions from people when it breaks, but what choice do we have...

tobiasdiez commented 7 years ago

support Microsoft's Bing academics 😸 (they have at least an API)

Siedlerchr commented 7 years ago

@tobiasdiez it's literally dead: http://blogs.nature.com/news/2014/05/the-decline-and-fall-of-microsoft-academic-search.html

tobiasdiez commented 7 years ago

Well, "what is dead may never die, but rises again, harder and stronger."

Conclusion: In comparison to the Web of Science and Scopus, Microsoft Academic covers a far larger number of publications that are listed in Google Scholar and – importantly – covers all journal publications and books that are also covered in Google Scholar. This suggests that Microsoft Academic has excellent coverage of what are usually considered to be the most important academic outputs: journal articles and books.

http://www.harzing.com/download/mas.pdf

I think Microsoft changed the way Bing Academics is fed with data. Previously, it used its own crawler, which was then suspended. But now it is connected to Bing's main crawler and thus gets good and up-to-date data. Otherwise it wouldn't be listed as part of the cognitive services, which until recently were a kind of test field or place for beta-stage services and are now moving into "production".

Tercus commented 7 years ago

It would be nice to have a plugin system for the whole search stuff. While it is nice to have a large choice of different search APIs, most are unknown to people not working in that field, and every change in an API requires a new version of JabRef. If the search part used a plugin, then you could alter it yourself and write your own. Also, you could split the maintenance of the search APIs from the main project and just make it download them on demand... or something like that.

I also don't understand why some of the big players aren't included, such as WorldCat and many others.

koppor commented 7 years ago

@Tercus I think you have often heard this in the context of open source projects, but I'll try to rephrase it for JabRef. The JabRef team consists solely of volunteers spending their free time on JabRef. They could be finishing their PhD or postdoc phase, but they invest time in JabRef simply because they like to. There is no funding agency, and the donations are not used to cover our living costs. They are also far from enough to do so, and not enough to pay someone to do the work we don't like to do.

Regarding plugins, we decided to drop support for them in version 3.0. This was not because plugins are bad per se, but because they increase our maintenance effort tremendously. We decided that reducing the number of issues and having more (other) features in JabRef is more important. Moreover, having no plugin support ensures that all functions in JabRef remain up to date with the rest of the JabRef code. Thus, changing internal data structures does not break any plugin, because we ensure that everything works during an internal change.

Having the code integrated in JabRef ensures that we do not rely on maintenance of third parties. The experience we have in JabRef is that people are working for JabRef and its plugins during their PhD and then move on to new things. Thus, it is not ensured that a plugin is maintained for a long time. Including it in JabRef really increases the probability that it is maintained.

Using TravisCI and offering all builds at https://builds.jabref.org/ ensures that fixes in the fetchers are available to the public as fast as possible.

We are also working on integrating all plugins into JabRef (see https://github.com/JabRef/jabref/issues/152). And we did that for the GVK Fetcher (https://github.com/JabRef/jabref/pull/378/).

Regarding WorldCat - the JabRef user @ChristopherHackett volunteered to work on that: https://github.com/JabRef/jabref/issues/1065.

Regarding other fetchers, I think the answer is partially given in the first paragraph. If some maintenance work were taken off our hands (some hard tasks are listed in #111), that would leave us some time to do these things in JabRef. We are also aware of more than 500 feature requests from the old SourceForge tracker (see https://github.com/JabRef/jabref/wiki/FeatureRequests-Sorted).

What would help us is if someone helped maintain our help pages (see https://github.com/JabRef/help.jabref.org/blob/gh-pages/CONTRIBUTING.md for a guide and https://github.com/JabRef/help.jabref.org/issues for a list of issues to start with), provided answers in the Discourse forum, and transferred answers from Discourse to our help pages. Maybe this could be a way for you to support JabRef?

Tercus commented 7 years ago

@koppor Thank you for your explanation. I'll try to give back to the project as best I can. I understand that a plugin system would be more work, but at the same time I think that it would be easier for people like me to contribute to a plugin written in a scripting language than to figure out how JabRef works. It is quite daunting, to be honest. But I'll still try.

I work in the social sciences and so far, none of the scrapers have been useful. Google Scholar worked until I got spam-banned because I was searching too often (apparently?). So right now I am stuck with having to search Google Scholar in my browser and then import the BibTeX code into JabRef.

I'll try to get WorldCat running, maybe I'm lucky....

mlep commented 7 years ago

@tercus: Are you aware of JabFox? It is quite handy. See https://addons.mozilla.org/en-us/firefox/addon/jabfox/

matthiasgeiger commented 7 years ago

And the current development builds are also capable of searching Google Scholar again (with the limitation of only showing the first 10 results of a query).

DesBw commented 7 years ago

Thank you...the new version has fixed it

stefan-kolb commented 7 years ago

@koppor wants to keep https://github.com/JabRef/jabref/issues/2173 open for now.