Introduce Web-scraping inside JabRef

koppor commented 7 months ago

Currently, our web search sends search strings to API endpoints and then interprets the results. In other words, we have fetchers that use an API key and fetchers that rely on screen scraping. The screen scrapers mostly don't work. We should switch to browser-based screen scraping, mostly because of Cloudflare.

JabRef should display the HTML page inside JabRef and offer to scrape the citations directly from the page, similar to what BibDesk does.

Maybe the Java Chromium Embedded Framework (JCEF) can help. The test class https://github.com/chromiumembedded/java-cef/blob/master/java/tests/detailed/handler/RequestHandler.java seems to show how it is used.

The PR https://github.com/JabRef/jabref/pull/7075 attempted to display the Google Scholar captchas in JabRef. The PR was not completed. This issue proposes rewriting the fetchers to use JCEF instead of URLDownload.

Note that this is different from https://github.com/JabRef/jabref/issues/11093. There, a new UI is demanded.

Here, the fetchers should be able to run stand-alone without user interaction.

Affected fetchers:

ACS: org.jabref.logic.importer.fetcher.ACS
Google Scholar: org.jabref.logic.importer.fetcher.GoogleScholar
IACR: org.jabref.logic.importer.fetcher.IacrEprintFetcher
JSTOR: org.jabref.logic.importer.fetcher.JstorFetcher
ResearchGate: org.jabref.logic.importer.fetcher.ResearchGate
ScienceDirect: org.jabref.logic.importer.fetcher.ScienceDirect
SpringerLink: org.jabref.logic.importer.fetcher.SpringerLink

Sometimes, the API is used. In that case, findFullText is the only method that handles HTML.

Siedlerchr commented 7 months ago

Works now, was probably a temporary glitch

Siedlerchr commented 6 months ago

I checked the BibDesk code: they basically use a Safari-based view control and a simple XPath query to check for matching links in the document's DOM. The parsing itself is very similar to our existing fetcher infrastructure. I experimented a bit with JavaFX's WebView. It can display websites and even captchas, e.g. on Google Scholar, but I was not yet able to get the correct DOM after clicking on some page. This would require some further testing.
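
To illustrate the XPath idea, here is a minimal sketch that uses only the JDK's javax.xml.xpath API. It is not BibDesk's or JabRef's actual code: the class name CitationLinkFinder, the helper methods, and the ".bib" link pattern are assumptions made up for this example. In a WebView-based setup, the org.w3c.dom.Document would presumably come from WebEngine.getDocument() instead of being parsed from a string, and element-name case in a live HTML DOM may differ from this well-formed XHTML sample, so the query would need checking there.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of JabRef or BibDesk.
class CitationLinkFinder {

    // Assumed pattern: citation export links contain ".bib" in their href.
    private static final String BIB_LINK_XPATH = "//a[contains(@href, '.bib')]/@href";

    // Returns the href values of all links in the DOM that match the pattern.
    static List<String> findCitationLinks(Document dom) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hits = (NodeList) xpath.evaluate(BIB_LINK_XPATH, dom, XPathConstants.NODESET);
            List<String> links = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                links.add(hits.item(i).getNodeValue());
            }
            return links;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    // Parses a well-formed XHTML string; stands in for the DOM a browser view would provide.
    static Document parseXhtml(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<a href=\"/export/paper1.bib\">BibTeX</a>"
                + "<a href=\"/about\">About</a>"
                + "</body></html>";
        System.out.println(findCitationLinks(parseXhtml(page))); // prints [/export/paper1.bib]
    }
}
```

The matched links could then be handed to the existing fetcher and parser infrastructure, which is where the approach overlaps with what we already have.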

koppor commented 5 months ago

ThiloteE commented 5 months ago

When it comes to scraping, I have seen jsoup mentioned a lot: https://jsoup.org/ See also https://stackoverflow.com/questions/2835505/how-to-scan-a-website-or-page-for-info-and-bring-it-into-my-program

Introduce Web-scraping inside JabRef #11093