JabRef / jabref

Graphical Java application for managing BibTeX and biblatex (.bib) databases
https://devdocs.jabref.org
MIT License

Introduce Web-scraping inside JabRef #11093

Open koppor opened 7 months ago

koppor commented 7 months ago

Currently, our web search sends search strings to API endpoints and then interprets the results. In other words, we have two kinds of fetchers: those that use an API key and those that rely on screen scraping. Most of the screen scrapers don't work. We should switch to browser-based screen scraping, mostly because of Cloudflare.

JabRef should display the HTML page inside JabRef and offer to scrape the citations directly from the page, similar to what BibDesk does.


Maybe the Java Chromium Embedded Framework (JCEF) helps. The test class https://github.com/chromiumembedded/java-cef/blob/master/java/tests/detailed/handler/RequestHandler.java seems to show how it can be used.


The PR https://github.com/JabRef/jabref/pull/7075 attempted to display the Google Scholar captchas in JabRef, but it was not completed. This issue asks for the fetchers to be rewritten to use JCEF instead of URLDownload.

Note that this is different from https://github.com/JabRef/jabref/issues/11093, which asks for a new UI.

Here, the fetchers should be able to run stand-alone, without user interaction.


Affected fetchers:

Some of these fetchers use an API for searching. In that case, findFullText is the only method that handles HTML.

Siedlerchr commented 7 months ago

Works now, was probably a temporary glitch

Siedlerchr commented 6 months ago

I checked the BibDesk code: they basically use a Safari-based view control and a simple XPath query to check for matching links in the document's DOM. The parsing itself is very similar to our existing fetcher infrastructure. I experimented a bit with JavaFX's WebView. It can display websites and even captchas, e.g. on Google Scholar, but I was not yet able to get the correct DOM after clicking on some page. This would require some further testing.
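
To illustrate the XPath idea mentioned above, here is a minimal sketch using only the JDK's built-in XML and XPath classes. It is not BibDesk's or JabRef's code: the class name `CitationLinkFinder`, the method `findLinks`, and the `.bib` marker are hypothetical, and the sketch assumes the page is already available as well-formed XHTML (real-world HTML would usually need a lenient parser or a browser-provided DOM first).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only.
public class CitationLinkFinder {

    /** Returns the href of every <a> element whose href contains the given marker. */
    public static List<String> findLinks(String xhtml, String marker) throws Exception {
        // Assumption: the input is well-formed XHTML without namespaces.
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        // The marker is inlined for brevity, so it must not contain quotes.
        String query = "//a[contains(@href, '" + marker + "')]/@href";
        NodeList hits = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(query, document, XPathConstants.NODESET);

        // Collect the attribute values of all matching links.
        List<String> links = new ArrayList<>();
        for (int i = 0; i < hits.getLength(); i++) {
            links.add(hits.item(i).getNodeValue());
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<a href=\"/paper/1.bib\">BibTeX</a>"
                + "<a href=\"/about\">About</a>"
                + "</body></html>";
        System.out.println(findLinks(page, ".bib"));
    }
}
```

The same query could in principle be run against a DOM obtained from an embedded browser, provided that DOM reflects the page state after user interaction.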

koppor commented 5 months ago

Related work: https://github.com/HtmlUnit/htmlunit?tab=readme-ov-file#getting-started

ThiloteE commented 5 months ago

When it comes to scraping, I have seen jsoup being mentioned a lot: https://jsoup.org/ See also https://stackoverflow.com/questions/2835505/how-to-scan-a-website-or-page-for-info-and-bring-it-into-my-program