Backend for html only search engines

cpitclaudel / biblio.el

Browse and import bibliographic references from CrossRef, DBLP, HAL, arXiv, Dissemin, and doi.org from Emacs

GNU General Public License v3.0


Backend for html only search engines #38

Open oatmealm opened 3 years ago

oatmealm commented 3 years ago

Looking at the API seems like it was meant for sites that return xml or json... is there an example of working maybe with css selectors directly on the html returned to a query when that's the only option?

cpitclaudel commented 3 years ago

> Looking at the API seems like it was meant for sites that return xml or json

Not really; basically each backend is responsible for extracting the data and returning it in structured form.

> is there an example of working maybe with css selectors directly on the html returned to a query when that's the only option?

There was https://github.com/cpitclaudel/biblio.el/pull/25/files , but it uses regexp, I think. You'd want to use libxml + some query selector engine (maybe https://github.com/zweifisch/enlive?) or direct recursion. I can help if you have a concrete example.
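For illustration, here is a minimal sketch of the "libxml + direct recursion" route using only Emacs built-ins (`libxml-parse-html-region` and `dom.el`). The markup it expects (one `<div class="result">` per hit, each containing a link) and the name `my/html-extract-results` are assumptions made up for the example, not something taken from biblio.el or from any real site. It also requires an Emacs built with libxml2 support.

```elisp
(require 'dom)
(require 'subr-x) ; for `string-trim' on older Emacs versions

(defun my/html-extract-results (html)
  "Return a list of (TITLE . URL) pairs extracted from the string HTML.
Assumes a hypothetical page where each hit is a <div class=\"result\">
containing an <a> element; adapt the class and tag names to the real site."
  (with-temp-buffer
    (insert html)
    ;; Parse the whole buffer into a DOM tree (nested lists).
    (let ((dom (libxml-parse-html-region (point-min) (point-max))))
      (delq nil
            (mapcar
             (lambda (node)
               ;; Take the first link found anywhere below the result node.
               (let ((link (car (dom-by-tag node 'a))))
                 (when link
                   (cons (string-trim (dom-texts link))
                         (dom-attr link 'href)))))
             ;; `dom-by-class' matches its argument as a regexp
             ;; against the class attribute.
             (dom-by-class dom "result"))))))
```

A backend's parsing step could call something like this on the response body and then map each pair to whatever structured form it returns.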

oatmealm commented 3 years ago

Hi. Thanks for the reply.

I'm looking at Israel's "Union List" (National Library), which seems to be a hosted Exlibris Primo site (I'm guessing). It's a convoluted and rather slow Angular based site, it seems. No API as far as I can tell.

Here's a sample query (in English):

http://merhav.nli.org.il/primo-explore/search?query=any,contains,postcolonial&tab=default_tab&search_scope=ULI&vid=ULI&lang=en_US&offset=0&fromRedirectFilter=true

Fiddling around I also found this "bare" query form: http://merhav.nli.org.il/primo_library/libweb/webservices/rest/primo-explore/v1/search.do?mode=Advanced&ct=AdvancedSearch

Would the Google Scholar example be easy to adapt in this case?

cpitclaudel commented 3 years ago

The API seems to be at http://merhav.nli.org.il/primo_library/libweb/webservices/rest/primo-explore/v1/pnxs , but it apparently requires a cookie.

> Would the Google Scholar example be easy to adapt in this case?

I don't think so. This seems to be a dynamic website, so parsing the HTML won't give you anything, since it doesn't contain the results. However, it should be possible to get the JSON that the API returns and the website uses. I would recommend writing to the website's authors at this point.

oatmealm commented 3 years ago

I was asking around. They had a hackathon a few years back to test out an IIIF-based API, but it seems it didn't go anywhere.

https://github.com/OriHoch/hackathon-tasks/issues/1

cpitclaudel commented 3 years ago

I see. I think you can ask about the current API though: clearly the website is a JavaScript program that downloads JSON data; you should be able to download that same JSON data from ELisp; you just need to figure out the exact query and headers, and they should be able to help with that, I think.
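As a rough sketch of what downloading that JSON from ELisp could look like, the following uses only the built-in `url` and `json` libraries. The function names `my/parse-json-response` and `my/fetch-json` are hypothetical, and the headers passed in are placeholders: the exact query parameters, headers, and any cookie the site needs are still unknown.

```elisp
(require 'url)
(require 'json)

(defun my/parse-json-response ()
  "Parse the JSON body of the HTTP response in the current buffer.
Objects become alists and arrays become lists."
  (goto-char (point-min))
  ;; The blank line separates the HTTP headers from the body.
  (re-search-forward "\r?\n\r?\n")
  (let ((json-object-type 'alist)
        (json-array-type 'list))
    (json-read)))

(defun my/fetch-json (url headers)
  "Fetch URL synchronously, sending HEADERS, and return the parsed JSON.
HEADERS is an alist of (NAME . VALUE) strings."
  (let* ((url-request-extra-headers headers)
         (buffer (url-retrieve-synchronously url t)))
    (unless buffer
      (error "No response from %s" url))
    (with-current-buffer buffer
      (unwind-protect
          (my/parse-json-response)
        (kill-buffer buffer)))))
```

A call would then look like `(my/fetch-json "http://merhav.nli.org.il/primo_library/libweb/webservices/rest/primo-explore/v1/pnxs" '(("Accept" . "application/json")))`, with the real query string and headers filled in once they are known.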