Public API via XML, JSON, ...

t-wissmann commented 3 years ago

Is there some kind of public API that can be used to query the database? I would be interested, for example, in querying the list of volumes (with IDs and names). For lmcs.episciences.org, I wrote a script that checks that the order of the papers in a given volume is correct and that the DOI reported by episciences matches the DOI in the PDF.

So far, this script extracts all the required information from the HTML page returned by the web server, but of course it would be preferable to obtain the list of papers in a given volume or the list of volumes in a more direct format (e.g. XML or JSON). Is this possible at the moment? Or do other kinds of more direct/REST-like queries to the episciences platform exist?

Edit: it is not necessary that the API is publicly callable; it would already help us if authenticated users could make such queries.

rtournoy commented 3 years ago

Not at the moment, but we have an internal API that is currently used for statistics, and the idea is to open it at some point in the future and expand its usage. Do you need to list/access only published papers, or also papers that are not yet published?

t-wissmann commented 3 years ago

For many tasks, having access to published papers (and their volumes) would already help a lot. For some other tasks, we also need access to not-yet-published papers (namely for tools used during the publication process). But it would already help to have API access to data that is already publicly available anyway (via HTML).

t-wissmann commented 2 years ago

I think the API we need at LMCS can easily be accomplished. For example, the following API functions would help us a lot. Note that all the examples below are read-only, because they only provide the data that is currently present in HTML pages.

1. Querying the list of all volumes, i.e. the information on the volumes page, in JSON format. This means that a query to
```
https://lmcs.episciences.org/rest/list-volumes?regular=true&empty=false
```
returns a JSON list of all regular non-empty volumes, together with their volume IDs.
2. Similarly,
```
https://lmcs.episciences.org/rest/list-papers?status=16&volume=591
```
should return a list of the IDs of the papers that were published (status=16) in the specified volume 591. If the user is logged in and has the right permissions, it should also be possible to query the list of papers with other status IDs, e.g. status=4 (accepted), which would be the information already present on the manage articles page.
3. Paper metadata:
```
https://lmcs.episciences.org/rest/papers-info?id=PAPER_ID
```
should print the metadata (title, arXiv URL, DOI, authors, volume ID, secondary volume IDs, submission date, publication date, ...) of the given paper that is listed in the HTML of the paper page https://lmcs.episciences.org/PAPER_ID. Of course, this request should only succeed if the paper is published.
4. Paper administration data:
```
https://lmcs.episciences.org/rest/administratepaper?id=PAPER_ID
```
This should be the same as the 'paper metadata' above, but it should also work for unpublished papers, provided that the logged-in user has the required administration rights, just as is already the case with the administratepaper pages.
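As a rough illustration of how a client script might assemble the queries proposed above, here is a minimal Python sketch. The `/rest/...` endpoints are only a proposal in this comment and do not exist on the platform; `proposed_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Prefix taken from the proposal above; these endpoints are hypothetical.
BASE = "https://lmcs.episciences.org/rest"


def proposed_url(endpoint: str, **params) -> str:
    """Build a URL for one of the proposed read-only queries."""
    query = urlencode(params)  # keeps keyword order, escapes values
    return f"{BASE}/{endpoint}?{query}" if query else f"{BASE}/{endpoint}"


print(proposed_url("list-volumes", regular="true", empty="false"))
print(proposed_url("list-papers", status=16, volume=591))
```

The two printed URLs match the first two examples in the list.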

What do you think about these kinds of queries? If you prefer, I can create separate issues for these four (and possibly further) examples :-)

rtournoy commented 2 years ago

What if we use the same URLs with a different header to trigger a JSON response? E.g.: curl -H "Accept: application/json" "https://lmcs.episciences.org/browse/regularissues"

a3nm commented 2 years ago

@rtournoy I think this is also fine!

t-wissmann commented 2 years ago

> What if we use the same URLs with a different header to trigger a JSON response? E.g.: curl -H "Accept: application/json" "https://lmcs.episciences.org/browse/regularissues"

This would be perfect for us!

rtournoy commented 2 years ago

About [1] "Querying the list of all volumes", can you please try the live examples:

curl -H "Accept: application/json" "https://epijinfo.episciences.org/browse/regularissues"

curl -H "Accept: application/json" "https://epijinfo.episciences.org/browse/volumes"

curl -H "Accept: application/json" "https://epijinfo.episciences.org/browse/section"
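
For scripts written in Python rather than shell, the same content negotiation can be sketched with the standard library alone. The helper names `json_request` and `fetch_json` are made up for this illustration, and since the structure of the returned JSON is not described in this thread, the sketch only decodes it without assuming any fields.

```python
import json
from urllib.request import Request, urlopen


def json_request(url: str) -> Request:
    """Build a GET request that asks the server for JSON instead of HTML."""
    return Request(url, headers={"Accept": "application/json"})


def fetch_json(url: str, timeout: float = 30.0):
    """Send the request and decode the body; raises on HTTP or JSON errors."""
    with urlopen(json_request(url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_json("https://epijinfo.episciences.org/browse/volumes")` would then correspond to the second curl command above.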

t-wissmann commented 2 years ago

Thanks a lot! The output looks great! I'm only wondering whether including the list of all papers for each volume might cause too much load on the server.

rtournoy commented 2 years ago

About [2] we have added: Volumes (only for published articles):

curl -H "Accept: application/json" "https://epijinfo.episciences.org/volume/view/id/3"

and Sections (only for published articles):

curl -H "Accept: application/json" "https://epijinfo.episciences.org/section/view/id/3"

To get the volume and all articles, with all statuses, you can use:

curl -H "Accept: application/json" "https://rvcode.episciences.org/volume/all/?id=3"

e.g.:

curl -H "Accept: application/json" "https://epijinfo.episciences.org/volume/all/id/3"

Authentication and a matching role are required.

rtournoy commented 2 years ago

About [3]: We have a new public export format: e.g.: https://epijinfo.episciences.org/54/json
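
A script could wrap this export in a small helper, sketched below under the assumption that every journal follows the same `<journal>.episciences.org/<paper_id>/json` pattern as the example URL. `paper_json_url` and `fetch_paper` are hypothetical names, and no fields of the returned JSON are assumed.

```python
import json
from urllib.request import urlopen


def paper_json_url(journal: str, paper_id: int) -> str:
    """URL of the public JSON export, following the pattern of the example above."""
    return f"https://{journal}.episciences.org/{paper_id}/json"


def fetch_paper(journal: str, paper_id: int, timeout: float = 30.0):
    """Download and decode the export; raises on HTTP or JSON errors."""
    with urlopen(paper_json_url(journal, paper_id), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For instance, `paper_json_url("epijinfo", 54)` reproduces the URL given above.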

t-wissmann commented 2 years ago

Thanks! This looks very nice! I'm looking forward to it in production :)

rtournoy commented 2 years ago

OK, you can try it online with v1.0.23. To be continued...

TobiasKappe commented 1 year ago

Hey! At the end of our discussion on Zoom a few weeks back I promised I'd get back to you with a couple of points where we still have to manually parse HTML in our automation. It took me a while, but I've finally gotten around to looking at what one of our tools (LMCSBot) does, and here's what I found:

Right now, we query the AJAX endpoint administratepaper/list to get a list of articles of certain statuses. This works fairly well, except that we then also have to parse the HTML that is embedded inside this JSON to get the raw information out. Perhaps this endpoint could change to just return pure JSON, or maybe a different one can be made for this purpose?
While synchronizing its local view of the articles with Episciences, our tool also needs to know the arXiv paper ID and version number. This information does not seem to be included in the JSON returned by administratepaper/list, but is found in the <paper_id>/json interface that was added in January. This means that, in some cases, we need to issue two requests per article. Would it be possible to also expose these two fields in the administratepaper/list interface, or its full JSON successor?
Our automation needs to send email on behalf of the person using it, but uses a single account to interact with Episciences (whose password is not known to the user of the tool). Right now, we cope with this by temporarily setting the email address linked to this account to that of the user through the profile page, but that's a bit of a hack. It would be great if the forms that trigger an email sent to the author of an article also allowed one to configure the "reply-to" email header, so that this is not necessary.
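
To give an idea of the extra work described in the first point, the HTML-in-JSON parsing looks roughly like the following sketch. The sample payload is invented purely for illustration (the real layout of the `administratepaper/list` response is not shown in this thread), and `TextExtractor` and `cell_text` are hypothetical helper names, not LMCSBot code.

```python
import json
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def cell_text(html_fragment: str) -> str:
    """Return the text content of one HTML cell embedded in the JSON."""
    parser = TextExtractor()
    parser.feed(html_fragment)
    parser.close()
    return " ".join(parser.chunks)


# Invented sample: a JSON document whose cells are HTML snippets.
sample_json = json.dumps(
    {"rows": [['<a href="/1234">1234</a>', '<span class="label">Accepted</span>']]}
)

payload = json.loads(sample_json)
rows = [[cell_text(cell) for cell in row] for row in payload["rows"]]
print(rows)
```

With a pure JSON endpoint, the `TextExtractor` step would disappear and the values could be read directly after `json.loads`.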

Happy to discuss any of these further.

CCSDForge / episciences

Public API via XML, JSON, ... #37