Closed JaimieMurdock closed 5 years ago
I'm not crazy about the hack in search.controllers.api to add fields to support the Atom feed.
Agreed. Per our discussion around #234 (review), my understanding was that you would implement a different domain class to describe classic API requests, that was fitted more closely to the query-string model, and that this would then get handled as a query_string query in search.services.index. But that part may be outside the scope of this PR?
That sounds in-scope for the overall project, and an architectural decision to discuss. I chose to use the same class to describe classic API requests, the primary reason being feature parity between old and new APIs - classic_search does the migration by just renaming search_query to query. This gives us, essentially for free, support for the new API's field-specific query params. It also means we only have one place to handle interfacing with search.services.index in controllers - after classic requests are "translated" to the new-style in the controller, both are using search.controllers.api.search to get result sets.
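To make the "translation" step described above concrete, here is a minimal sketch of the rename. The function name translate_classic_params is hypothetical and is not taken from the PR; only the parameter names search_query and query come from the discussion.

```python
def translate_classic_params(params):
    """Return a copy of classic API params using the new-style name.

    Hypothetical helper for illustration only: it renames
    ``search_query`` to ``query`` and leaves everything else alone.
    """
    translated = dict(params)  # copy, so the caller's mapping is untouched
    if "search_query" in translated:
        translated["query"] = translated.pop("search_query")
    return translated


classic = {"search_query": "all:electron", "start": "0", "max_results": "1"}
print(translate_classic_params(classic))
```

A plain rename like this keeps a single code path into the index service, but it cannot express nested boolean queries on its own.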
Are we indexing affiliations?
https://github.com/arXiv/arxiv-search/blob/master/mappings/DocumentMapping.json#L226
I'll go through and find some examples that should have affiliations and use those for testing. I'm going to defer action on affiliations to ARXIVNG-2043.
The ID schema for the API results isn't yet solidified. In the previous arXiv API, each resultset was given a unique ID so it could be recalled.
I'm not familiar with that feature, but my initial reaction is that it sounds stateful and gross. @mhl10 would it rain fire if we said that we weren't going to support that feature anymore?
This is based on the example at https://arxiv.org/help/api - even running the example live doesn't result in a working URL though. Perhaps this should somehow be replaced with a url that reissues the query?
Also the <title> tag in the new method shows the "exploded" parameters, rather than the raw query_string. Just want to make sure that where I go off-script gets another pair of eyes.
<title xmlns="http://www.w3.org/2005/Atom">ArXiv Query: search_query=all:electron&id_list=&start=0&max_results=1</title>
<id xmlns="http://www.w3.org/2005/Atom">http://arxiv.org/api/cHxbiOdZaP56ODnBPIenZhzg5f8</id>
I didn't implement an AtomSerializer.serialize_document() yet. I'm debating whether it makes sense to have a single-entry feed.
If it isn't supported in the classic API, then no need. NotImplementedError looks like the right way to go to me.
As the id_list parameter also gets handled by search.controllers.api.classic_search, I'll leave that NotImplementedError in place.
I'm getting a Content-Type: text/html; charset=utf-8 header on the response; at some point we will want that to be Content-Type: application/atom+xml; charset=UTF-8. Might be out of scope, but just noticed it when I spun up the examples.
Definitely in scope, will fix.
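For reference, one way to set that header is sketched below. It assumes the route is a Flask view (the project is started with flask run), and relies on Flask accepting a (body, status, headers) tuple as a view return value. The helper name atom_response is hypothetical.

```python
ATOM_CONTENT_TYPE = "application/atom+xml; charset=UTF-8"


def atom_response(xml_body, status=200):
    """Build a (body, status, headers) triple for an Atom feed.

    Hypothetical helper: a Flask view can return this tuple directly,
    and the explicit header replaces the text/html default.
    """
    headers = {"Content-Type": ATOM_CONTENT_TYPE}
    return xml_body, status, headers


body, status, headers = atom_response("<feed/>")
print(status, headers["Content-Type"])
```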
I'm not crazy about the hack in search.controllers.api to add fields to support the Atom feed.
Agreed. Per our discussion around #234 (review), my understanding was that you would implement a different domain class to describe classic API requests, that was fitted more closely to the query-string model, and that this would then get handled as a query_string query in search.services.index. But that part may be outside the scope of this PR?
That sounds in-scope for the overall project, and an architectural decision to discuss. I chose to use the same class to describe classic API requests, the primary reason being feature parity between old and new APIs - classic_search does the migration by just renaming search_query to query. This gives us, essentially for free, support for the new API's field-specific query params. It also means we only have one place to handle interfacing with search.services.index in controllers - after classic requests are "translated" to the new-style in the controller, both are using search.controllers.api.search to get result sets.
We discussed this on 12 March, and the outcomes of that discussion were:
- The APIQuery model does not work for the classic API, because it does not support the boolean operations required by the classic API.
- We need a ClassicAPIQuery that does not use FieldedSearchTerms and friends, but instead focuses on the nested query-string structure that can be easily translated from the API query and translated to the ElasticSearch query_string query.
The ID schema for the API results isn't yet solidified. In the previous arXiv API, each resultset was given a unique ID so it could be recalled.
I'm not familiar with that feature, but my initial reaction is that it sounds stateful and gross. @mhl10 would it rain fire if we said that we weren't going to support that feature anymore?
This is based on the example at https://arxiv.org/help/api - even running the example live doesn't result in a working URL though. Perhaps this should somehow be replaced with a url that reissues the query?
Gotcha. I think that your suggestion is a good one -- just return the full URI for the current query. We will need to make it clear in documentation that this does not preserve the result set at the time of the original query (although hopefully that would be obvious).
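A rough sketch of "return the full URI for the current query", using only the standard library. The base URL below is a placeholder and the function name query_uri is hypothetical; sorting the parameters is an extra assumption, made so that equivalent requests yield the same URI.

```python
from urllib.parse import urlencode


def query_uri(base_url, params):
    """Return a URI that reissues the query described by ``params``.

    Parameters are sorted by name so the result does not depend on
    the order in which the client supplied them.
    """
    return f"{base_url}?{urlencode(sorted(params.items()))}"


# Placeholder base URL, for illustration only.
print(query_uri("https://example.org/api/query",
                {"start": "0", "search_query": "all:electron"}))
```

Following such a URI runs the query again, so it returns current results rather than the result set as it was at the time of the original request.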
Also the <title> tag in the new method shows the "exploded" parameters, rather than the raw query_string. Just want to make sure that where I go off-script gets another pair of eyes.
Per RFC4287§4.2.14:
The "atom:title" element is a Text construct that conveys a human-readable title for an entry or feed.
So as long as we are conveying the same information to a human reader, it's fine if it doesn't match the legacy API exactly.
Can I take even more liberties with this and create something actually human-readable?
Old:
ArXiv Query: search_query=all:electron&id_list=&start=0&max_results=1
New:
arXiv Query: size: 50; terms: AND all=none; OR title=universes; include_fields: ['paper_id_v', 'paper_id', 'href', 'canonical', 'version', 'title', 'abstract', 'submitted_date', 'updated_date', 'comments', 'journal_ref', 'doi', 'primary_classification', 'secondary_classification', 'authors']
Proposed:
arXiv Search: all=none OR title=universes
Sounds reasonable to me!
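As an illustration of the proposed title format, here is a small sketch. It assumes the query terms are available as (operator, field, value) triples, which is a made-up representation for this example and not the PR's actual domain model. The operator of the first term is dropped, matching the proposed string above.

```python
def feed_title(terms):
    """Build a human-readable feed title from (operator, field, value) terms.

    The first term's operator is omitted, so
    [("AND", "all", "none"), ("OR", "title", "universes")]
    reads as "all=none OR title=universes".
    """
    parts = []
    for index, (operator, field, value) in enumerate(terms):
        clause = f"{field}={value}"
        parts.append(clause if index == 0 else f"{operator} {clause}")
    return "arXiv Search: " + " ".join(parts)


print(feed_title([("AND", "all", "none"), ("OR", "title", "universes")]))
```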
@Trumbore @JaimieMurdock In your extend_atom(self, atom_feed) method, atom_feed is an ElementTree root. So a minimal way to remove the generator element would be:
atom_feed.remove(atom_feed.find('./generator'))
You may want to make that a bit more robust by checking that find() actually returns an element. But that should get you started.
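A slightly more defensive version of that one-liner might look like the sketch below. It uses the standard-library ElementTree on a toy feed rather than the real feedgen output. Checking both an unqualified and an Atom-namespaced tag is an assumption, since the thread does not say which form the generated tree uses.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def remove_generator(atom_feed):
    """Remove the <generator> child of the feed root, if there is one.

    Returns True when an element was removed, False otherwise.
    """
    # Try the unqualified tag first, then the Atom-namespaced one.
    for path in ("./generator", f"./{{{ATOM_NS}}}generator"):
        element = atom_feed.find(path)
        if element is not None:  # find() returns None when nothing matches
            atom_feed.remove(element)
            return True
    return False


feed = ET.fromstring(
    "<feed><title>arXiv Search</title><generator>example</generator></feed>"
)
print(remove_generator(feed))  # an element was found and removed
print(remove_generator(feed))  # nothing left to remove, and no exception
```

Without the None check, remove() raises an error whenever the feed has no generator element.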
The ID schema for the API results isn't yet solidified. In the previous arXiv API, each resultset was given a unique ID so it could be recalled.
I'm not familiar with that feature, but my initial reaction is that it sounds stateful and gross. @mhl10 would it rain fire if we said that we weren't going to support that feature anymore?
The id field is part of the Atom spec and simply represents a universally unique identifier for the feed. I don't think we need to generate the id value in exactly the same way as in classic, but we should support it. It could be something as simple as b64 encoding or md5sum or another form of deterministic encoding of the query params.
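A minimal sketch of the deterministic-encoding idea, using an md5 digest of the sorted query parameters. The base URL and the function name feed_id are placeholders, and this is not how the classic API derives its ids.

```python
import hashlib
from urllib.parse import urlencode


def feed_id(base_url, params):
    """Derive a stable feed id from the query parameters.

    The parameters are sorted and URL-encoded into a canonical string,
    so the same query always maps to the same id regardless of the
    order in which the parameters arrived.
    """
    canonical = urlencode(sorted(params.items()))
    digest = hashlib.md5(canonical.encode("utf-8")).hexdigest()
    return f"{base_url}/{digest}"


# Placeholder base URL, for illustration only.
print(feed_id("https://example.org/api", {"query": "all:electron", "start": "0"}))
```

The digest is used here only as an identifier, not for security, so md5 is adequate; a different hash or a base64 encoding of the canonical string would work the same way.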
@erickpeirson Thanks for that suggestion, I’ll definitely try it out. That same idea may allow me to add the attributions for the authors. Feedgen doesn’t provide that ability, but I see how I could extend the ElementTree structure for an author.
From: Erick notifications@github.com Sent: Thursday, April 4, 2019 11:43 AM To: arXiv/arxiv-search arxiv-search@noreply.github.com Cc: Ben Trumbore wbt3@cornell.edu; Mention mention@noreply.github.com Subject: Re: [arXiv/arxiv-search] Atom/XML Serializer (#239)
Based on the recent work in arxiv-rss, I implemented an Atom+XML serializer for the arxiv-search API.
It adds to the work done by @Trumbore in arxiv/arxiv-rss#3:
- Support for <arxiv:affiliation> tags has been added.
Notes on implementation:
- I'm not crazy about the hack in search.controllers.api to add fields to support the Atom feed. It's too much fitting the data to the implementation, especially since we may want to add content negotiation later, in which case we won't want the extra fields, but also won't want to repeat the negotiation in the controller functions.
- search.process.transform._transform_author does not do anything with affiliations. Are we indexing affiliations? If so, once this function is changed, then we'll gain support in the Atom feeds. I've marked the spot with a # TODO: comment.
- arxiv-rss does not implement any of the OpenSearch extensions. @Trumbore might find the plugin in search.api.atom_extensions to be useful.
- I didn't implement an AtomSerializer.serialize_document() yet. I'm debating whether it makes sense to have a single-entry feed, and whether to just cast the single document as a DocumentSet and pass it into the existing serializer or whether it makes sense to have a stripped-down serialize_document feed, like in the JSONSerializer case.
- The xmlns attribute is not repeated on all Atom elements. This results in cleaner XML, smaller file sizes, and should still be standards compliant.
Testing
Fairly straightforward workflow:
docker-compose build && docker-compose up
From the VPN:
Set an Authorization header for 127.0.0.1:5000 via request.ly. The easiest way to generate a token is to use the generate_token.py script in arxiv-auth. Since we're pretty much only dealing with read permissions, it really doesn't matter if you use the defaults or something else in the script.
FLASK_APP=classic_api.py FLASK_DEBUG=1 ELASTICSEARCH_HOST=127.0.0.1 JWT_SECRET=foosecret pipenv run flask run
Some sample queries: