JaimieMurdock commented 5 years ago

Based on the recent work in arxiv-rss, I implemented an Atom+XML serializer for the arxiv-search API.

It adds to the work done by @Trumbore in arxiv/arxiv-rss#3:

support for <arxiv:affiliation> tags has been added.

Notes on implementation:

I'm not crazy about the hack in search.controllers.api to add fields to support the Atom feed. It's too much fitting the data to the implementation, especially since we may want to add content negotiation later, in which case we won't want the extra fields, but also won't want to repeat the negotiation in the controller functions.
search.process.transform._transform_author does not do anything with affiliations. Are we indexing affiliations? If so, once this function is changed, then we'll gain support in the atom feeds. I've marked the spot with a # TODO:
arxiv-rss does not implement any of the OpenSearch extensions. @Trumbore might find the plugin in search.api.atom_extensions to be useful.
The ID schema for the API results isn't yet solidified. In the previous arXiv API, each resultset was given a unique ID so it could be recalled.
I didn't implement an AtomSerializer.serialize_document() yet. I'm debating whether it makes sense to have a single-entry feed, and whether to just cast the single document as a DocumentSet and pass it into the existing serializer or whether it made sense to have a stripped down serialize_document feed, like in the JSONSerializer case.
One change from the XML output of the current "classic" API is that the xmlns attribute is not repeated on all Atom elements. This results in cleaner xml, smaller file sizes, and should still be standards compliant.
Generating URLs for PDF links is not finalized.

Testing

Fairly straightforward workflow:

docker-compose build && docker-compose up

From the VPN:

pipenv install
FLASK_APP=app.py FLASK_DEBUG=1 ELASTICSEARCH_HOST=127.0.0.1 pipenv run python create_index.py
FLASK_APP=app.py FLASK_DEBUG=1 ELASTICSEARCH_HOST=127.0.0.1 pipenv run python bulk_index.py

Add a security token to the Authorization header for 127.0.0.1:5000 via request.ly. The easiest way to generate a token is to use the generate_token.py script in arxiv-auth.

$ cd arxiv-auth
$ git pull
$ pipenv install ./users
$ JWT_SECRET=foosecret pipenv run python generate_token.py

Since we're pretty much only dealing with read permissions, it really doesn't matter if you use the defaults or something else in the script.

Finally, to actually get the app going: FLASK_APP=classic_api.py FLASK_DEBUG=1 ELASTICSEARCH_HOST=127.0.0.1 JWT_SECRET=foosecret pipenv run flask run

Some sample queries:

http://127.0.0.1:5000/query?search_query=all:none (expect 3 results)
http://127.0.0.1:5000/query?search_query=all:none+OR+ti:universes (expect 9 results)
http://127.0.0.1:5000/query?search_query=all:none+AND+ti:universes (expect 0 results)
http://127.0.0.1:5000/query?search_query=all:none&id_list=1710.08254 (expect 1 result)
http://127.0.0.1:5000/query?id_list=1710.08254v1,1610.07526v3 (expect 2 results)

JaimieMurdock commented 5 years ago

I'm not crazy about the hack in search.controllers.api to add fields to support the Atom feed.

Agreed. Per our discussion around #234 (review), my understanding was that you would implement a different domain class to describe classic API requests, that was fitted more closely to the query-string model, and that this would then get handled as a query_string query in search.services.index. But that part may be outside the scope of this PR?

That sounds in-scope for the overall project, and an architectural decision to discuss. I chose to use the same class to describe classic API requests, the primary reason being feature parity between old and new APIs - classic_search does the migration by just renaming search_query to query. This gives us, essentially for free, support for the new API's field-specific query params. It also means we only have one place to handle interfacing with search.services.index in controllers - after classic requests are "translated" to the new-style in the controller, both are using search.controllers.api.search to get result sets.

Are we indexing affiliations?

https://github.com/arXiv/arxiv-search/blob/master/mappings/DocumentMapping.json#L226

I'll go through and find some examples that should have affiliations and use those for testing. I'm going to defer action on affiliations to ARXIVNG-2043.

The ID schema for the API results isn't yet solidified. In the previous arXiv API, each resultset was given a unique ID so it could be recalled.

I'm not familiar with that feature, but my initial reaction is that is sounds stateful and gross. @mhl10 would it rain fire if we said that we weren't going to support that feature anymore?

This is based on the example at https://arxiv.org/help/api - even running the example live doesn't result in a working URL though. Perhaps this should somehow be replaced with a url that reissues the query?

Also the <title> tag in the new method shows the "exploded" parameters, rather than the raw query_string. Just want to make sure that where I go off-script gets another pair of eyes.

      <title xmlns="http://www.w3.org/2005/Atom">ArXiv Query:
    search_query=all:electron&amp;id_list=&amp;start=0&amp;max_results=1</title>
      <id xmlns="http://www.w3.org/2005/Atom">http://arxiv.org/api/cHxbiOdZaP56ODnBPIenZhzg5f8</id>

I didn't implement an AtomSerializer.serialize_document() yet. I'm debating whether it makes sense to have a single-entry feed

If it isn't supported in the classic API, then no need. NotImplementedError looks like the right way to go to me.

As the id_list parameter also gets handled by search.controllers.api.classic_search, I'll leave that NotImplementedError in place.

I'm getting Content-Type: text/html; charset=utf-8 header on the response; at some point will want that to be Content-Type: application/atom+xml; charset=UTF-8. Might be out of scope, but just noticed it when I spun up the examples.

Definitely in scope, will fix.

erickpeirson commented 5 years ago

I'm not crazy about the hack in search.controllers.api to add fields to support the Atom feed.

Agreed. Per our discussion around #234 (review), my understanding was that you would implement a different domain class to describe classic API requests, that was fitted more closely to the query-string model, and that this would then get handled as a query_string query in search.services.index. But that part may be outside the scope of this PR?

That sounds in-scope for the overall project, and an architectural decision to discuss. I chose to use the same class to describe classic API requests, the primary reason being feature parity between old and new APIs - classic_search does the migration by just renaming search_query to query. This gives us, essentially for free, support for the new API's field-specific query params. It also means we only have one place to handle interfacing with search.services.index in controllers - after classic requests are "translated" to the new-style in the controller, both are using search.controllers.api.search to get result sets.

We discussed this on 12 March, and the outcomes of that discussion were:

The current APIQuery model does not work for the classic API, because it does not support the boolean operations required by the classic API.
We will define a new ClassicAPIQuery that does not use FieldedSearchTerms and friends, but instead focuses on the nested query-string structure that can be easily translated from the API query and translated to the ElasticSearch query_string query.

erickpeirson commented 5 years ago

The ID schema for the API results isn't yet solidified. In the previous arXiv API, each resultset was given a unique ID so it could be recalled.

I'm not familiar with that feature, but my initial reaction is that is sounds stateful and gross. @mhl10 would it rain fire if we said that we weren't going to support that feature anymore?

This is based on the example at https://arxiv.org/help/api - even running the example live doesn't result in a working URL though. Perhaps this should somehow be replaced with a url that reissues the query?

Gotcha. I think that your suggestion is a good one -- just return the full URI for the current query. We will need to make it clear in documentation that this does not preserve the result set at the time of the original query (although hopefully that would be obvious).

erickpeirson commented 5 years ago

Also the tag in the new method shows the "exploded" parameters, rather than the raw query_string. Just want to make sure that where I go off-script gets another pair of eyes.</p> </blockquote> <p>Per <a href="https://validator.w3.org/feed/docs/rfc4287.html#element.title">RFC4287§4.2.14</a>:</p> <blockquote> <p>The "atom:title" element is a Text construct that conveys a human-readable title for an entry or feed.</p> </blockquote> <p>So as long as we are conveying the same information to a human reader, it's fine if it doesn't match the legacy API exactly.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/JaimieMurdock"><img src="https://avatars.githubusercontent.com/u/666240?v=4" />JaimieMurdock</a> commented <strong> 5 years ago</strong> </div> <div class="markdown-body"> <blockquote> <blockquote> <p>Also the <title> tag in the new method shows the "exploded" parameters, rather than the raw query_string. Just want to make sure that where I go off-script gets another pair of eyes.</p> </blockquote> <p>Per <a href="https://validator.w3.org/feed/docs/rfc4287.html#element.title">RFC4287§4.2.14</a>:</p> <blockquote> <p>The "atom:title" element is a Text construct that conveys a human-readable title for an entry or feed.</p> </blockquote> <p>So as long as we are conveying the same information to a human reader, it's fine if it doesn't match the legacy API exactly.</p> </blockquote> <p>Can I take even more liberties with this and create something <em>actually</em> human-readable?</p> <p>Old:</p> <blockquote> <p>ArXiv Query: search_query=all:electron&id_list=&start=0&max_results=1</p> </blockquote> <p>New:</p> <blockquote> <p>arXiv Query: size: 50; terms: AND all=none; OR title=universes; include_fields: ['paper_id_v', 'paper_id', 'href', 'canonical', 'version', 'title', 'abstract', 'submitted_date', 'updated_date', 'comments', 'journal_ref', 'doi', 'primary_classification', 'secondary_classification', 'authors']</p> </blockquote> <p>Proposed:</p> <blockquote> <p>arXiv Search: all=none OR title=universes</p> </blockquote> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/erickpeirson"><img src="https://avatars.githubusercontent.com/u/3451594?v=4" />erickpeirson</a> commented <strong> 5 years ago</strong> </div> <div class="markdown-body"> <blockquote> <blockquote> <blockquote> <p>Also the <title> tag in the new method shows the "exploded" parameters, rather than the raw query_string. Just want to make sure that where I go off-script gets another pair of eyes.</p> </blockquote> <p>Per <a href="https://validator.w3.org/feed/docs/rfc4287.html#element.title">RFC4287§4.2.14</a>:</p> <blockquote> <p>The "atom:title" element is a Text construct that conveys a human-readable title for an entry or feed.</p> </blockquote> <p>So as long as we are conveying the same information to a human reader, it's fine if it doesn't match the legacy API exactly.</p> </blockquote> <p>Can I take even more liberties with this and create something <em>actually</em> human-readable?</p> </blockquote> <p>Sounds reasonable to me!</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/erickpeirson"><img src="https://avatars.githubusercontent.com/u/3451594?v=4" />erickpeirson</a> commented <strong> 5 years ago</strong> </div> <div class="markdown-body"> <p>@Trumbore @JaimieMurdock In your <code>extend_atom(self, atom_feed)</code> method, <code>atom_feed</code> is an ElementTree root. So a minimal way to remove the generator element would be:</p> <pre><code class="language-python">atom_feed.remove(atom_feed.find('./generator'))</code></pre> <p>You may want to make that a bit more robust by checking that <code>find()</code> actually returns an element. But that should get you started.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/mhl10"><img src="https://avatars.githubusercontent.com/u/746253?v=4" />mhl10</a> commented <strong> 5 years ago</strong> </div> <div class="markdown-body"> <blockquote> <blockquote> <p>The ID schema for the API results isn't yet solidified. In the previous arXiv API, each resultset was given a unique ID so it could be recalled.</p> </blockquote> <p>I'm not familiar with that feature, but my initial reaction is that is sounds stateful and gross. @mhl10 would it rain fire if we said that we weren't going to support that feature anymore?</p> </blockquote> <p>The <code>id</code> field is part of the Atom spec and simply represents a universally unique identifier for the feed. I don't think we need generate the <code>id</code> value in exactly the same way as in classic, but we should support it. It could be something as simple as b64 encoding or md5sum or other form of deterministic encoding of the query params.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/Trumbore"><img src="https://avatars.githubusercontent.com/u/6631710?v=4" />Trumbore</a> commented <strong> 5 years ago</strong> </div> <div class="markdown-body"> <p>@erickpeirson Thanks for that suggestion, I’ll definitely try it out. That same idea may allow me to add the attributions for the authors. Feedgen doesn’t provide that ability, but I see how I could extend the ElementTree structure for an author.</p> <p>From: Erick <a href="mailto:notifications@github.com">notifications@github.com</a> Sent: Thursday, April 4, 2019 11:43 AM To: arXiv/arxiv-search <a href="mailto:arxiv-search@noreply.github.com">arxiv-search@noreply.github.com</a> Cc: Ben Trumbore <a href="mailto:wbt3@cornell.edu">wbt3@cornell.edu</a>; Mention <a href="mailto:mention@noreply.github.com">mention@noreply.github.com</a> Subject: Re: [arXiv/arxiv-search] Atom/XML Serializer (#239)</p> <p>@Trumbore<a href="https://github.com/Trumbore">https://github.com/Trumbore</a> @JaimieMurdock<a href="https://github.com/JaimieMurdock">https://github.com/JaimieMurdock</a> In your extend_atom(self, atom_feed) method, atom_feed is an ElementTree root. So a minimal way to remove the generator element would be:</p> <p>atom_feed.remove(atom_feed.find('./generator'))</p> <p>You may want to make that a bit more robust by checking that find() actually returns an element. But that should get you started.</p> <p>— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub<a href="https://github.com/arXiv/arxiv-search/pull/239#issuecomment-479951281">https://github.com/arXiv/arxiv-search/pull/239#issuecomment-479951281</a>, or mute the thread<a href="https://github.com/notifications/unsubscribe-auth/AGUxHp-97nO0XuZXVTUQ9dWrXrbf8NFmks5vdh1tgaJpZM4cZi-i">https://github.com/notifications/unsubscribe-auth/AGUxHp-97nO0XuZXVTUQ9dWrXrbf8NFmks5vdh1tgaJpZM4cZi-i</a>.</p> </div> </div> <div class="page-bar-simple"> </div> <div class="footer"> <ul class="body"> <li>© <script> document.write(new Date().getFullYear()) </script> Githubissues.</li> <li>Githubissues is a development platform for aggregating issues.</li> </ul> </div> <script src="https://cdn.jsdelivr.net/npm/jquery@3.5.1/dist/jquery.min.js"></script> <script src="/githubissues/assets/js.js"></script> <script src="/githubissues/assets/markdown.js"></script> <script src="https://cdn.jsdelivr.net/gh/highlightjs/cdn-release@11.4.0/build/highlight.min.js"></script> <script src="https://cdn.jsdelivr.net/gh/highlightjs/cdn-release@11.4.0/build/languages/go.min.js"></script> <script> hljs.highlightAll(); </script> </body> </html>

arXiv / arxiv-search

Atom/XML Serializer #239

Testing

Some sample queries: