arXiv / arxiv-search

arXiv Search UI & APIs
https://arxiv.github.io/arxiv-search/
MIT License
96 stars 15 forks source link

Atom/XML Serializer #239

Closed JaimieMurdock closed 5 years ago

JaimieMurdock commented 5 years ago

Based on the recent work in arxiv-rss, I implemented an Atom+XML serializer for the arxiv-search API.

It adds to the work done by @Trumbore in arxiv/arxiv-rss#3:

Notes on implementation:

Testing

Fairly straightforward workflow:

From the VPN:

pipenv install
FLASK_APP=app.py FLASK_DEBUG=1 ELASTICSEARCH_HOST=127.0.0.1 pipenv run python create_index.py
FLASK_APP=app.py FLASK_DEBUG=1 ELASTICSEARCH_HOST=127.0.0.1 pipenv run python bulk_index.py
$ cd arxiv-auth
$ git pull
$ pipenv install ./users
$ JWT_SECRET=foosecret pipenv run python generate_token.py

Since we're pretty much only dealing with read permissions, it really doesn't matter if you use the defaults or something else in the script.

Some sample queries:

JaimieMurdock commented 5 years ago

I'm not crazy about the hack in search.controllers.api to add fields to support the Atom feed.

Agreed. Per our discussion around #234 (review), my understanding was that you would implement a different domain class to describe classic API requests, that was fitted more closely to the query-string model, and that this would then get handled as a query_string query in search.services.index. But that part may be outside the scope of this PR?

That sounds in-scope for the overall project, and an architectural decision to discuss. I chose to use the same class to describe classic API requests, the primary reason being feature parity between old and new APIs - classic_search does the migration by just renaming search_query to query. This gives us, essentially for free, support for the new API's field-specific query params. It also means we only have one place to handle interfacing with search.services.index in controllers - after classic requests are "translated" to the new-style in the controller, both are using search.controllers.api.search to get result sets.

Are we indexing affiliations?

https://github.com/arXiv/arxiv-search/blob/master/mappings/DocumentMapping.json#L226

I'll go through and find some examples that should have affiliations and use those for testing. I'm going to defer action on affiliations to ARXIVNG-2043.

The ID schema for the API results isn't yet solidified. In the previous arXiv API, each resultset was given a unique ID so it could be recalled.

I'm not familiar with that feature, but my initial reaction is that is sounds stateful and gross. @mhl10 would it rain fire if we said that we weren't going to support that feature anymore?

This is based on the example at https://arxiv.org/help/api - even running the example live doesn't result in a working URL though. Perhaps this should somehow be replaced with a url that reissues the query?

Also the <title> tag in the new method shows the "exploded" parameters, rather than the raw query_string. Just want to make sure that where I go off-script gets another pair of eyes.

      <title xmlns="http://www.w3.org/2005/Atom">ArXiv Query:
    search_query=all:electron&amp;id_list=&amp;start=0&amp;max_results=1</title>
      <id xmlns="http://www.w3.org/2005/Atom">http://arxiv.org/api/cHxbiOdZaP56ODnBPIenZhzg5f8</id>

I didn't implement an AtomSerializer.serialize_document() yet. I'm debating whether it makes sense to have a single-entry feed

If it isn't supported in the classic API, then no need. NotImplementedError looks like the right way to go to me.

As the id_list parameter also gets handled by search.controllers.api.classic_search, I'll leave that NotImplementedError in place.

I'm getting Content-Type: text/html; charset=utf-8 header on the response; at some point will want that to be Content-Type: application/atom+xml; charset=UTF-8. Might be out of scope, but just noticed it when I spun up the examples.

Definitely in scope, will fix.

erickpeirson commented 5 years ago

I'm not crazy about the hack in search.controllers.api to add fields to support the Atom feed.

Agreed. Per our discussion around #234 (review), my understanding was that you would implement a different domain class to describe classic API requests, that was fitted more closely to the query-string model, and that this would then get handled as a query_string query in search.services.index. But that part may be outside the scope of this PR?

That sounds in-scope for the overall project, and an architectural decision to discuss. I chose to use the same class to describe classic API requests, the primary reason being feature parity between old and new APIs - classic_search does the migration by just renaming search_query to query. This gives us, essentially for free, support for the new API's field-specific query params. It also means we only have one place to handle interfacing with search.services.index in controllers - after classic requests are "translated" to the new-style in the controller, both are using search.controllers.api.search to get result sets.

We discussed this on 12 March, and the outcomes of that discussion were:

erickpeirson commented 5 years ago

The ID schema for the API results isn't yet solidified. In the previous arXiv API, each resultset was given a unique ID so it could be recalled.

I'm not familiar with that feature, but my initial reaction is that is sounds stateful and gross. @mhl10 would it rain fire if we said that we weren't going to support that feature anymore?

This is based on the example at https://arxiv.org/help/api - even running the example live doesn't result in a working URL though. Perhaps this should somehow be replaced with a url that reissues the query?

Gotcha. I think that your suggestion is a good one -- just return the full URI for the current query. We will need to make it clear in documentation that this does not preserve the result set at the time of the original query (although hopefully that would be obvious).

erickpeirson commented 5 years ago

Also the tag in the new method shows the "exploded" parameters, rather than the raw query_string. Just want to make sure that where I go off-script gets another pair of eyes.</p> </blockquote> <p>Per <a href="https://validator.w3.org/feed/docs/rfc4287.html#element.title">RFC4287§4.2.14</a>:</p> <blockquote> <p>The "atom:title" element is a Text construct that conveys a human-readable title for an entry or feed.</p> </blockquote> <p>So as long as we are conveying the same information to a human reader, it's fine if it doesn't match the legacy API exactly.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/JaimieMurdock"><img src="https://avatars.githubusercontent.com/u/666240?v=4" />JaimieMurdock</a> commented <strong> 5 years ago</strong> </div> <div class="markdown-body"> <blockquote> <blockquote> <p>Also the <title> tag in the new method shows the "exploded" parameters, rather than the raw query_string. Just want to make sure that where I go off-script gets another pair of eyes.</p> </blockquote> <p>Per <a href="https://validator.w3.org/feed/docs/rfc4287.html#element.title">RFC4287§4.2.14</a>:</p> <blockquote> <p>The "atom:title" element is a Text construct that conveys a human-readable title for an entry or feed.</p> </blockquote> <p>So as long as we are conveying the same information to a human reader, it's fine if it doesn't match the legacy API exactly.</p> </blockquote> <p>Can I take even more liberties with this and create something <em>actually</em> human-readable?</p> <p>Old:</p> <blockquote> <p>ArXiv Query: search_query=all:electron&id_list=&start=0&max_results=1</p> </blockquote> <p>New:</p> <blockquote> <p>arXiv Query: size: 50; terms: AND all=none; OR title=universes; include_fields: ['paper_id_v', 'paper_id', 'href', 'canonical', 'version', 'title', 'abstract', 'submitted_date', 'updated_date', 'comments', 'journal_ref', 'doi', 'primary_classification', 'secondary_classification', 'authors']</p> </blockquote> <p>Proposed:</p> <blockquote> <p>arXiv Search: all=none OR title=universes</p> </blockquote> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/erickpeirson"><img src="https://avatars.githubusercontent.com/u/3451594?v=4" />erickpeirson</a> commented <strong> 5 years ago</strong> </div> <div class="markdown-body"> <blockquote> <blockquote> <blockquote> <p>Also the <title> tag in the new method shows the "exploded" parameters, rather than the raw query_string. Just want to make sure that where I go off-script gets another pair of eyes.</p> </blockquote> <p>Per <a href="https://validator.w3.org/feed/docs/rfc4287.html#element.title">RFC4287§4.2.14</a>:</p> <blockquote> <p>The "atom:title" element is a Text construct that conveys a human-readable title for an entry or feed.</p> </blockquote> <p>So as long as we are conveying the same information to a human reader, it's fine if it doesn't match the legacy API exactly.</p> </blockquote> <p>Can I take even more liberties with this and create something <em>actually</em> human-readable?</p> </blockquote> <p>Sounds reasonable to me!</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/erickpeirson"><img src="https://avatars.githubusercontent.com/u/3451594?v=4" />erickpeirson</a> commented <strong> 5 years ago</strong> </div> <div class="markdown-body"> <p>@Trumbore @JaimieMurdock In your <code>extend_atom(self, atom_feed)</code> method, <code>atom_feed</code> is an ElementTree root. So a minimal way to remove the generator element would be:</p> <pre><code class="language-python">atom_feed.remove(atom_feed.find('./generator'))</code></pre> <p>You may want to make that a bit more robust by checking that <code>find()</code> actually returns an element. But that should get you started.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/mhl10"><img src="https://avatars.githubusercontent.com/u/746253?v=4" />mhl10</a> commented <strong> 5 years ago</strong> </div> <div class="markdown-body"> <blockquote> <blockquote> <p>The ID schema for the API results isn't yet solidified. In the previous arXiv API, each resultset was given a unique ID so it could be recalled.</p> </blockquote> <p>I'm not familiar with that feature, but my initial reaction is that is sounds stateful and gross. @mhl10 would it rain fire if we said that we weren't going to support that feature anymore?</p> </blockquote> <p>The <code>id</code> field is part of the Atom spec and simply represents a universally unique identifier for the feed. I don't think we need generate the <code>id</code> value in exactly the same way as in classic, but we should support it. It could be something as simple as b64 encoding or md5sum or other form of deterministic encoding of the query params.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/Trumbore"><img src="https://avatars.githubusercontent.com/u/6631710?v=4" />Trumbore</a> commented <strong> 5 years ago</strong> </div> <div class="markdown-body"> <p>@erickpeirson Thanks for that suggestion, I’ll definitely try it out. That same idea may allow me to add the attributions for the authors. Feedgen doesn’t provide that ability, but I see how I could extend the ElementTree structure for an author.</p> <p>From: Erick <a href="mailto:notifications@github.com">notifications@github.com</a> Sent: Thursday, April 4, 2019 11:43 AM To: arXiv/arxiv-search <a href="mailto:arxiv-search@noreply.github.com">arxiv-search@noreply.github.com</a> Cc: Ben Trumbore <a href="mailto:wbt3@cornell.edu">wbt3@cornell.edu</a>; Mention <a href="mailto:mention@noreply.github.com">mention@noreply.github.com</a> Subject: Re: [arXiv/arxiv-search] Atom/XML Serializer (#239)</p> <p>@Trumbore<a href="https://github.com/Trumbore">https://github.com/Trumbore</a> @JaimieMurdock<a href="https://github.com/JaimieMurdock">https://github.com/JaimieMurdock</a> In your extend_atom(self, atom_feed) method, atom_feed is an ElementTree root. So a minimal way to remove the generator element would be:</p> <p>atom_feed.remove(atom_feed.find('./generator'))</p> <p>You may want to make that a bit more robust by checking that find() actually returns an element. But that should get you started.</p> <p>— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub<a href="https://github.com/arXiv/arxiv-search/pull/239#issuecomment-479951281">https://github.com/arXiv/arxiv-search/pull/239#issuecomment-479951281</a>, or mute the thread<a href="https://github.com/notifications/unsubscribe-auth/AGUxHp-97nO0XuZXVTUQ9dWrXrbf8NFmks5vdh1tgaJpZM4cZi-i">https://github.com/notifications/unsubscribe-auth/AGUxHp-97nO0XuZXVTUQ9dWrXrbf8NFmks5vdh1tgaJpZM4cZi-i</a>.</p> </div> </div> <div class="page-bar-simple"> </div> <div class="footer"> <ul class="body"> <li>© <script> document.write(new Date().getFullYear()) </script> Githubissues.</li> <li>Githubissues is a development platform for aggregating issues.</li> </ul> </div> <script src="https://cdn.jsdelivr.net/npm/jquery@3.5.1/dist/jquery.min.js"></script> <script src="/githubissues/assets/js.js"></script> <script src="/githubissues/assets/markdown.js"></script> <script src="https://cdn.jsdelivr.net/gh/highlightjs/cdn-release@11.4.0/build/highlight.min.js"></script> <script src="https://cdn.jsdelivr.net/gh/highlightjs/cdn-release@11.4.0/build/languages/go.min.js"></script> <script> hljs.highlightAll(); </script> </body> </html>