andycasey / ads

Python tool for ADS
MIT License

Incorporating highlights into the API #65

Closed jonnybazookatone closed 7 years ago

jonnybazookatone commented 8 years ago

The search engine has the capability of returning highlighted pieces of text for searches, for example:

q='abstract:"Gamma-ray burst"'

When highlighting is requested, Solr returns the highlighted text that caused the document to match:

  "highlighting": {
    "401713": {
      "abstract": [
        "The hypothesis on the <em>γ-ray burst</em> generation in the process of the collapse of surpermassive bodies"
      ]
    },
 ....
 }

The general form is highlighting: {"id": {"field": ["snippet", ...]}}, with one entry per requested highlight field (abstract, title, etc.). A few users have requested access to this.

Proposed API

The highlights are query dependent, so my first thought is to keep them on the SolrQuery class rather than on the Article. Otherwise the Article class would carry state related to its parent query, which it has no concept of. So you could foresee something as simple as:

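class SearchQuery(object):
    def __init__(self):
        self._highlights = {}

    def __next__(self):
        # `response` is the decoded JSON page fetched for this step (fetching elided);
        # Solr returns highlights under the top-level "highlighting" key
        self._highlights = response['highlighting']

    def highlights(self, article):
        # highlights for this article under the current query, or None
        return self._highlights.get(article.id, None)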

and then you would access it via the API as:

>>> q = ads.SearchQuery(q='star', hl=True, hl_fl=['abstract'])
>>> p = list(q)
>>> a = p[0]
>>>
>>> q.highlights(a)
["The hypothesis on the <em>γ-ray burst</em> generation in the process of the collapse of surpermassive bodies"]
>>> 
>>> for article in p:
...     print('bibcode: {}, query: {}, highlights: {}'.format(article.bibcode, q.query, q.highlights(article)))

Alternative options are welcome, such as a highlights class that is filled and attached to the SearchQuery class, or something smarter that still meets the requirements above.
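
The "highlights class" alternative could look roughly like the sketch below. The `Highlights` class, its method names, and the canned `page` dict are all hypothetical and not part of the ads client; the only thing taken from this issue is the shape of Solr's top-level "highlighting" mapping.

```python
class Highlights(object):
    """Hypothetical container mapping article ids to highlight snippets."""

    def __init__(self):
        self._by_id = {}

    def update(self, payload):
        # Merge the top-level "highlighting" mapping of one decoded page.
        self._by_id.update(payload.get("highlighting", {}))

    def for_article(self, article_id, field=None):
        # All highlighted fields for an article, or one field's snippets.
        fields = self._by_id.get(article_id, {})
        if field is None:
            return fields
        return fields.get(field, [])


# Canned page shaped like the Solr response quoted in this issue.
page = {
    "highlighting": {
        "401713": {"abstract": ["the <em>Gamma-ray burst</em> generation"]},
    },
    "response": {"docs": [{"id": "401713"}, {"id": "401714"}]},
}

highlights = Highlights()
highlights.update(page)
abstract_hits = highlights.for_article("401713", field="abstract")
missing_hits = highlights.for_article("401714")
```

A query object would own one such container and call `update` for every page it fetches, so the Article itself stays free of query state.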

Issues with Article class

Just as an FYI. It would be weird to have something like:

>>> q = ads.SearchQuery(q='star', hl=True, hl_fl=['abstract'])
>>> p = list(q)
>>> p[0].highlights
["The hypothesis on the <em>γ-ray burst</em> generation in the process of the collapse of surpermassive bodies"]

as the same article could have many different highlights depending on what the query was, so you'd have to keep track of both the query and the article.

andycasey commented 8 years ago

This would be a great improvement to the client!

Are there any other search terms (i.e., ones that aren't currently accessible from the client) that also have this kind of data structure returned? Accessing the results from SolrQuery rather than Article makes sense from a backend perspective, but if there are other fields that have a similar data structure then it might seem unusual to suggest "use the Article to access all attributes, except if you want X, Y, Z -- then access them from the SolrQuery and match up by Article.id".

An alternative scenario might be to have SearchQuery attach any highlights to the Articles as they are created and have a getter/setter for Article.highlights.

class SolrResponse(APIResponse):
    """
    Base class for storing a solr response
    """

    ...

    @property
    def articles(self):
        """
        articles getter
        """
        if self._articles is None:
            self._articles = []
            for doc in self.docs:
                # ensure all fields in the "fl" are in the doc to address
                # issue #38
                for k in set(self.fl).difference(doc.keys()):
                    doc[k] = None
                article = Article(**doc)
                article._highlights = self.response['highlights'].get(article.id, None)
                self._articles.append(article)
        return self._articles

I don't have a strong opinion on the best way it should be handled -- I was only trying to see if there were similar ways to implement it.
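
For comparison, the getter/setter side of that idea might look like the following. This is only a sketch: the `Article` here is a stripped-down stand-in rather than the client's real class, `attach_highlights` is a hypothetical helper, and it assumes every doc was requested with `id` in `fl` and that highlights arrive under the top-level "highlighting" key.

```python
class Article(object):
    """Hypothetical, stripped-down stand-in for the client's Article."""

    def __init__(self, **kwargs):
        self._highlights = None
        for key, value in kwargs.items():
            setattr(self, key, value)

    @property
    def highlights(self):
        # Highlights attached by the query that created this article.
        return self._highlights

    @highlights.setter
    def highlights(self, value):
        self._highlights = value


def attach_highlights(payload):
    """Build articles from a decoded Solr page and attach their highlights."""
    highlighting = payload.get("highlighting", {})
    articles = []
    for doc in payload["response"]["docs"]:
        article = Article(**doc)
        # None when this document produced no highlight.
        article.highlights = highlighting.get(article.id)
        articles.append(article)
    return articles


payload = {
    "highlighting": {"1732456": {"abstract": ["neutron <em>stars.</em>"]}},
    "response": {"docs": [{"id": "1732456"}, {"id": "1732457"}]},
}
articles = attach_highlights(payload)
```

The trade-off is the one raised above: the highlights stored on each Article are only meaningful for the query that produced it.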

andycasey commented 8 years ago

@jonnybazookatone just to clarify: is the current server API capable of sending back the highlighted information? (e.g., would it be able to send back the information so that it is currently accessible by the client even through .response.response.json())

There are hacks occurring at #dotastro which would benefit from this if it were currently accessible from the server side.

jonnybazookatone commented 8 years ago

Yes, it's currently available from the API. For example:

curl -H 'Authorization: Bearer:TOKEN' 'https://api.adsabs.harvard.edu/v1/search/query?q=star&fl=id&hl=true&hl.fl=title,abstract' | python -m json.tool

{
    "highlighting": {
        "1732456": {
            "abstract": [
                " in the early universe or in the ultra-dense core of neutron <em>stars.</em> The thermal radiation from the quarks"
            ]
        },
    ...
    "response": {
        "docs": [
            {
                "id": "1732456"
            },
    ...
        "numFound": 943061,
        "start": 0
    },
    "responseHeader": {
        "QTime": 252,
        "params": {
            "fl": "id",
            "hl": "true",
            "hl.fl": "title,abstract",
            "q": "star",
            "wt": "json"
        },
        "status": 0
    }
}
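
Once a payload like this has been decoded, pulling out just the matched terms is a small post-processing step. The sketch below is an illustration only: `highlighted_terms` is a hypothetical helper, and it assumes the snippets always mark matches with plain `<em>...</em>` tags as in the response above.

```python
import re

# Non-greedy match so several <em> spans in one snippet are kept separate.
EM_PATTERN = re.compile(r"<em>(.*?)</em>")


def highlighted_terms(payload):
    """Map each document id to the <em>-wrapped terms found per field."""
    terms = {}
    for doc_id, fields in payload.get("highlighting", {}).items():
        terms[doc_id] = {
            field: [
                match
                for snippet in snippets
                for match in EM_PATTERN.findall(snippet)
            ]
            for field, snippets in fields.items()
        }
    return terms


payload = {
    "highlighting": {
        "1732456": {
            "abstract": [
                " in the early universe or in the ultra-dense core of neutron "
                "<em>stars.</em> The thermal radiation from the quarks"
            ]
        }
    }
}
terms = highlighted_terms(payload)
```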

jonnybazookatone commented 8 years ago

To clarify a little: you need to pass hl=true to turn on highlights. Then you can pass hl.fl, a comma-separated list of the fields to highlight, the most useful being hl.fl=title,abstract,body. The response will then contain the snippet that resulted in this document being returned, with the matching word wrapped as <em>word</em>.
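
As a rough illustration of how those parameters fit together, the request URL from the curl example can be assembled with the standard library. `build_highlight_url` and its defaults are hypothetical; only the endpoint and the q, fl, hl and hl.fl parameters come from this thread.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.adsabs.harvard.edu/v1/search/query"


def build_highlight_url(query, fields=("id",), highlight_fields=("title", "abstract")):
    """Return a search URL with highlighting switched on."""
    params = [
        ("q", query),
        ("fl", ",".join(fields)),
        ("hl", "true"),  # turn highlighting on
        ("hl.fl", ",".join(highlight_fields)),  # fields to highlight
    ]
    # safe="," keeps the comma-separated field lists unescaped in the URL.
    return SEARCH_URL + "?" + urlencode(params, safe=",")


url = build_highlight_url("star")
```

The request itself still needs the Authorization header shown in the curl example.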

If you have further questions, open a ticket in the ADS issues, otherwise this issue's gonna get too long :stuck_out_tongue:.

jonnybazookatone commented 7 years ago

I ended up using the initial approach due to time constraints, but I'm happy if it's replaced by your other suggestion. I'll close this ticket for now.

@aaccomazzi you may want to look at #90.

aaccomazzi commented 7 years ago

Looks good Jonny. FYI, there are more fields where highlights are supported. Some other useful ones are ack, aff and author (which allows one to find where in the author list the author you have been looking for appears).

jonnybazookatone commented 7 years ago

Good to know, I'll add those also.