adsabs / oracle_service

Service that suggests articles of interest to the reader, or advises on and scores the similarity between two records
MIT License

oracle_service

ADS Oracle Service

Short Summary

Service that suggests articles of interest to a reader, based either on the reader's own history or on what other readers with similar interests are currently reading. In addition, when both the arXiv version and the publisher version of an article become available, this service advises on and scores their similarity for possible merging. These services are mostly for use by ADS.

Setup (recommended)

$ virtualenv python
$ source python/bin/activate
$ pip install -r requirements.txt
$ pip install -r dev-requirements.txt
$ vim local_config.py # edit, edit

Testing

On your desktop run:

$ py.test

API

Make a POST request to get recommendations:

To get recommendations, send a POST request to the readhist endpoint:

https://api.adsabs.harvard.edu/v1/oracle/readhist

Within the POST, the payload can have the following optional parameters: function, sort, num_docs, cutoff_days, and top_n_reads.

For example, the payload

{"function": "similar", "sort": "entry_date", "num_docs": 5, "cutoff_days": 5, "top_n_reads": 10}

produces the following query and returns it together with the recommended bibcodes:

{"bibcodes": "...", "query": "(similar(topn(10, reader:<reader>, entry_date desc)) entdate:[NOW-5DAYS TO *])"}

reader is a 16-digit anonymous user id that is extracted from the session id. For this reason, when calling this endpoint with curl, use your web API token.

To call the oracle service /readhist endpoint:

curl -H "Authorization: Bearer <your web API token>" -H "Content-Type: application/json" -X POST -d '{"function": "similar", "sort": "entry_date", "num_docs": 5, "cutoff_days": 5, "top_n_reads": 10}' https://api.adsabs.harvard.edu/v1/oracle/readhist
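
The same call can be made from Python. The following is a minimal sketch using only the standard library; the helper name build_readhist_request is hypothetical and not part of this service. It only builds the request, so nothing is sent until the request is passed to urllib.request.urlopen.

```python
import json
import urllib.request

READHIST_URL = "https://api.adsabs.harvard.edu/v1/oracle/readhist"
# Optional parameters listed above for the readhist payload.
ALLOWED_PARAMS = {"function", "sort", "num_docs", "cutoff_days", "top_n_reads"}


def build_readhist_request(token, **params):
    """Build (without sending) the POST request for /readhist.

    Hypothetical helper for illustration; not part of oracle_service.
    """
    unknown = set(params) - ALLOWED_PARAMS
    if unknown:
        raise ValueError(f"unsupported parameters: {sorted(unknown)}")
    return urllib.request.Request(
        READHIST_URL,
        data=json.dumps(params).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_readhist_request(
    "<your web API token>",
    function="similar", sort="entry_date",
    num_docs=5, cutoff_days=5, top_n_reads=10,
)
# To send it:
#     with urllib.request.urlopen(request) as response:
#         recommendations = json.load(response)
```

Rejecting unknown parameter names locally is a design choice of this sketch; how the service itself treats unexpected keys is not stated here.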

Make a GET request for recommendations with the function and reader params:

curl -H "Authorization: Bearer <your web API token>" -X GET https://api.adsabs.harvard.edu/v1/oracle/<function>/<reader>

Note that the reader must be known in order to use the GET; it is extracted from the session id of the POST and returned in the response.
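
As a rough sketch of chaining the two calls, the snippet below pulls a reader id out of a returned query string and builds the GET URL. It assumes the reader id appears after "reader:" in the response's query field, as the readhist example suggests; the exact response field carrying the reader is not confirmed here. Both function names and the reader id used are placeholders.

```python
import re
from urllib.parse import quote

ORACLE_URL = "https://api.adsabs.harvard.edu/v1/oracle"


def reader_from_query(query):
    """Return the value following 'reader:' in a query string, or None.

    Assumption: the POST response exposes the reader id inside its
    "query" field, in the form shown in the readhist example.
    """
    match = re.search(r"reader:([^,\s)]+)", query)
    return match.group(1) if match else None


def recommendation_url(function, reader):
    """Build the GET URL <base>/<function>/<reader>."""
    return f"{ORACLE_URL}/{quote(function, safe='')}/{quote(reader, safe='')}"


# Placeholder reader id, for illustration only.
query = ("(similar(topn(10, reader:0123456789abcdef, entry_date desc)) "
         "entdate:[NOW-5DAYS TO *])")
reader = reader_from_query(query)
url = recommendation_url("similar", reader)
```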

Make a POST request to get the matched bibcode:

To find the match for a publisher paper given the metadata of an arXiv paper, or vice versa, send a POST request to the matchdoc endpoint:

https://api.adsabs.harvard.edu/v1/oracle/matchdoc

Within the POST, the payload must include either abstract or title, together with author and year; if there is a doi, it can be included as well.
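
The requirement above can be checked before sending anything. This is a minimal sketch; build_matchdoc_payload is a hypothetical helper, and the validation reflects only the rule stated here, not the service's own checks.

```python
import json


def build_matchdoc_payload(author, year, abstract=None, title=None, doi=None):
    """Assemble a /matchdoc payload, enforcing the required fields.

    Hypothetical helper for illustration; not part of oracle_service.
    """
    if not (abstract or title):
        raise ValueError("either abstract or title is required")
    if not author or not year:
        raise ValueError("author and year are required")
    payload = {"author": author, "year": str(year)}
    if abstract:
        payload["abstract"] = abstract
    if title:
        payload["title"] = title
    if doi:
        payload["doi"] = doi  # optional
    return payload


body = json.dumps(build_matchdoc_payload(
    author="<comma separated author list>",
    year="<year>",
    title="<title text>",
))
```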

To call the oracle service /matchdoc endpoint:

curl -H "Authorization: Bearer <your API token>" -H "Content-Type: application/json" -X POST -d '{"abstract": "<abstract text>", "title": "<title text>", "author": "<comma separated author list>", "year": "<year>", "doi": "<doi>"}' https://api.adsabs.harvard.edu/v1/oracle/matchdoc

The API then responds in JSON with the query that was executed and the list of bibcodes, each with its corresponding confidence, a matched label (1 if the system considers the two bibcodes to be a match, 0 if not), and similarity scores between the abstract, title, author, and year of the two bibcodes.

{"query": "...", "match": [{"source_bibcode": "...", "matched_bibcode": "...", "confidence": 0.9140091, "matched": 1, "scores": {"abstract": 0.78, "title": 0.93, "author": 1, "year": 1}}]}
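
A response of this shape can be filtered for the pairs the system labels as a match. The sketch below reuses the example response, with its elided values left as-is; the function name and the min_confidence threshold are illustrative additions, not part of the API.

```python
import json

# The example response from this section, elided values unchanged.
response_text = (
    '{"query": "...", "match": [{"source_bibcode": "...", '
    '"matched_bibcode": "...", "confidence": 0.9140091, "matched": 1, '
    '"scores": {"abstract": 0.78, "title": 0.93, "author": 1, "year": 1}}]}'
)


def confirmed_matches(response, min_confidence=0.0):
    """Return (source, matched, confidence) for entries labeled as a match."""
    return [
        (m["source_bibcode"], m["matched_bibcode"], m["confidence"])
        for m in response.get("match", [])
        if m["matched"] == 1 and m["confidence"] >= min_confidence
    ]


matches = confirmed_matches(json.loads(response_text), min_confidence=0.9)
```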

Add records to the db (internal use only):

curl -H "Authorization: Bearer <your API token>" -X PUT https://api.adsabs.harvard.edu/v1/oracle/add -d @docMatchRecordList.json -H "Content-Type: application/json"

where docMatchRecordList.json contains data in the format of the protobuf structure DocMatchRecordList. See https://github.com/adsabs/ADSPipelineMsg/blob/master/specs/docmatch.proto for the specification.

Note that the add endpoint is called from the docmatching script, https://github.com/adsabs/docmatch_scripts/blob/master/to_oracle.py, which takes as input a tab-delimited text file with two bibcode columns and an optional confidence score column; the data structure is then created and submitted to this endpoint.
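
To illustrate the input format only, here is a minimal sketch that reads such a tab-delimited file into a list of records. It is not the to_oracle.py implementation. The dictionary keys are hypothetical, borrowed from the field names in the matchdoc response; consult docmatch.proto for the actual DocMatchRecordList fields.

```python
import csv
import io


def read_match_pairs(lines):
    """Parse tab-delimited rows: two bibcodes and an optional confidence."""
    records = []
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        # Key names are assumptions, not taken from docmatch.proto.
        record = {
            "source_bibcode": row[0].strip(),
            "matched_bibcode": row[1].strip(),
        }
        if len(row) > 2 and row[2].strip():
            record["confidence"] = float(row[2])
        records.append(record)
    return records


# One row, using the bibcode pair shown in the query example of this README.
sample = io.StringIO("2021arXiv210312030S\t2021CSF...15311505S\t0.9922985\n")
records = read_match_pairs(sample)
```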

Delete records from the db (internal use only):

curl -H "Authorization: Bearer <your API token>" -X DELETE https://api.adsabs.harvard.edu/v1/oracle/delete -d @docMatchRecordList.json -H "Content-Type: application/json"

Query db:

To query the matched documents database, send a POST request to the query endpoint:

curl -H "Authorization: Bearer <your API token>" -X POST https://api.adsabs.harvard.edu/v1/oracle/query

and the API then responds in JSON with a list of matched bibcodes and their corresponding confidence values.

{"params": {"rows": 2000, "start": 0, "date_cutoff": "1972-01-01 00:00:00+00:00"}, "results": [["2021arXiv210312030S", "2021CSF...15311505S", 0.9922985], ...]}

Note that the parameters are optional; if they are omitted, the first 2000 records are returned. To specify a range of records, include start and rows:

curl -H "Authorization: Bearer <your API token>" -H "Content-Type: application/json" -X POST -d '{"rows":2, "start":0}' https://api.adsabs.harvard.edu/v1/oracle/query

The query can also be filtered to a number of most recent days:

curl -H "Authorization: Bearer <your API token>" -H "Content-Type: application/json" -X POST -d '{"rows":2, "start":0, "days":3}' https://api.adsabs.harvard.edu/v1/oracle/query

returning

{"params": {"rows": 2, "start": 0, "days": 3, "date_cutoff": "2022-08-25 19:06:15.359667+00:00"}, "results": [...]}
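
The date_cutoff in this response is consistent with subtracting days from the current UTC time. The sketch below shows that arithmetic; it is an inference from the example, not the service's confirmed logic, and the fixed "now" is chosen only so that the result reproduces the value shown.

```python
from datetime import datetime, timedelta, timezone


def date_cutoff(days, now=None):
    """Return the UTC timestamp `days` days before `now` (assumed rule)."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - timedelta(days=days)


# A fixed "now", three days after the cutoff in the example response.
now = datetime(2022, 8, 28, 19, 6, 15, 359667, tzinfo=timezone.utc)
cutoff = str(date_cutoff(3, now))
```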

Also note that there is a limit of 2000 records per call; if rows is set to a larger value, still only 2000 records are returned.
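
Because of this limit, retrieving more than 2000 records means paging with start and rows. The following is a minimal sketch of that loop. It assumes start is a record offset, and fetch_page is a hypothetical callable that POSTs the payload to the query endpoint and returns the decoded JSON response.

```python
MAX_ROWS = 2000  # per-call limit stated above


def iter_matches(fetch_page, rows=MAX_ROWS, days=None):
    """Yield every result, requesting successive pages until one is short.

    fetch_page(payload) is a hypothetical callable that sends the payload
    to /query and returns the decoded JSON response as a dict.
    """
    rows = max(1, min(rows, MAX_ROWS))  # the service caps rows at 2000
    start = 0
    while True:
        payload = {"rows": rows, "start": start}
        if days is not None:
            payload["days"] = days
        results = fetch_page(payload).get("results", [])
        yield from results
        if len(results) < rows:
            break  # a short (or empty) page means nothing is left
        start += rows
```

Passing the fetch step in as a callable keeps the paging logic independent of the HTTP client, so it can be exercised with a stub before being pointed at the real endpoint.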

Cleanup db (internal use only):

curl -H "Authorization: Bearer <your API token>" -X GET https://api.adsabs.harvard.edu/v1/oracle/cleanup

List tmp bibcodes (internal use only):

curl -H "Authorization: Bearer <your API token>" -X GET https://api.adsabs.harvard.edu/v1/oracle/list_tmps

List multi matches (internal use only):

curl -H "Authorization: Bearer <your API token>" -X GET https://api.adsabs.harvard.edu/v1/oracle/list_multis

Maintainers

Golnaz