ananelson / oacensus

http://ananelson.github.io/oacensus

Incorporate polling and delay in OAG scraper? #9

Open cameronneylon opened 10 years ago

cameronneylon commented 10 years ago

The Open Article Gauge updates the results for a given POST request asynchronously. The immediately returned JSON will only include objects for which a license is already cached by OAG. The delay for obtaining license information for a large set of previously unseen DOIs can be substantial (minutes to hours).

Not sure how this should be managed in practice, but some sort of delay or polling until the full set returns may be necessary. When fully populated, the returned JSON object should include some information for every DOI. It may be effective to suggest that the user re-run the data gathering after a delay, until the full set is returned.
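A rough sketch of what the polling could look like, not a worked-out design. `fetch` is a hypothetical callable (not the actual oacensus or OAG API) that POSTs a list of DOIs and returns a dict mapping each already-resolved DOI to its license info. The `interval` and `max_rounds` defaults are placeholders.

```python
import time


def poll_until_complete(dois, fetch, interval=30.0, max_rounds=10, sleep=time.sleep):
    """Re-query until every DOI has a result or max_rounds is used up.

    `fetch` is a hypothetical callable: it takes a list of DOIs and returns
    a dict {doi: license_info} containing only the DOIs resolved so far.
    Returns (results, pending), where pending lists DOIs still unresolved.
    """
    results = {}
    pending = list(dois)
    for round_number in range(max_rounds):
        results.update(fetch(pending))
        # Only ask again for DOIs that have not come back yet.
        pending = [doi for doi in pending if doi not in results]
        if not pending:
            break
        if round_number < max_rounds - 1:
            sleep(interval)  # give the service time to populate its cache
    return results, pending
```

Returning the `pending` list alongside the results lets the caller tell the user which DOIs are still missing and suggest a later re-run.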

cameronneylon commented 10 years ago

An additional issue for very large sets of queries is that the scraper appears to overload the OAG service. The result is a null return, which raises a ValueError that isn't caught.

It seems easy enough to solve this by backing off and handling the failures gracefully, but it's worth considering whether a solution to this issue can be tied to the polling, so as to populate the full set of responses.
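Something along these lines is what I mean by backing off, as a sketch only. `fetch` is again a hypothetical callable standing in for the scraper's request, assumed to raise ValueError when the service returns nothing parseable. The attempt count and delays are placeholder values.

```python
import time


def fetch_with_backoff(fetch, dois, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(dois), doubling the wait after each ValueError.

    `fetch` is a hypothetical callable assumed to raise ValueError on an
    empty or unparseable response. Returns an empty dict if every attempt
    fails, so the caller can treat that as "nothing resolved this time".
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(dois)
        except ValueError:
            if attempt == max_attempts:
                return {}  # give up quietly instead of crashing the run
            sleep(delay)
            delay *= 2  # exponential backoff: 1s, 2s, 4s, ...
```

Because a failed call comes back as an empty result rather than an exception, the same loop that waits for missing DOIs could also absorb overload failures.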

cameronneylon commented 10 years ago

I've got code that sort of does this now. Will send a pull request in a bit. Policy/UI questions before that, though.

  1. Is it better to time out after a number of attempts to retrieve a result for a given DOI, or after a specified time?
  2. For things that may take a while to retrieve a single record, should we provide a means for the user to break out when they feel they've got enough? If so, what's the best means of doing that?
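To make the two questions concrete, here is a sketch that supports both kinds of limit and treats Ctrl-C as the break-out. It is one possible shape, not what the pull request does. `fetch_one` is a hypothetical callable that returns None while the record is not yet available, and all names and defaults here are illustrative.

```python
import time


def poll_with_limits(doi, fetch_one, max_attempts=None, max_seconds=None,
                     interval=5.0, sleep=time.sleep, clock=time.monotonic):
    """Poll for one DOI, stopping on an attempt limit, a time limit, or Ctrl-C.

    `fetch_one` is a hypothetical callable returning None until the record
    is available. Returns (result, reason), where reason is one of
    "ok", "max_attempts", "max_seconds", or "interrupted".
    """
    start = clock()
    attempts = 0
    try:
        while True:
            result = fetch_one(doi)
            attempts += 1
            if result is not None:
                return result, "ok"
            if max_attempts is not None and attempts >= max_attempts:
                return None, "max_attempts"
            if max_seconds is not None and clock() - start >= max_seconds:
                return None, "max_seconds"
            sleep(interval)
    except KeyboardInterrupt:
        # The user decided they have enough; stop without a traceback.
        return None, "interrupted"
```

With both limits left as None, the loop runs until a result arrives or the user interrupts, so a real default would probably set at least one of them.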