Web API based data sources normally have quota limits on the number of API calls you can make in a given period.
For SCOPUS this means:
- Query granularity is monthly.
- 20,000 API calls per 7-day window, per key.
- Each API call returns at most 25 results.
- Each query can chain calls up to a maximum of 5,000 returned results.
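These constraints imply a hard per-query ceiling: at most 5,000 results, i.e. ceil(5,000 / 25) = 200 calls to page through a maximal query. A minimal sketch of the arithmetic (constant names are mine, values come from the limits above):

```python
import math

SCOPUS_CALLS_PER_WEEK = 20_000       # per key, per 7-day window
SCOPUS_RESULTS_PER_CALL = 25
SCOPUS_MAX_RESULTS_PER_QUERY = 5_000


def scopus_calls_needed(num_results: int) -> int:
    """API calls needed to page through a single query's result set."""
    if num_results > SCOPUS_MAX_RESULTS_PER_QUERY:
        raise ValueError("Query must be split: exceeds the 5,000-result cap")
    return math.ceil(num_results / SCOPUS_RESULTS_PER_CALL)


# A query at the cap costs 200 calls, so one key supports at most
# 20,000 / 200 = 100 maximal queries per 7-day window.
```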
For WoS:
- Query granularity allows date ranges with daily precision.
- Each API call returns at most 100 results.
- It is unclear what the maximum number of returnable results per query is; the documentation mostly stipulates rates of API hits.
- The threshold behind the "amount requested exceeds limit of per period" error is determined by the licence agreement.
The optimisation problem for SCOPUS is well defined and observatory_platform.utils.ScheduleOptimiser solves for those constraints.
On the face of it, one solution to the WoS optimisation problem is to fetch the entire record set with a single query. A test fetch indicates that a single query can pull a reasonable amount of data: for example, it retrieved almost four years (2017-01-01 to 2020-10-01) of Curtin records, between 14,400 and 14,500 in total. However, a single query is not optimal when there are many records and a download might be interrupted, since an interruption forces a full re-query. So it may make sense to solve the same minimisation problem used for SCOPUS for WoS as well. On the other hand, if the preferred strategy is simply to do bulk downloads over larger periods, e.g., a few years at a time, minimising API calls becomes less useful: the larger the period ranges, the smaller the possible saving as a proportion of total calls.
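The SCOPUS-style minimisation can be illustrated with a toy dynamic program. This is not the ScheduleOptimiser implementation, just a sketch of the problem it solves: merge consecutive months into query windows, subject to the 5,000-result cap, so as to minimise total API calls. It assumes no single month exceeds the cap.

```python
import math

MAX_RESULTS_PER_QUERY = 5_000
RESULTS_PER_CALL = 25


def min_api_calls(monthly_counts):
    """Minimum API calls to cover all months, merging consecutive months
    into single query windows subject to the per-query result cap.

    Returns (total_calls, list of (start, end) month index pairs).
    Assumes no single month's count exceeds the cap.
    """
    n = len(monthly_counts)
    best = [0] + [math.inf] * n      # best[i]: min calls for the first i months
    cut = [0] * (n + 1)
    for i in range(1, n + 1):
        total = 0
        for j in range(i, 0, -1):    # candidate window covering months j..i
            total += monthly_counts[j - 1]
            if total > MAX_RESULTS_PER_QUERY:
                break
            # Even an empty window costs one call to discover it is empty.
            calls = max(1, math.ceil(total / RESULTS_PER_CALL))
            if best[j - 1] + calls < best[i]:
                best[i] = best[j - 1] + calls
                cut[i] = j - 1
    # Reconstruct the chosen query windows.
    windows, i = [], n
    while i > 0:
        windows.append((cut[i], i - 1))
        i = cut[i]
    return best[n], list(reversed(windows))
```

Merging helps because each query pays the ceiling overhead once: two months of 10 results each cost two calls separately, but only one call when merged.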
Timing of dag runs
Perhaps it makes sense to pull only full months' worth of records, to make optimisation more consistent. For example, if a dag run happened on 15 Oct, it would treat 30 Sept as the end date for the record pull (at least in the SCOPUS situation).
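Clamping a run date to the end of the last completed month is a one-liner with the standard library; a minimal helper sketch (the function name is mine):

```python
from datetime import date, timedelta


def last_full_month_end(run_date: date) -> date:
    """End date for a record pull: the last day of the most recently
    completed month before run_date's own month."""
    first_of_month = run_date.replace(day=1)
    return first_of_month - timedelta(days=1)


# e.g. a dag run on 15 Oct 2020 pulls records up to 30 Sept 2020:
# last_full_month_end(date(2020, 10, 15)) -> date(2020, 9, 30)
```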
What needs doing
For each telescope:
1. Check the database for a previously optimised schedule and accompanying meta information.
2. If there is a saved schedule, download it.
3. If there is a historic schedule, add the missing periods to it with fine granularity, e.g., one per month; otherwise generate a fresh schedule with fine granularity.
4. Download using this schedule.
5. Generate a new histogram for a fresh schedule, or amend the cached one when extending a historic schedule.
6. Run full or partial (re)optimisation on the schedule.
7. Save the new schedule and meta information in the database.
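The steps above can be sketched as a runnable toy. All names here are hypothetical stand-ins, not existing code; in particular, the `db` dict stands in for the database component that does not exist yet, the download step is stubbed, and the (re)optimisation step is only marked by a comment:

```python
def run_telescope(telescope: str, db: dict, months_due: list) -> dict:
    """Toy sketch of one telescope run against an in-memory 'database'."""
    state = db.get(telescope)
    if state is None:
        # No historic schedule: start fresh at fine (monthly) granularity.
        schedule = list(months_due)
        histogram = {m: 0 for m in schedule}
    else:
        # Extend the historic schedule and histogram with missing months.
        schedule = state["schedule"] + [m for m in months_due
                                        if m not in state["schedule"]]
        histogram = dict(state["histogram"])
        histogram.update({m: 0 for m in months_due if m not in histogram})
    # Download using this schedule (stub: pretend each month yields 10 records).
    for month in schedule:
        histogram[month] = 10
    # Full or partial (re)optimisation would merge months here,
    # subject to the quota constraints.
    db[telescope] = {"schedule": schedule, "histogram": histogram}
    return db[telescope]
```

A second run with new months then extends the cached state instead of starting over, which is the behaviour the checklist describes.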
Blocking issue
The database component is missing. From the discussions it sounds like it is earmarked for development in the near future by @jdddog.