SolarArbiter / solarforecastarbiter-api

HTTP API and database schema for the Solar Forecast Arbiter
https://api.solarforecastarbiter.org
MIT License

Add obs data cloning job #324

Closed · lboeman closed 2 years ago

lboeman commented 2 years ago

This still needs tests, but the basic functionality is there. This adds a job that copies data from one observation to another on a cron schedule. Jobs are set up via the admin cli and must be manually supplied with an appropriate cron string, which should be the interval length of the observation (see the illustrative cron strings after the list below). The motivation is to support forecast trials where we want to supply observation data to a trial organization only for the period of the trial.

Expected functionality is:

- Requires a job execution user with access to read_values on the copy_from observation, and read_values and write_values on the copy_to observation.
- Fetch the last value of the copy_to observation, and fetch data for the copy_from observation from that timestamp + 1 minute to now.
- Reset quality flags so that the API will accept the upload.
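To make the cron string concrete, a couple of illustrative schedules (the specific interval lengths are made up for the example, not taken from this PR):

```
*/5 * * * *   # 5-minute interval_length observation: copy every 5 minutes
0 * * * *     # 60-minute interval_length observation: copy on the hour
```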

@wholmgren Do you think there's any other functionality we should cover here? I'm still not sure this belongs in the API, but if we add it to core it might need a new job_helpers module or something. If there's nothing to change, I'll go ahead with adding tests.

wholmgren commented 2 years ago

> ...and must be manually supplied with an appropriate cron string, which should be the interval length of the observation.

This could introduce a lot of latency if we're not careful about the timing. For example, say a user posts hourly data at 1 minute past the hour, but we try to copy data on the hour. The data will be 59 minutes old when we do copy it. Am I missing something? Should we try to copy more frequently?

> Fetch the last value of the copy_to observation, and fetch data for the copy_from observation from that timestamp + 1 minute to now.

In principle the data provider can modify existing values. In practice that probably doesn't occur very often. Perhaps most likely would be backfilling data caused by a system outage on their end or our end. Should we create another job to ensure consistency? Is it possible to quickly compute hashes of the time series so we know if we need to copy more than the last N points? Tell data providers that they should let us know if they backfill data so we can do something manually?
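A minimal sketch of the hashing idea, assuming the windows come back as pandas DataFrames (as APISession.get_observation_values returns); `series_digest` is a hypothetical helper for illustration, not anything in the codebase:

```python
import hashlib

import pandas as pd


def series_digest(values: pd.DataFrame) -> str:
    """Fingerprint a window of observation values.

    Equal digests for the copy_from and copy_to windows mean nothing
    beyond the usual last-N-points update needs to be copied.
    """
    # per-row 64-bit hashes of the columns plus the index...
    row_hashes = pd.util.hash_pandas_object(values, index=True)
    # ...rolled up into one stable digest for cheap comparison
    return hashlib.sha256(row_hashes.values.tobytes()).hexdigest()
```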

> Reset quality flags so that the API will accept the upload.

Don't want to reset user flagged.

lboeman commented 2 years ago

> This could introduce a lot of latency if we're not careful about the timing. For example, say a user posts hourly data at 1 minute past the hour, but we try to copy data on the hour. The data will be 59 minutes old when we do copy it. Am I missing something? Should we try to copy more frequently?

Good point. I'll add a note to the cli that the interval length is a good starting point for short intervals, but the data provider's posting schedule should be considered for long-interval data (e.g. the offset schedules sketched below).
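For the hourly example above, one hedged way to account for the posting schedule (the offsets here are illustrative):

```
1 * * * *    # provider posts exactly on the hour: copy at 1 minute past
5 * * * *    # provider posts ~1 minute past the hour: copy at 5 past for slack
```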

> In principle the data provider can modify existing values. In practice that probably doesn't occur very often. Perhaps most likely would be backfilling data caused by a system outage on their end or our end. Should we create another job to ensure consistency? Is it possible to quickly compute hashes of the time series so we know if we need to copy more than the last N points? Tell data providers that they should let us know if they backfill data so we can do something manually?

I think, at least for now, we should do this manually and tell data providers to let us know when they backfill data. It seems a little tricky to do, and it has the potential to add a lot of overhead for short-interval observations over long periods, for example trying to fetch a long stretch of minute-resolution data every minute. With a lot of those, we could easily keep a worker spinning on just keeping observations in sync.

I'll add a manual copy command to help facilitate this.

> Don't want to reset user flagged.

Good catch, I will update this.

lboeman commented 2 years ago

I'm going to walk back adding an additional helper for backfilling data here. It seems like something that should be added to core's solararbiter cli, if it's really needed at all: the admin cli works with scheduled jobs, and this would be a one-off command where the overhead of storing refresh tokens and such would make it more complicated than it needs to be. A script to do the cloning would be something like:

```python
from solarforecastarbiter.io import APISession, request_cli_access_token

# observation ids and the period to copy are filled in by hand
copy_from = <uuid>
copy_to = <uuid>

token = request_cli_access_token(<username>, <password>)
session = APISession(token)

values = session.get_observation_values(copy_from, <start>, <end>)
# quality_flag is a bitmask; keep only the user-flagged bit
values['quality_flag'] = values['quality_flag'] & 1  # maintain user flagged
session.post_observation_values(copy_to, values)
```
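As a sanity check on the `& 1` line: quality_flag is a bitmask, and assuming the Solar Forecast Arbiter convention that bit 0 is the user-flagged bit, masking with 1 drops every automated flag while preserving a user flag:

```python
flag = 0b1011         # user flagged plus two automated validation flags
assert flag & 1 == 1  # only the user-flagged bit survives the mask
```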

If that's something that would be good to have as a cli command, I can make a core PR. I think this one should be ready for review.

wholmgren commented 2 years ago

Sounds good. Can you move that comment to an issue in core? Unsure if we'll do it but at least it won't be lost.

wholmgren commented 2 years ago

Looks close!