NOAA-OWP / wres

Code and scripts for the Water Resources Evaluation Service

As a developer, I want to determine the viability of storing project output in the database #176

Open epag opened 2 months ago

epag commented 2 months ago

Author Name: Chris (Chris) Original Redmine Issue: 45466, https://vlab.noaa.gov/redmine/issues/45466 Original Date: 2018-01-11


Anything that can be stored in CSV files may be modeled and queried in the database. It's been discussed whether we should store project outputs in the database. Currently, we could easily store scalar results (we would form the insert statements just like we form the statements that write the CSV files). We are not yet saving vector results, so we'd need to wait to see how those will be modeled; chances are that we'd be able to store those in the database as well.

Given the attached schema, we would be able to represent the output:

| lid | variable_name | earliest_date | latest_date | earliest_lead | latest_lead | metric | output_value | unit_name |
|---|---|---|---|---|---|---|---|---|
| DRRC2 | streamflow | 4713 BC | 294276 AD | 1 | 1 | MEAN SQUARE ERROR All data | 0.048554 | CMS |
| DRRC2 | streamflow | 4713 BC | 294276 AD | 2 | 2 | MEAN SQUARE ERROR All data | 0.114108 | CMS |
| DRRC2 | streamflow | 4713 BC | 294276 AD | 3 | 3 | MEAN SQUARE ERROR All data | 0.178154 | CMS |
| DRRC2 | streamflow | 4713 BC | 294276 AD | 4 | 4 | MEAN SQUARE ERROR All data | 0.236825 | CMS |
| ... | | | | | | | | |

With the query:

```sql
SELECT  lid,
        variable_name,
        earliest_date,
        latest_date,
        earliest_lead,
        latest_lead,
        metric,
        output_value,
        unit_name
FROM results.ProjectRun PR
INNER JOIN results.ProjectOutput PO
    ON PO.projectrun_id = PR.projectrun_id
INNER JOIN results.ProjectOutputValue POV
    ON POV.projectoutput_id = PO.projectoutput_id
INNER JOIN wres.VariableByFeature VBF
    ON VBF.variableposition_id = PO.variableposition_id
INNER JOIN wres.MeasurementUnit MU
    ON MU.measurementunit_id = PO.measurementunit_id
WHERE PR.projectrun_id = (
        SELECT R.projectrun_id
        FROM results.ProjectRun R
        INNER JOIN wres.Project RP
            ON RP.project_id = R.project_id
        WHERE RP.username LIKE '%::Chris'
            AND RP.success = true
            AND RP.project_name = 'scenario001'
        ORDER BY end DESC, start DESC
        LIMIT 1
    )
ORDER BY lid, variable_name, earliest_lead, latest_lead, earliest_date, latest_date;
```
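As a rough sketch of the scalar case, the flow could look like the following, using SQLite for brevity. The single `project_output` table and its columns are simplified stand-ins for illustration, not the actual proposed schema, which splits this across `ProjectRun`, `ProjectOutput`, etc.

```python
import sqlite3

# Hypothetical, simplified single-table stand-in for the proposed schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE project_output (
        lid TEXT,
        variable_name TEXT,
        earliest_lead INTEGER,
        latest_lead INTEGER,
        metric TEXT,
        output_value REAL,
        unit_name TEXT
    )
""")

# Insert rows exactly as we would otherwise write them to CSV.
rows = [
    ("DRRC2", "streamflow", 1, 1, "MEAN SQUARE ERROR All data", 0.048554, "CMS"),
    ("DRRC2", "streamflow", 2, 2, "MEAN SQUARE ERROR All data", 0.114108, "CMS"),
]
conn.executemany("INSERT INTO project_output VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# Query back, ordered as in the fuller SQL above.
result = conn.execute(
    "SELECT lid, earliest_lead, output_value FROM project_output "
    "ORDER BY lid, earliest_lead"
).fetchall()
```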

Related issue(s): #260 Redmine related issue(s): 49298, 54731, 64642, 87132, 92406, 117578, 118831, 129095


epag commented 2 months ago

Original Redmine Comment Author Name: Chris (Chris) Original Date: 2018-01-11T22:03:02Z


Setting to low priority and TBD. Also thought that it may be useful to store version information with the project run.

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2018-01-12T12:09:04Z


Couple of thoughts for when we get to this, eventually.

I think we'd want to separate out the data by all of the attributes we use to distinguish input data, because that is often how a query will be formed. In some cases, a user may want to obtain results from the database with a particular project identifier. Picture an OWP user in this case, or someone at an RFC that wants to retrieve outputs from one of their own projects. In other cases, they will want to probe what the database has without a project identifier using information they know about, such as the location id, variable id, period of record, metric name, forecast model name etc. Picture someone behind a web browser looking for verification outputs for services provided by the NWS Hydrology Program. These users will have no concept of the WRES, projects etc. They will just know about things like locations, variables, time periods, model names etc. for which they want to explore verification outputs.

Here's an ERD for the EVS database. We cannot use this directly, but it should give a sense of what I thought would be important to model w/r to the EVS outputs. The result_1d table basically stores the type of output we can currently generate with our WRES CSV output, i.e. verification scores. The result_2d table stores diagram output (something with a domain axis and a range axis). The result_3d table stores output with multiple range axis values for each domain axis value (like box plots). The schema may be horrible, I don't know - I'm a total novice with databases - but I offer it primarily as a mental model of the types of information we will need to distinguish. In some areas, we could start simpler, because the EVS has functionality that is not currently supported by WRES. In other areas, we will have to complicate slightly. For example, the EVS only pools by lead times. It cannot handle the more general pooling windows that the WRES allows. As a minimum, we would need to model some of the information that is currently stored in the result_1d, result_2d, and result_3d tables.
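The 1d/2d/3d distinction above can be sketched in DDL. All column names here are invented for illustration; the actual EVS schema differs.

```python
import sqlite3

# Minimal sketch of the result_1d/2d/3d distinction; invented columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Scores: one value per pool (what WRES CSV output covers today).
    CREATE TABLE result_1d (
        pool_id INTEGER,
        metric TEXT,
        value REAL
    );
    -- Diagrams: (domain, range) pairs per pool, e.g. a reliability diagram.
    CREATE TABLE result_2d (
        pool_id INTEGER,
        metric TEXT,
        domain_value REAL,
        range_value REAL
    );
    -- Box plots and similar: several range values per domain value.
    CREATE TABLE result_3d (
        pool_id INTEGER,
        metric TEXT,
        domain_value REAL,
        range_index INTEGER,  -- e.g. which quantile of the box
        range_value REAL
    );
""")
tables = sorted(
    r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
)
```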

!evsdb.PNG!

epag commented 2 months ago

Original Redmine Comment Author Name: Chris (Chris) Original Date: 2018-01-12T15:24:33Z


Unless we nuke out all of the information involved in a project (i.e. the values in the source table and all of the time series data), the only things missing from the proposed model are some of the values in the result tables (like "result_upper_bound"). Is that a bound on how the results are presented or on how they were formed? As far as how the information was formed, I thought you were adamantly against storing any configuration metadata at all. I remember there being a pretty big debate over this about a year ago (the whole conversation about storing configuration data in the database vs. the XML). If we do want to store the project configuration data, that would be really easy to form tables around.

At some point down the line, can you give some conceptual definitions for some of these terms? I can assume what "result_upper_bound_probability" means in my own terms, but that's failed us in the past. Something of the form:

| Table | Name | Hydrological Definition | Implementation Definition | Explanation for a five year old | Example |
|---|---|---|---|---|---|
| banana | peel color | The color of a banana | Indicates ripeness | The ripeness of a banana can be indicated by the color of its peel, so we store that information so we can remember the color of the peel of our banana | Our banana is brown, meaning it is too ripe. |

would be perfect.

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2018-01-12T15:36:55Z


Sorry, not meant to be a distraction. Just wanted to add some notes while I remembered. We can come back to this.

It seems as though you have a Metric called "MEAN SQUARE ERROR All data", for example. That is a metric and a threshold and would need to be broken out.

> As far as how information was formed, I thought you were adamantly against storing any configuration metadata at all.

Where is this question coming from...? I need to understand the sentence you're referring to, in order to respond usefully.

Yes, all these attributes can be explained in due course. Some of them won't be necessary to begin with. Some of those bounds you refer to are "measures of confidence" about the verification results. Say you have a mean error. You think it is 0.56, based on the sample data. But sample data is imperfect, so it's just a best guess, a central estimate if you like. You can also express confidence by, for example, saying that the mean error is between 0.43 and 0.63 with a central estimate of 0.56. The degree of confidence has a certain probability attached to it, like (0.05,0.95). In other words, there's a 5% chance the mean error could even be below 0.43, and a 5% chance it could be above 0.63. But the system cannot provide these bounds at present, and we probably don't need to worry about them for some time.
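The kind of interval James describes could come from a simple bootstrap. The sketch below is illustrative only: the sample values are made up, and WRES does not compute these bounds today.

```python
import random

# Illustrative bootstrap confidence interval for a mean error.
# The sample values below are hypothetical.
random.seed(42)
errors = [0.4, 0.7, 0.5, 0.6, 0.3, 0.8, 0.55, 0.62]
n = len(errors)

# Central estimate: the sample mean error.
central = sum(errors) / n

# Resample with replacement many times and collect the resampled means.
means = []
for _ in range(2000):
    resample = [random.choice(errors) for _ in range(n)]
    means.append(sum(resample) / n)
means.sort()

# The (0.05, 0.95) bounds: a 5% chance the mean error is below the lower
# bound and a 5% chance it is above the upper bound, as in the text above.
lower = means[int(0.05 * len(means))]
upper = means[int(0.95 * len(means))]
```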

epag commented 2 months ago

Original Redmine Comment Author Name: Chris (Chris) Original Date: 2018-01-12T15:47:20Z


Awesome. I know exactly how we can expand on the prospective schema to get what we're looking for.

As for being against storing configuration data, this was a hotly debated topic last February/March when I was selling the idea of using the database. I was for storing that configuration data in the database and only updating it on first run or change, so that we could use the stored configuration data to drive logic without having to keep all of it in memory and re-evaluate it on every run. At the time, you were pretty firmly against using the database for much of anything (other than for temporarily storing values for each execution, if I recall correctly).

If I have a period of time where I can tweak the model, I'll work on it a little bit. That may not be anytime soon, since this isn't a high priority. Thanks for providing the schema for the EVS model - that'll help with requirements.

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2018-01-12T15:55:04Z


Right, sounds good.

No, I mean, where in my comments did I suggest storing project configuration?

I mean, I'm still against storing project configuration in the database, except in an execution log.

All I'm suggesting is that we can select on the relevant attributes of the project configuration and associated data (i.e. a "project") that fully defines the outputs. But it would not be attached to a specific project configuration. These are just attributes that define the outputs, like the time window used. It's just about getting the datamodel right, like we do for inputs (and a lot of the outputs share that stuff).

I'm guessing we have some crossed wires somewhere?

epag commented 2 months ago

Original Redmine Comment Author Name: Chris (Chris) Original Date: 2018-01-12T16:23:46Z


Most likely just cross wires. What I mean by storing the project configuration is storing the information explaining how we got the output we required - that is the project configuration. That is what defined how we achieved our results. Do you instead mean data that exists upon the MetricOutput objects?

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2018-01-12T16:34:30Z


Christopher.Tubbs wrote:

> Most likely just cross wires. What I mean by storing the project configuration is storing the information explaining how we got the output we required - that is the project configuration. That is what defined how we achieved our results. Do you instead mean data that exists upon the MetricOutput objects?

Sounds like that's the cross. Yes.

To me: "project configuration" is an XML file that contains configuration, like we log in the execution log. It isn't the data or attributes of the data that comes from it, which is what we're interested in. A project drives it (and so it plays a role w/r to metadata), but it doesn't define it. Eventually, the datamodel for the verification output will need to look an awful lot like what the MetricOutput and MetricOutputMetadata can store, yes.

But the cross wires, I think, comes from me having a literal interpretation of project configuration as XML input. I don't want to store that XML in the database, except in an execution log.

OK?

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2018-09-13T20:06:24Z


Table definition and example spreadsheet are attached for the stats table used by Alex for ROE.

Thanks,

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-08-21T14:07:05Z


From Chris (who authored the ticket and is still receiving emails):

> I just got pinged on ticket #45466 due to an update from James on ticket #118831. I've played with SQLite and geopackage a good bit since leaving the WRES team and wow would SQLite be a fantastic output format. It'd blow the CSV away and make WRES output 1000x easier to use. ALL secondary operations on output would be easier to work with, too.

When I asked if the benefit would be primarily for Python users, he said it would likely be more general:

> I'd say just about anything, since your data would be in memory, typed, and out of shared processes (i.e. not conflicting with postgres). The CSV isn't typed and can make parsing a tad difficult.

He also recommended an ORM if we pursue it further. Just thought I would post his feedback,
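The typed-output point can be illustrated with a round trip through each format. The table and column names below are made up for illustration, not a proposed WRES format.

```python
import csv
import io
import sqlite3

# A made-up statistic value; this only illustrates typing.
value = 0.048554

# CSV round trip: everything comes back as a string.
buf = io.StringIO()
csv.writer(buf).writerow(["DRRC2", value])
lid, csv_value = next(csv.reader(io.StringIO(buf.getvalue())))
# csv_value is the string "0.048554"; the consumer must parse it.

# SQLite round trip: the declared REAL type is preserved.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stats (lid TEXT, value REAL)")
conn.execute("INSERT INTO stats VALUES (?, ?)", ("DRRC2", value))
db_value = conn.execute("SELECT value FROM stats").fetchone()[0]
# db_value is already a float; no parsing step needed.
```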

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-08-21T14:18:50Z


I think SQLite is definitely something we could look at adding as an output format. Good suggestion, especially if it would increase use within OWP. Unsure if you want to add this as a task linked to NextGen use for the upcoming reauth, Hank, but it could make sense.

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-08-21T14:43:02Z


I have a meeting with Jay later today, but, currently, the text in the charter for NextGen/MaaS in the FY2024-2025 section says this:

> NextGen/MaaS users. Once the evaluations to support comparison of model formulations are identified (we recommend APD work with ISED to identify those evaluations), the WRES will be enhanced, as needed, to support those evaluations, serving as an engine to compute needed statistics. Additionally, we will improve the usability of the WRES, sharing the code publicly and making it easier to incorporate within a microservice architecture. This will be completed through Agile development addressing requirements as identified, with no specific major milestone to be tracked.

Exact language is still being edited. SQLite would probably fit in the generic "making it easier to incorporate within a microservice architecture". I don't spell out specifics for that (as stated, there are no milestones), but I could. For example, the current server work from Evan would help with that.

I anticipate sharing that document for internal review later today or tomorrow. Thanks,

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-11-21T17:48:56Z


The need for this arises again in the context of #117578, where there is a need to produce summary statistics across one or more pooling dimensions within an evaluation. For example, imagine a histogram that summarizes the distribution of skill scores across all locations for a given time window (e.g., lead time). These types of summary statistics are derived from many pools together, not from each pool separately. Having a schema to organize the statistics for an evaluation and to query against arbitrary dimensions when producing summary statistics from them would help a lot.
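With a statistics schema in place, the histogram described above reduces to a single query across pools. The table, column names, and values below are hypothetical, for illustration only.

```python
import sqlite3

# Hypothetical statistics table: one skill score per (location, lead time) pool.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE skill (lid TEXT, lead_hours INTEGER, score REAL)")
conn.executemany(
    "INSERT INTO skill VALUES (?, ?, ?)",
    [("DRRC2", 24, 0.61), ("DOLC2", 24, 0.55), ("CREC1", 24, 0.12),
     ("DRRC2", 48, 0.40), ("DOLC2", 48, 0.31)],
)

# Histogram of skill at a 24 h lead time across all locations, bin width 0.25:
# many pools (locations) are summarized together by one GROUP BY.
bins = conn.execute("""
    SELECT CAST(score / 0.25 AS INTEGER) AS bin, COUNT(*) AS n
    FROM skill
    WHERE lead_hours = 24
    GROUP BY bin
    ORDER BY bin
""").fetchall()
```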

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2024-02-01T13:11:31Z


Increasing the priority of this ticket. It is only a matter of time before we encounter an evaluation where the statistics required for summary statistics calculation are larger than memory. Moreover, it would be really nice to generate statistics in a database format, like SQLite, and a statistics schema would facilitate that. The latter is essential if we want downstream users/integrators to easily generate custom products/graphics with WRES statistics.