USACE / cwms-data-api

Corps Water Management System RESTful Data Service
MIT License
11 stars 14 forks source link

Add Filtering Parameters for Greater Than, Less Than, and Order to Timeseries Endpoint #620

Open krowvin opened 4 months ago

krowvin commented 4 months ago

Districts have a need for their public water management websites to filter data (values) on the server before it is returned.

Suggest adding: Greater Than, Less Than, and Order (desc/asc) to the Timeseries Endpoint for the return values example: ?filter=greater-than&filter-value=25

Otherwise you find yourself getting full period of record data via the client and having to filter out data you do NOT need after the fetch.

This makes for a very slow query and longer processing to filter on the client.

I do not think this merits a v3 of the timeseries if it can backport to the existing codebase (i.e. url parameters to add functionality)

Some districts are running their own SQL queries in Cold Fusion. Having this feature implemented now would ensure they can release their website without the use of Cold Fusion (or any other backend processing scripts that effectively do this).

Other districts are using python to run SQL for filtering to create JSON files with this data.

rma-rripken commented 4 months ago

greater-than and less-than what? You saw the existing begin and end parameters?

krowvin commented 4 months ago

Filters for the actual return values (edited post to include this)

One thought would be something like (just as an example)

?filter=greater-than&filter-value=25

rma-rripken commented 4 months ago

Should be just a single parameter "exclude-value-below" or maybe "value-min". If we're going to add those we should probably add some sort of filters on the quality values too.

krowvin commented 4 months ago

Should be just a single parameter "exclude-value-below" or maybe "value-min".

But these do not play nice with future filtering and would not make sense for the asc / desc would they?

If we're going to add those we should probably add some sort of filters on the quality values too

If this is trivial sure, but I am sure you have other pressing tasks. If less is more for this it would benefit to keep it simple at the start, because you could also apply this to the dates too (asc/desc) instead of making the client do it... Which flirts with the scope line. In my opinion!

rma-rripken commented 4 months ago

I'm not sure what other filters you've seen or envision. The filters can't get too crazy or it won't be efficient to write them in the queries.

You mentioned greater-than and less-than. In my opinion I'd prefer a single query parameter for each and let users combine them.
value-min=25&value-max=30000

If you have filter=greater-than&filter-value=25 then you have to repeat the filter parameter again and also repeat the filter-value parameter. filter=greater-than&filter-value=25&filter=less-than&filter-value=30000

The server can track the parameter order and pair things up.

As for an asc/desc flag - there are other things I'd want to do first but its certainly possible.

DanielTOsborne commented 4 months ago

So, this came up in a meeting, and it's about the paradigm of "keep computations close to the data." CDA does not operate with that philosophy.

As I'm sure everyone is aware, it's considerably more efficient to do filtering and computations (like aggregates) on, or as close to as possible, the database. After which only return the results to the client. It's why all the major databases support various aggregate functions and operations internally. If CDA is supposed to replace DB access, it needs to support the same operations that databases support.

If you want to fetch POR data and compute yearly averages, for example, you have to select all the data, and stream it to your client, then perform the averages locally. This significantly increases bandwidth usage. It also requires processing power on the client, which is unknown/unpredictable. What might take one user 30 seconds to perform an operation, it might take another 5 minutes. Performing the operations on the server-side allows more consistent performance metrics.

Or, if you need to filter for validations. If I want to check 15-minute data and only operate on validated data (specific quality code), I have to select everything and stream it to my client, even if only 1 value out of 35040 has the quality flag I want.

rma-rripken commented 4 months ago

I think basic threshold filtering on the value and quality make alot of sense and shouldn't be hard to do. I'm not sure how more complicated things would work. Maybe there would be a new aggregates or computations end-point? Lots of CDA is focused on exposing existing "entities". I think what you are suggesting (with the aggregates and computations) is a way of creating new timeseries based on computations but using the existing querying interface to do it. It sort of makes sense to query for it but does it also make sense to DELETE it?

If offices or districts have existing queries that they want access to via CDA we could potentially add a custom SWT end-point and expose their handful of already written and high-performance queries.

adamkorynta commented 4 months ago

The timeseries API follows the same paradigm as the PL/SQL design where you get a value for every time point. The CWMS client applications take advantage of this by only holding onto the date of the first time step and ditching the rest (for regular/local regular data) since you can always calculate the next time step given the interval and offset. I would not recommend patching the /timeseries endpoint to start removing timesteps as that changes the design significantly and if you're swapping filtered values for "UNDEFINED" values I don't think you're saving much in ways of performance.

DanielTOsborne commented 4 months ago

It sort of makes sense to query for it but does it also make sense to DELETE it?

DELETE from timeseries WHERE value < 0

For removing bogus negative values. Mostly an example, but I don't think the CWMS API supports that without going outside that schema.

The timeseries API follows the same paradigm as the PL/SQL design where you get a value for every time point. The CWMS client applications take advantage of this by only holding onto the date of the first time step and ditching the rest (for regular/local regular data) since you can always calculate the next time step given the interval and offset. I would not recommend patching the /timeseries endpoint to start removing timesteps as that changes the design significantly and if you're swapping filtered values for "UNDEFINED" values I don't think you're saving much in ways of performance.

I think that's the problem. CDA is only mapping to the CWMS PL/SQL API, but is pushed as the alternative to traditional SQL access. However traditional SQL access allows users to combine the CWMS API with additional SQL operations, to generate the desired result, all without leaving the DB server. This is what is not possible anymore, so CDA is not a suitable replacement for SQL access.

EDIT: To it in another perspective (don't take it for being argumentative, it's not intended as such): CDA v1 endpoints return direct results from CWMS API equivalent methods. That was not sufficient, so 'you' (CDA developers, of whom I include myself), wrote custom queries and exposed that as the v2 endpoint.

What 'you' were able to take advantage of (combining CWMS API with SQL), is something that's been removed from us (CDA/DB users), without supplying an equivalent replacement.

krowvin commented 4 months ago

potentially add a custom SWT end-point

Or any district - I know SWT is not the only district with these, there is a discussion open for this very topic #393

Edit: But this would also be ideally for the one-off situations that are truly needed (I would say this thread would benefit all districts, and possibly A2W)

DanielTOsborne commented 4 months ago

potentially add a custom SWT end-point

Or any district - I know SWT is not the only district with these, there is a discussion open for this very topic #393

I think we probably need to check in with all districts, and see what operations they do in (PL/)SQL, and see if there's any common operations we can put in non-district specific endpoints.

rma-rripken commented 4 months ago

I think basic threshold filtering on the value and quality make alot of sense and shouldn't be hard to do.

I want to take this back. Its harder than I thought.

JeremyDKellett commented 4 months ago

+1 to another office that uses this (“this” = equivalent functionality to querying cwms_v_tsv for values less then or greater then 123 in a date range).

MikeNeilson commented 4 months ago

I think that's the problem. CDA is only mapping to the CWMS PL/SQL API, but is pushed as the alternative to traditional SQL access. However traditional SQL access allows users to combine the CWMS API with additional SQL operations, to generate the desired result, all without leaving the DB server. This is what is not possible anymore, so CDA is not a suitable replacement for SQL access.

This isn't actually the philosophy of the API. Currently we're mapping to the "API" because it's convient and there is some parity required not to redesign all of the existing apps. But we can certainly work to expand what CDA does. Currently we are limited because there is direct access to the database, so if we put too much business logic (like the required data consistency logic and auth) in CDA then we have to be really careful not to open the general schema up so much that anyone who can log in can incorrectly tweak the data.

But that's just "now". And it's not terribly difficult to work around given the web_user role we have now. I want to at least keep any writes within the existing API but CDA now has direct read access to any table so we can get some fairly complex queries going where we need them.

That said, CDA is for general usage, there's likely always going to be certain things that required direct access but hopefully that is limited and is why we will make sure that a non-zero number of water management staff, and some extra applications, will still have that direct access. CDA should just be "the go to" until a problem arises and then we discuss, like above, exactly what should or shouldn't go into it.

MikeNeilson commented 4 months ago

So, this came up in a meeting, and it's about the paradigm of "keep computations close to the data." CDA does not operate with that philosophy.

For me that's intentional. Though I don't think it's all or nothing. My point is that there's definitely a balance of easy of development and maintanance vs performance. E.g. do you really want to maintain the Top Con calculations in PL/SQL?

For simple filtering and aggregates I do think it makes sense to include. But even an average isn't actually simple. Are doing a simple arithmetic average or does your data need to use something like the trapezoid rule to better catch the proper area under a curve (e.g. any instantaneous rate of change data; like you know, water flow.)

Though if we could incorporate the OpenDCS code (to no duplicate the math, yet again) and keep the data cached outside the database then you know, that's not a terrible balance between maintenance and performance.