RDCEP / EDE


Associate griddata values by centroids #34

Open njmattes opened 8 years ago

njmattes commented 8 years ago

Instead of returning the following for two points with two values each,

{ "data": [{ "geometry": {
  "type": "Point",
  "coordinates": [ 39.75, 34.75 ] },
  "type": "Feature",
  "properties": {
    "values": [ 195 ]}},
{ "geometry": {
  "type": "Point",
  "coordinates": [ 39.75, 34.75 ] },
  "type": "Feature",
  "properties": {
    "values": [ 197 ]}},
{ "geometry": {
  "type": "Point",
  "coordinates": [ 39.25, 34.75 ] },
  "type": "Feature",
  "properties": {
    "values": [ 191 ]}},
{ "geometry": {
  "type": "Point",
  "coordinates": [ 39.25, 34.75 ] },
  "type": "Feature",
  "properties": {
    "values": [ 190 ]}}],
"metadata": {
  "dates": [
    "1980-01-01 00:00:00",
    "1980-12-31 00:00:00"
  ]}}

we should return a single feature per point, with all of that point's values collected in one array, like

{ "data": [{ "geometry": {
  "type": "Point",
  "coordinates": [ 39.75, 34.75 ] },
  "type": "Feature",
  "properties": {
    "values": [ 195, 197 ]}},
{ "geometry": {
  "type": "Point",
  "coordinates": [ 39.25, 34.75 ] },
  "type": "Feature",
  "properties": {
    "values": [ 191, 190 ]}}],
"metadata": {
  "dates": [
    "1980-01-01 00:00:00",
    "1980-12-31 00:00:00"
  ]}}
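
For reference, a minimal sketch (Python; the helper name is purely illustrative) of how the per-timestep features above could be collapsed into one feature per centroid:

from collections import OrderedDict

def collapse_by_centroid(features):
    # Group GeoJSON point features by their coordinates, merging the
    # per-timestep 'values' arrays into a single array per centroid.
    grouped = OrderedDict()
    for feat in features:
        key = tuple(feat['geometry']['coordinates'])
        if key not in grouped:
            # First time we see this centroid: keep its geometry, start an empty value list.
            grouped[key] = {'type': 'Feature',
                            'geometry': feat['geometry'],
                            'properties': {'values': []}}
        grouped[key]['properties']['values'].extend(feat['properties']['values'])
    return list(grouped.values())
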
ghost commented 8 years ago

ok. how should we deal with NULL values at this point? e.g. what if the point (39.75, 34.75) only has one non-null value, 195, for "1980-01-01 00:00:00", and (39.25, 34.75) only has one non-null value as well, say 190, for "1980-12-31 00:00:00"?

should i return:

{ "data": [{ "geometry": {
  "type": "Point",
  "coordinates": [ 39.75, 34.75 ] },
  "type": "Feature",
  "properties": {
    "values": [ 195, NULL ]}},
{ "geometry": {
  "type": "Point",
  "coordinates": [ 39.25, 34.75 ] },
  "type": "Feature",
  "properties": {
    "values": [ NULL, 190 ]}}],
"metadata": {
  "dates": [
    "1980-01-01 00:00:00",
    "1980-12-31 00:00:00"
  ]}}

or

{ "data": [{ "geometry": {
  "type": "Point",
  "coordinates": [ 39.75, 34.75 ] },
  "dates": ["1980-01-01 00:00:00"] },
  "type": "Feature",
  "properties": {
    "values": [ 195 ]}},
{ "geometry": {
  "type": "Point",
  "coordinates": [ 39.25, 34.75 ] },
  "dates": [ "1980-12-31 00:00:00"] },
  "type": "Feature",
  "properties": {
    "values": [ 190 ]}}],
}
joshuaelliott commented 8 years ago

In an aggregation, null values should be ignored. So if you are averaging 35 grid cells in a given region for a given time and 5 of the values are null, the average should simply be calculated over the other 30 values, ignoring those 5. Only if all 35 values are null should the resulting aggregated value be null. Is that what you were asking?

joshuaelliott commented 8 years ago

different statistical operations would require different minimum numbers of non-null values. for example if you are calculating a standard deviation over a set of data, i think you would require at least 3 non-null values.
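
A rough sketch of that rule (plain Python; the names and the min_count parameter are illustrative, not part of the API): ignore nulls in the aggregation, and return null when too few non-null values remain.

def aggregate(values, func, min_count=1):
    # Apply `func` to the non-null values only; return None when fewer than
    # `min_count` non-null values are available (e.g. a standard deviation
    # might require min_count=3, per the comment above).
    non_null = [v for v in values if v is not None]
    if len(non_null) < min_count:
        return None
    return func(non_null)

# Averaging 35 grid cells where 5 are null: the mean is taken over the other 30.
cells = [None] * 5 + list(range(30))
mean = aggregate(cells, lambda vs: sum(vs) / len(vs))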

ricardobarroslourenco commented 8 years ago

@joshuaelliott thanks for the clarification on the stats part. I think @legendOfZelda's question is more about whether he should handle the nulls in the database or pass them on to the middleware. Either way, it's useful guidance for the stats calculations we'll be performing.

@legendOfZelda I think passing the nulls along would be useful, especially to reflect that a certain timeframe has nulls and to represent that accordingly. @njmattes what do you think?

njmattes commented 8 years ago

I'd pass the null values to the middleware as JavaScript's null object. But if a centroid contains null values for each time step, you can avoid passing the centroid entirely.

On the front end, nulls are discarded as far as graphing, calculating, and visualizing are concerned.
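
A tiny sketch (Python, illustrative only) of the rule just described: nulls are passed through as-is, but any centroid whose values are null at every time step is dropped.

def drop_all_null_centroids(features):
    # Keep a feature only if at least one of its values is non-null;
    # nulls inside a kept feature are passed through unchanged.
    return [feat for feat in features
            if any(v is not None for v in feat['properties']['values'])]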

ghost commented 8 years ago

got it @njmattes

njmattes commented 8 years ago

In the new API responses on the newly created server, the pixels again aren't 'collapsed' so that multiple timesteps are contained in a single pixel. That is, a single pixel appears 11k+ times in the dataset, each time with a single value. Instead, each pixel should appear only once, with an 11k+-long array of values.

In the aggregated responses there appear to be only single values rather than 11k+, one for each time step. Am I seeing that correctly?

Also, a new problem to tackle is the size of the time series in the response.metadata. For AgMerra (and no doubt other datasets), there is a lot of duplicate information (the same year hundreds of times, for instance), which inflates the response to the point where it becomes infeasible. There are many possible solutions, but see the next comment for one that comes to mind.

njmattes commented 8 years ago

Perhaps it's possible for the time series to store only the first time step and the size of the delta? Then in the middleware we can unpack that into values for the front end. Of course that wouldn't work for irregular time series, if we have any or want to support any.
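
As a hedged sketch (Python; the field names here are assumptions, not the actual API), storing only the first time step plus a fixed delta could look like this, with the middleware unpacking it for the front end:

from datetime import datetime, timedelta

def unpack_time_axis(meta):
    # Expand {'start': ..., 'delta_days': ..., 'count': ...} into the full
    # list of timestamps. Only valid for uniform (regular) time steps.
    start = datetime.strptime(meta['start'], '%Y-%m-%d %H:%M:%S')
    step = timedelta(days=meta['delta_days'])
    return [start + i * step for i in range(meta['count'])]

# e.g. a daily series starting 1980-01-01 with 366 steps (1980 is a leap year)
dates = unpack_time_axis({'start': '1980-01-01 00:00:00', 'delta_days': 1, 'count': 366})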

ghost commented 8 years ago

you're right, i haven't done it yet (except for temporal aggregation, naturally)! sorry about that, fixing it today along with cleaning up the code, in particular adding try/except blocks.

psims + agmerra have uniform time steps. however, gsde (soil data) doesn't have uniform depth steps. so we could assume a uniform delta for time but not for depth. @joshuaelliott can we assume the time steps are uniform? or, if they're not uniform, that at least there aren't many timesteps?

by nesting you just mean: [(1980, [(1, [1, 2, ..., 31]), (2, [1, 2, ..., 29]), ..., (12, [1, 2, ..., 31])]), (1981, ...)]
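
For what it's worth, a small illustrative sketch (Python) of flattening that nested structure back into a flat list of dates in the middleware:

from datetime import date

def flatten_nested_dates(nested):
    # Turn [(1980, [(1, [1, 2, ..., 31]), ...]), (1981, ...)] into a flat
    # list of datetime.date objects.
    return [date(year, month, day)
            for year, months in nested
            for month, days in months
            for day in days]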

joshuaelliott commented 8 years ago

i believe you can always assume uniform steps for time, yes.

depth dimensions generally don't have more than ~10 values. so probably you can just record depth explicitly? and then only treat time as this special case.

njmattes commented 8 years ago

Sorry, I wasn't getting email notifications for this issue. Totally missed these comments.

Yes, by 'nesting' I meant something like you've got above @legendOfZelda.

I think depth and time can really be treated the same way; it's just that depth won't have multiple nested levels. Same for datasets like pAPSIM—with only 34 annual timesteps, it'd just be something like [(1979, 1980, 1982, ... )] per Severin's example above.

Speaking of depth vs. time vs. other dimensions, we're also going to need to add a dimension name to the response metadata, so users know what they're looking at and graphing (we have to label the axes somehow).
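
As a strawman only (not an agreed spec; the field names and depth levels below are made up), the per-dimension metadata could then look something like this, with depth listed explicitly and time delta-encoded:

dimensions = [
    {'name': 'time', 'start': '1980-01-01 00:00:00', 'delta_days': 1, 'count': 366},
    {'name': 'depth', 'units': 'cm', 'values': [5, 15, 30, 60, 100, 200]},  # hypothetical depth levels
]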

ghost commented 8 years ago

@njmattes wget http://[IP]:5000/api/v0/griddata/dataset/1/var/1 should now return the right format, with the values for every (lat, lon) in a single array.

njmattes commented 8 years ago

Sorry, but the response from the dev server is still a bit off. Now each point has its own dates and a different number of values. The dates should be in the metadata, and each point should have the same number of values, with null values where necessary. Any point that contains all null values should be dropped.

Perhaps details of the API are spread across too many threads at this point. I can start a fresh thread for an updated API discussion. Or maybe we need a better tool than github issues for sorting out the changes to the API? Postman is pretty rad, but its team-based features require subscriptions.

TanuMalik commented 8 years ago

@Severin: A GROUP BY will result in the nulls being dropped unless your query accounts for them. As far as I can see, the query does not.

@Nate: Do you want the exact position of nulls or simply a count?

Regarding API spec, yes we need a better way to handle these changes.

Tanu


njmattes commented 8 years ago

@TanuMalik The null values should be in the correct positions so that they align with the datetimes in the metadata section.

ghost commented 8 years ago

yep, i'm rewriting the query. i'm adding a helper column to grid_dates that will help me locate where the non-NULLs are; expect to be done with it tonight.

ghost commented 8 years ago

@njmattes done, you can test, takes longer now though, expect 2 mins, 20s roughly

TanuMalik commented 8 years ago

@njmattes can you please confirm the format if it is ok?


njmattes commented 8 years ago

I can have a look later today—I'm in class until 5ish.

njmattes commented 8 years ago

Yep, this looks right. It runs too slow to actually hook up to Atlas though. The provisional Mongo backend is rounding the values to save space—that might help save time in the transfer. But not the execution itself I guess.

Unrelated to this, I notice that when I request /api/v0/griddata/dataset/1/var/1, I see "region": [[-180, -90], [180, -90], [180, 90], [-180, 90], [-180, -90]]. When I request from gridmeta I get a polygon with "coordinates": [[[-179.75, -89.75], [179.75, -89.75], [179.75, 89.75], [-179.75, 89.75], [-179.75, -89.75]]]. Is this expected?

ianfoster commented 8 years ago

I understand that the reason it is slow may be because we haven’t created indices. I don’t understand why that hasn’t been done. (I imagine that the database we are working with here is more complex than the MySQL on my laptop, but I create indices as a matter of course.) Is that planned?


ghost commented 8 years ago

i did create indexes, all reasonable ones, including on the postgis raster field

ghost commented 8 years ago

i'm improving the query now + the python code. hope to cut the response time down considerably, certainly to below a minute, and proceed from there.

ianfoster commented 8 years ago

Interesting, thanks!


njmattes commented 8 years ago

@legendOfZelda If you need help looking at the python, vectorizing operations, or anything like that, let me know.

njmattes commented 8 years ago

@legendOfZelda I was just checking the responses from the server and I noticed it's down. I think it's just not listening on port 5000. Did you change the port—or is it just down for a bit? If there's a problem with the Flask app, I'm happy to take a look.

ghost commented 8 years ago

@njmattes was just experimenting with it, it's running again

ghost commented 8 years ago

@njmattes if you meant the .57 machine, i'm cleaning it up there now + restarting, will let you know when it's up again.

ghost commented 8 years ago

@njmattes it's up and running on the .57 machine. let me know if there's any issue.