njmattes opened this issue 8 years ago
OK, how should we deal with NULL values at this point? E.g., what if the point (39.75, 34.75) only has one non-null value, 195, for "1980-01-01 00:00:00", and (39.25, 34.75) only has one non-null value as well, say 190, for "1980-12-31 00:00:00"?
Should I return:
{ "data": [{ "geometry": {
"type": "Point",
"coordinates": [ 39.75, 34.75 ] },
"type": "Feature",
"properties": {
"values": [ 195, NULL ]}},
{ "geometry": {
"type": "Point",
"coordinates": [ 39.25, 34.75 ] },
"type": "Feature",
"properties": {
"values": [ NULL, 190 ]}}],
"metadata": {
"dates": [
"1980-01-01 00:00:00",
"1980-12-31 00:00:00"
]}}
or
{ "data": [{ "geometry": {
"type": "Point",
"coordinates": [ 39.75, 34.75 ] },
"dates": ["1980-01-01 00:00:00"] },
"type": "Feature",
"properties": {
"values": [ 195 ]}},
{ "geometry": {
"type": "Point",
"coordinates": [ 39.25, 34.75 ] },
"dates": [ "1980-12-31 00:00:00"] },
"type": "Feature",
"properties": {
"values": [ 190 ]}}],
}
In an aggregation, null values should be ignored. So if you are averaging 35 grid cells in a given region for a given time and 5 of the values are null, the average should simply be calculated over the other 30 values, ignoring those 5. Only if all 35 values are null should the resulting aggregated value be null. Is that what you were asking?
Different statistical operations would require different minimum numbers of non-null values. For example, if you are calculating a standard deviation over a set of data, I think you would require at least 3 non-null values.
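A minimal sketch of that rule in Python, assuming plain lists with None standing in for null (the function name and the min_count knob are illustrative, not part of any existing API):

```python
import statistics

def agg_ignore_nulls(values, op="mean", min_count=1):
    """Aggregate cell values, skipping None entries.

    Returns None when fewer than min_count non-null values remain
    (e.g. min_count=3 for a standard deviation, per the comment above).
    """
    non_null = [v for v in values if v is not None]
    if len(non_null) < min_count:
        return None
    if op == "mean":
        return statistics.mean(non_null)
    if op == "stdev":
        return statistics.stdev(non_null)  # itself needs at least 2 values
    raise ValueError("unsupported op: %s" % op)

# 35 cells, 5 of them null: the mean is taken over the remaining 30.
# agg_ignore_nulls(cells, "mean")
```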
@joshuaelliott thanks for the clarification on the stats part. I think @legendOfZelda's question is more about whether he should handle the nulls in the database or pass them on to the middleware. Either way, it's useful guidance for the stats calculations we'll be performing.
@legendOfZelda I think passing the nulls would be interesting, especially to reflect that a certain timeframe has nulls and represent it accordingly. @njmattes what do you think?
I'd pass the null values to the middleware as JavaScript's `null` object. But if a centroid contains null values for each time step, you can avoid passing the centroid entirely. On the front end, `null`s are discarded as far as graphing, calculating, and visualizing are concerned.
got it @njmattes
In the new API responses on the newly created server, the pixels again aren't 'collapsed' so that multiple timesteps are contained in a single pixel. I.e., a single pixel appears 11k+ times in the dataset, each time with a single value. Instead each pixel should appear only once, with an 11k+ long array of values.
In the aggregated responses there appear to be only single values rather than 11k+ values, one for each time step. Am I seeing that correctly?
Also, a new problem to tackle is the size of the time series in the `response.metadata`. For AgMERRA (and other datasets, no doubt) there is a lot of duplicate information (the same year hundreds of times, for instance), which increases the size of the response to the point where it becomes infeasible. There are many possible solutions, but this one comes to mind:
Perhaps the time series could store only the first time step and the size of the delta? Then in the middleware we can unpack that into values for the front end. Of course irregular time series, if we have any or want to support any, wouldn't work with this.
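If the steps really are uniform, a rough sketch of that unpacking, assuming the metadata carries a start string, a delta in seconds, and a count (all names here are hypothetical):

```python
from datetime import datetime, timedelta

def expand_time_axis(start, delta_seconds, count):
    """Rebuild the full list of timestamps from (start, delta, count)."""
    t0 = datetime.strptime(start, "%Y-%m-%d %H:%M:%S")
    step = timedelta(seconds=delta_seconds)
    return [(t0 + i * step).strftime("%Y-%m-%d %H:%M:%S") for i in range(count)]

# expand_time_axis("1980-01-01 00:00:00", 86400, 366)
# -> ["1980-01-01 00:00:00", "1980-01-02 00:00:00", ..., "1980-12-31 00:00:00"]
```

The metadata would then need only three values per time axis instead of one string per timestep; irregular axes would still have to be listed explicitly.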
You're right, I haven't done it yet (except for temporal aggregation, naturally)! Sorry about that; I'm fixing it today along with cleaning up the code, in particular adding try/except blocks.
pSIMS and AgMERRA have uniform time steps. However, GSDE (soil data) doesn't have uniform depth steps. So we could assume a uniform delta for time but not for depth. @joshuaelliott can we assume the time steps are uniform? Or, if they're not uniform, that at least there aren't many timesteps?
By nesting you just mean `[(1980, [(1, [1, 2, ..., 31]), (2, [1, 2, ..., 30]), ..., (12, [1, 2, ..., 31])]), (1981, ...)]`?
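For reference, a small sketch of how such a nested year/month/day structure could be flattened back into date strings in the middleware (purely illustrative, not existing code):

```python
def flatten_nested_dates(nested):
    """Turn [(year, [(month, [day, ...]), ...]), ...] into 'YYYY-MM-DD' strings."""
    return [
        "%04d-%02d-%02d" % (year, month, day)
        for year, months in nested
        for month, days in months
        for day in days
    ]

# flatten_nested_dates([(1980, [(1, [1, 2, 3])])])
# -> ['1980-01-01', '1980-01-02', '1980-01-03']
```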
I believe you can always assume uniform steps for time, yes.
Depth dimensions generally don't have more than ~10 values, so probably you can just record depth explicitly, and then only treat time as this special case.
Sorry, I wasn't getting email notifications for this issue. Totally missed these comments.
Yes, by 'nesting' I meant something like you've got above @legendOfZelda.
I think depth and time can be treated the same way really, it's just that depth won't have multiple nested levels. Same for datasets like pAPSIM—with only 34 annual timesteps, it'd just be something like `[(1979, 1980, 1982, ...)]` per Severin's example above.
Speaking of depth v. time v. other dimensions, we're also going to need to add a dimension name to the response metadata. So users know what they're looking at and graphing (have to label axes somehow).
@njmattes `wget http://[IP]:5000/api/v0/griddata/dataset/1/var/1` should now return the right format, with the values in an array for every (lat, lon).
Sorry, but the response from the dev server is still a bit off. Now each point has its own dates and different lengths of values. The dates should be in the `metadata`, and each point should have the same number of values, with `null` values where necessary. Any point that contains all `null` values should be dropped.
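Something along these lines in the query layer or middleware, as a sketch only (the per-point {date: value} mapping and the helper name are assumptions, not existing code):

```python
def align_values(point_samples, all_dates):
    """Pad a point's sparse {date: value} samples out to the global date list.

    Dates with no sample become None (JSON null); a return of None means the
    point has no non-null values at all and should be dropped from the response.
    """
    values = [point_samples.get(d) for d in all_dates]
    return values if any(v is not None for v in values) else None

# align_values({"1980-01-01 00:00:00": 195}, dates) -> [195, None, ...]
# align_values({}, dates) -> None  (drop the point)
```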
Perhaps details of the API are spread across too many threads at this point. I can start a fresh thread for an updated API discussion. Or maybe we need a better tool than github issues for sorting out the changes to the API? Postman is pretty rad, but its team-based features require subscriptions.
@Severin: A GROUP BY will result in the nulls being dropped unless your query accounts for them. As far as I can see, the query does not.
@Nate: Do you want the exact positions of the nulls or simply a count?
Regarding the API spec, yes, we need a better way to handle these changes.
Tanu
@TanuMalik The `null` values should be in the correct positions so that they align with the datetimes in the `metadata` section.
Yep, I'm rewriting the query. I'm adding a helper column to grid_dates that will help me locate where the non-NULLs are; I expect to be done with it tonight.
@njmattes done, you can test. It takes longer now though, roughly 2 min 20 s.
@njmattes can you please confirm whether the format is OK?
I can have a look later today—I'm in class until 5ish.
Yep, this looks right. It runs too slowly to actually hook up to Atlas though. The provisional Mongo backend is rounding the values to save space—that might help save time in the transfer, but not in the execution itself, I guess.
Unrelated to this, I notice that when I request `/api/v0/griddata/dataset/1/var/1`, I see `"region": [[-180, -90], [180, -90], [180, 90], [-180, 90], [-180, -90]]`. When I request from `gridmeta` I get a polygon with `"coordinates": [[[-179.75, -89.75], [179.75, -89.75], [179.75, 89.75], [-179.75, 89.75], [-179.75, -89.75]]]`. Is this expected?
I understand that the reason it is slow may be because we haven’t created indices. I don’t understand why that hasn’t been done. (I imagine that the database we are working with here is more complex than the MySQL on my laptop, but I create indices as a matter of course.) Is that planned?
I did create indexes, all the reasonable ones, including on the PostGIS raster field.
I'm improving the query and the Python code now. I hope to cut the response time down considerably, certainly to below a minute, and proceed from there.
Interesting, thanks!
@legendOfZelda If you need help looking at the python, vectorizing operations, or anything like that, let me know.
@legendOfZelda I was just checking the responses from the server and I notice it's down. I think it's just not listening on port 5000. Did you change the port—or is it just down for a bit? If there's a problem with the `flask` app, I'm happy to take a look.
@njmattes I was just experimenting with it; it's running again.
@njmattes if you meant the .57 machine, I'm cleaning it up there now and will restart it; I'll let you know when it's up again.
@njmattes it's up and running on the .57 machine. Let me know if there's any issue.
Instead of returning the following for two points with two values each, we should be returning single points with all of the values for that point in a single array, like the sketch below.
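A sketch of the collapsed shape being described, written as the Python structure the endpoint would serialize (coordinates, values, and dates are placeholders, not real output; `None` becomes JSON `null`):

```python
collapsed_response = {
    "data": [
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [39.75, 34.75]},
         # one slot per date in metadata["dates"]; None -> JSON null
         "properties": {"values": [195, None, 210]}},
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [39.25, 34.75]},
         "properties": {"values": [None, 190, 188]}},
    ],
    "metadata": {
        "dates": ["1980-01-01 00:00:00",
                  "1980-01-02 00:00:00",
                  "1980-01-03 00:00:00"],
    },
}
```

Each point appears exactly once, every values array has the same length as metadata["dates"], and points whose values are all null are omitted.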