USGS-R / protoloads

Prototyping and exploring options for broad-scale load forecasting

trying to fix subsetting of the NWM nc files #26

Closed jzwart closed 6 years ago

jzwart commented 6 years ago

We were getting weird values when reading the streamflow data from the subsetted nc files. We looked back at the subset_nwm.R script and think we fixed part of the problem. The streamflow dimensions were getting scrambled when put into the new_nc files. We now loop through the sites (and ref time if forecast) and use ncvar_put to insert streamflow data into new_nc, adjusting for the scale factor as we go. If the scale factor adjustment is skipped when the data is put into the nc, the data read back out will be 100x too small. Double-check our looping - we think we mirrored what you had but couldn't test it.

There were also 4 repeated COMIDs in the feature_id dimension. We're not sure why and did not know how to fix that. This is a secondary issue.
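A minimal sketch of the looping fix described above — untested, and the names `nc`, `new_nc`, and `site_indices` are hypothetical stand-ins for the objects in subset_nwm.R (the actual dimension order and start/count values may differ):

```r
library(ncdf4)

# NWM streamflow is stored as an int with a scale_factor attribute (0.01)
scale_factor <- ncatt_get(nc, "streamflow", "scale_factor")$value

for (i in seq_along(site_indices)) {
  # ncvar_get applies scale_factor on read, so `flow` holds real CMS values
  flow <- ncvar_get(nc, "streamflow",
                    start = c(site_indices[i], 1), count = c(1, -1))
  # ncvar_put does NOT re-pack on write, so undo the scale factor before
  # writing to the integer variable; otherwise values read back 100x too small
  ncvar_put(new_nc, new_nc$var$streamflow, flow / scale_factor,
            start = c(i, 1), count = c(1, -1))
}
```

The key point is the asymmetry: the unpacking happens on read only, so real-valued data must be divided back by the scale factor before it is written to the packed integer variable.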

dblodgett-usgs commented 6 years ago

agh. I knew there would be a bug or two in what I did... I'll have a look tonight and get back to you.

aappling-usgs commented 6 years ago

thanks, @dblodgett-usgs ! i hope you're having a good trip

jzwart commented 6 years ago

yes! thanks @dblodgett-usgs !

dblodgett-usgs commented 6 years ago

OK, so there are potentially two issues here. One that is really rough, related to scale factors and ncdf4, and another that I'm not 100% understanding, related to the way dimensions were written (I've not addressed that second issue in this comment).

The issue with ncdf4's handling of scale_factor may go deeper than just this script. When processing all this data, I used this: ncvar_put(new_nc, new_nc$var$streamflow, ncvar_get(nc, nc$var$streamflow)[keep])

If ncvar_get applies scale factor and ncvar_put does not, then (FML) the actual streamflow values get written to an integer netcdf variable -- which will cull precision to the nearest whole CMS!

That is with the first script I ran, which is not in this repo. The next pass, in this repo, would be reading the data and applying the scale factor, so you are getting data adjusted by a scale factor that should not be applied -- I think.

I am rebuilding with the code you checked in now for the sake of doing it. Will look at this a bit more later on and get a PR in for my updated build stuff.

dblodgett-usgs commented 6 years ago

Serious 😿 over here.

I create this netcdf:

netcdf test_scale_factor {
dimensions:
    t = 1 ;
variables:
    int test(t) ;
        test:scale_factor = 0.1f ;
        test:units = "test_units" ;
data:

    test = 1 ;
}

I do:

> nc <- ncdf4::nc_open("test_scale_factor.nc", write = T)
> read_in <- ncdf4::ncvar_get(nc, "test")
> print(read_in)
[1] 0.1

So it applies the scale factor on the way in.
Now I do:

> ncdf4::ncvar_put(nc, "test", read_in)
> nc_close(nc)
> quit()
$ ncdump test_scale_factor.nc 
netcdf test_scale_factor {
dimensions:
    t = 1 ;
variables:
    int test(t) ;
        test:scale_factor = 0.1f ;
        test:units = "test_units" ;
data:

 test = 0 ;
}

Like I said, FML.

If I use: read_in <- ncdf4::ncvar_get(nc, "test", raw_datavals = TRUE) -- I get the raw data and things are 1:1 on round trip. So... this is a problem. I need to get back to Tuscaloosa and rerun the full extract for this stuff.
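For the record, a round trip that stays 1:1 against the test_scale_factor.nc file above — a sketch of the raw_datavals approach, not a drop-in fix for the extract scripts:

```r
library(ncdf4)

nc <- nc_open("test_scale_factor.nc", write = TRUE)
# raw_datavals = TRUE skips the scale_factor unpacking, so we get the
# stored integer (1) rather than the unpacked value (0.1)
raw <- ncvar_get(nc, "test", raw_datavals = TRUE)
# ncvar_put never packs, so writing the raw value back is lossless
ncvar_put(nc, "test", raw)
nc_close(nc)
```

Reading and writing in the same (packed) space sidesteps the asymmetry entirely: no precision is culled because the values never pass through the integer truncation.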

I'm not sure the dimension issue is real with the data I'm working against. Need to understand that better.

aappling-usgs commented 6 years ago

@dblodgett-usgs as noted in Jake's most recent commit message, we're going to merge this and isolate the changes we'd like you to think about next week in a separate, smaller PR.