GEUS-Glaciology-and-Climate / pypromice

Process AWS data from L0 (raw logger) through Lx (end user)
https://pypromice.readthedocs.io
GNU General Public License v2.0
12 stars 4 forks source link

Modify time shift #69

Closed patrickjwright closed 1 year ago

patrickjwright commented 1 year ago

This addresses #68 and also comprehensively addresses the application of time shifting for any situation of file format and logger type. See the docstrings in _addTimeShift for further details.

I am not super familiar with going back and forth between xarray and pandas. You will see that I chose to do the work in _addTimeShift using pandas. Please double check my methods of returning back to xarray and re-assigning the attributes. It is pretty simple, but just not something I have done in the past.

Also, a general check of the logic depending on file format and logger type would be good. I embedded in the code with both tx v2 and tx v3 sample stations to confirm the use of concat and to confirm that the time-shifting is performing as intended.

patrickjwright commented 1 year ago

@PennyHow Wondering if I should remove the doy column from df_out before returning. This was added just to use within the function. But maybe doesn't matter if it goes along for the ride?

PennyHow commented 1 year ago

This is how we have converted pd.DataFrame objects to xr.Dataset objects previously, so that the time variable is correctly assigned as the index.

vals = [xr.DataArray(data=df[c], dims=['time'], coords={'time':df_d.index}, attrs=ds_h[c].attrs) for c in df_d.columns]
ds = xr.Dataset(dict(zip(df.columns,vals)), attrs=ds.attrs)

However, your one-liner could be a better alternative. Do you know if it correctly assigns the index to the time variable?

PennyHow commented 1 year ago

And also, do you intend to pass the instantaneous variables df_i forward with the time-shifted dataset? Would we need a separate, unshifted time variable for this? Is this what time_orig is for?

PennyHow commented 1 year ago

@PennyHow Wondering if I should remove the doy column from df_out before returning. This was added just to use within the function. But maybe doesn't matter if it goes along for the ride?

It doesn't matter as it will be removed in the aws.writeArr step where we drop irrelevant variables before exporting to csv and nc formats.

patrickjwright commented 1 year ago

@PennyHow time_orig is not used. The instantaneous (and GPS) values are passed forward. In each case, everything is concatenated together on the time axis, after shifting only the hourly average values. This results in a final dataframe with only some of the columns shifted.

PennyHow commented 1 year ago

@PennyHow time_orig is not used. The instantaneous (and GPS) values are passed forward. In each case, everything is concatenated together on the time axis, after shifting only the hourly average values. This results in a final dataframe with only some of the columns shifted.

Gotcha. That makes sense.

PennyHow commented 1 year ago

I've just been testing the changes. This line is problematic as some stations do not have the bat_v_ini parameter, producing an error.

https://github.com/GEUS-Glaciology-and-Climate/pypromice/blob/d7230d1967d7362bc347478c5babe00e18538941/src/pypromice/L0toL1.py#L261

Suggested change:

df_i = df.filter(items=i_cols, axis=1)  
try:
     df_i = df_i.drop(columns='batt_v_ini')
except:
     pass

However, I would like to add a column to our variables.csv look-up table that states whether a variable is instantaneous or averaged. Then we can refer to that in this step rather than looking for specific strings in the column names.

PennyHow commented 1 year ago

This is how we have converted pd.DataFrame objects to xr.Dataset objects previously, so that the time variable is correctly assigned as the index.

vals = [xr.DataArray(data=df[c], dims=['time'], coords={'time':df_d.index}, attrs=ds_h[c].attrs) for c in df_d.columns]
ds = xr.Dataset(dict(zip(df.columns,vals)), attrs=ds.attrs)

However, your one-liner could be a better alternative. Do you know if it correctly assigns the index to the time variable?

I just checked this. The time variable is correctly assigned as the index.

patrickjwright commented 1 year ago

@PennyHow see latest commit with following changes:

patrickjwright commented 1 year ago

Also, it turns out batt_v_ini is instantaneous, so dropping it from df_i would have been wrong in the first place! Any batt voltage or fan current is an instantaneous sample. Regardless, we now should be treating each variable as defined in variables.csv.