Modify time shift - Githubissues

GEUS-Glaciology-and-Climate / pypromice

Process AWS data from L0 (raw logger) through Lx (end user)

https://pypromice.readthedocs.io

GNU General Public License v2.0

12 stars 4 forks source link

Modify time shift #69

Closed patrickjwright closed 1 year ago

patrickjwright commented 1 year ago

This addresses #68 and also comprehensively addresses the application of time shifting for any situation of file format and logger type. See the docstrings in _addTimeShift for further details.

I am not super familiar with going back and forth between xarray and pandas. You will see that I chose to do the work in _addTimeShift using pandas. Please double check my methods of returning back to xarray and re-assigning the attributes. It is pretty simple, but just not something I have done in the past.

Also, a general check of the logic depending on file format and logger type would be good. I embedded in the code with both tx v2 and tx v3 sample stations to confirm the use of concat and to confirm that the time-shifting is performing as intended.

patrickjwright commented 1 year ago

@PennyHow Wondering if I should remove the doy column from df_out before returning. This was added just to use within the function. But maybe doesn't matter if it goes along for the ride?

PennyHow commented 1 year ago

This is how we have converted pd.DataFrame objects to xr.Dataset objects previously, so that the time variable is correctly assigned as the index.

vals = [xr.DataArray(data=df[c], dims=['time'], coords={'time':df_d.index}, attrs=ds_h[c].attrs) for c in df_d.columns]
ds = xr.Dataset(dict(zip(df.columns,vals)), attrs=ds.attrs)

However, your one-liner could be a better alternative. Do you know if it correctly assigns the index to the time variable?

PennyHow commented 1 year ago

And also, do you intend to pass the instantaneous variables df_i forward with the time-shifted dataset? Would we need a separate, unshifted time variable for this? Is this what time_orig is for?

PennyHow commented 1 year ago

@PennyHow Wondering if I should remove the doy column from df_out before returning. This was added just to use within the function. But maybe doesn't matter if it goes along for the ride?

It doesn't matter as it will be removed in the aws.writeArr step where we drop irrelevant variables before exporting to csv and nc formats.

patrickjwright commented 1 year ago

@PennyHow time_orig is not used. The instantaneous (and GPS) values are passed forward. In each case, everything is concatenated together on the time axis, after shifting only the hourly average values. This results in a final dataframe with only some of the columns shifted.

PennyHow commented 1 year ago

@PennyHow time_orig is not used. The instantaneous (and GPS) values are passed forward. In each case, everything is concatenated together on the time axis, after shifting only the hourly average values. This results in a final dataframe with only some of the columns shifted.

Gotcha. That makes sense.

PennyHow commented 1 year ago

I've just been testing the changes. This line is problematic as some stations do not have the bat_v_ini parameter, producing an error.

https://github.com/GEUS-Glaciology-and-Climate/pypromice/blob/d7230d1967d7362bc347478c5babe00e18538941/src/pypromice/L0toL1.py#L261

Suggested change:

df_i = df.filter(items=i_cols, axis=1)  
try:
     df_i = df_i.drop(columns='batt_v_ini')
except:
     pass

However, I would like to add a column to our variables.csv look-up table that states whether a variable is instantaneous or averaged. Then we can refer to that in this step rather than looking for specific strings in the column names.

PennyHow commented 1 year ago

This is how we have converted pd.DataFrame objects to xr.Dataset objects previously, so that the time variable is correctly assigned as the index.
vals = [xr.DataArray(data=df[c], dims=['time'], coords={'time':df_d.index}, attrs=ds_h[c].attrs) for c in df_d.columns]
ds = xr.Dataset(dict(zip(df.columns,vals)), attrs=ds.attrs)
However, your one-liner could be a better alternative. Do you know if it correctly assigns the index to the time variable?

I just checked this. The time variable is correctly assigned as the index.

patrickjwright commented 1 year ago

@PennyHow see latest commit with following changes:

I added an instantaneous_hourly boolean column to variables.csv. Added as the next-to-last field to retain comments as the last field. I added an entry for every variable, where True indicates the variable is instantaneous at hourly sampling, and False indicates it is an hourly average. This specifically refers to hourly sampling (i.e. t_u is considered instantaneous at 10-minute samples, but is an average in the hourly STM or TX files). I got copies of the logger data table code for both v2 and v3 stations from Jakob in the workshop to determine if each variable uses the CRBasic Average() or Sample() to insert into the DataTable60min.
In _addTimeShift, df_u renamed to df_a, to better represent that this is the dataframe for averaged variables.
Added a for loop to add variable-specific attributes to the final ds_out xarray Dataset. I compared with your method from above to convert pd dataframe back to xarray, and this was the only difference. With adding the variable-specific attributes I tested that the two methods produce equivalent xarray datasets using xarray.Dataset.equals(). I think my new method using df_out.to_xarray() is a bit cleaner and probably more performant, even with this added for loop. The alternative method is included commented out.
modified the docstring comments

patrickjwright commented 1 year ago

Also, it turns out batt_v_ini is instantaneous, so dropping it from df_i would have been wrong in the first place! Any batt voltage or fan current is an instantaneous sample. Regardless, we now should be treating each variable as defined in variables.csv.