Closed patrickjwright closed 1 year ago
@PennyHow Wondering if I should remove the `doy` column from `df_out` before returning. This was added just to use within the function. But maybe doesn't matter if it goes along for the ride?
This is how we have converted `pd.DataFrame` objects to `xr.Dataset` objects previously, so that the `time` variable is correctly assigned as the index.
vals = [xr.DataArray(data=df[c], dims=['time'], coords={'time':df_d.index}, attrs=ds_h[c].attrs) for c in df_d.columns]
ds = xr.Dataset(dict(zip(df.columns,vals)), attrs=ds.attrs)
However, your one-liner could be a better alternative. Do you know if it correctly assigns the index to the `time` variable?
And also, do you intend to pass the instantaneous variables `df_i` forward with the time-shifted dataset? Would we need a separate, unshifted `time` variable for this? Is this what `time_orig` is for?
> @PennyHow Wondering if I should remove the `doy` column from `df_out` before returning. This was added just to use within the function. But maybe doesn't matter if it goes along for the ride?

It doesn't matter, as it will be removed in the `aws.writeArr` step, where we drop irrelevant variables before exporting to `csv` and `nc` formats.
@PennyHow `time_orig` is not used. The instantaneous (and GPS) values are passed forward. In each case, everything is concatenated together on the time axis, after shifting only the hourly average values. This results in a final dataframe with only some of the columns shifted.
Gotcha. That makes sense.
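For reference, the approach described above (shift only the hourly averages, then concatenate everything on the time axis) might look roughly like this in pandas. This is a minimal sketch and not the actual `_addTimeShift` code: the function name `shift_averaged`, the column lists, the sample values, and the direction and size of the shift (one hour back) are all assumptions for illustration.

```python
import pandas as pd

def shift_averaged(df, averaged_cols, instantaneous_cols,
                   offset=pd.Timedelta(hours=1)):
    """Shift only the averaged columns by `offset`, then rejoin on time.

    Hypothetical helper: the shift direction (back by one hour) is an
    assumption, not taken from the real implementation.
    """
    df_a = df[averaged_cols].copy()
    df_a.index = df_a.index - offset   # move averaged values back in time
    df_i = df[instantaneous_cols]      # instantaneous values stay where they are
    # Outer join on the time index: rows that exist in only one frame get NaN.
    return pd.concat([df_a, df_i], axis=1).sort_index()

# Made-up hourly data with one averaged and one instantaneous column.
times = pd.Index(
    pd.to_datetime(["2023-01-01 01:00", "2023-01-01 02:00", "2023-01-01 03:00"]),
    name="time",
)
df = pd.DataFrame({"t_u": [1.0, 2.0, 3.0], "batt_v": [12.1, 12.2, 12.3]}, index=times)

df_out = shift_averaged(df, averaged_cols=["t_u"], instantaneous_cols=["batt_v"])
```

The result has one more row than the input, since the shifted and unshifted columns no longer cover exactly the same timestamps. That is the "only some of the columns shifted" dataframe mentioned above.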
I've just been testing the changes. This line is problematic, as some stations do not have the `batt_v_ini` parameter, which produces an error.
Suggested change:
df_i = df.filter(items=i_cols, axis=1)
try:
    df_i = df_i.drop(columns='batt_v_ini')
except KeyError:
    # Station has no batt_v_ini column, so there is nothing to drop
    pass
However, I would like to add a column to our `variables.csv` look-up table that states whether a variable is instantaneous or averaged. Then we can refer to that in this step rather than looking for specific strings in the column names.
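A rough sketch of what the look-up approach could look like, in case it helps the discussion. The table excerpt, the column name `instantaneous`, and the helper `split_by_lookup` are hypothetical and only illustrate the idea; the real `variables.csv` layout may differ.

```python
import io
import pandas as pd

# Hypothetical excerpt of a variables.csv-style look-up table. The column
# name "instantaneous" and the rows below are assumptions for illustration.
LOOKUP_CSV = """field,instantaneous
t_u,False
batt_v_ini,True
"""

def split_by_lookup(df, lookup):
    """Return (df_i, df_a): the instantaneous and averaged columns of df."""
    instantaneous = set(lookup[lookup["instantaneous"]].index)
    averaged = set(lookup[~lookup["instantaneous"]].index)
    # Select by membership, so a column missing from df is simply skipped.
    i_cols = [c for c in df.columns if c in instantaneous]
    a_cols = [c for c in df.columns if c in averaged]
    return df[i_cols], df[a_cols]

lookup = pd.read_csv(io.StringIO(LOOKUP_CSV), index_col="field")

# One station that has batt_v_ini, and one that does not.
df_with = pd.DataFrame({"t_u": [1.0, 2.0], "batt_v_ini": [12.1, 12.2]})
df_without = pd.DataFrame({"t_u": [1.0, 2.0]})

df_i, df_a = split_by_lookup(df_with, lookup)
df_i2, df_a2 = split_by_lookup(df_without, lookup)  # no error; df_i2 has no columns
```

Because the columns are chosen by membership in the table rather than by name matching, a station without a given variable needs no `try`/`except` at all.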
> This is how we have converted `pd.DataFrame` objects to `xr.Dataset` objects previously, so that the `time` variable is correctly assigned as the index.
>
> vals = [xr.DataArray(data=df[c], dims=['time'], coords={'time':df_d.index}, attrs=ds_h[c].attrs) for c in df_d.columns]
> ds = xr.Dataset(dict(zip(df.columns,vals)), attrs=ds.attrs)
>
> However, your one-liner could be a better alternative. Do you know if it correctly assigns the index to the `time` variable?

I just checked this. The `time` variable is correctly assigned as the index.
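A quick way to reproduce that check, as a minimal sketch with made-up data (the column name and values are placeholders). `DataFrame.to_xarray()` uses the name of the dataframe index as the dimension name, so the index needs to be named `time` before converting.

```python
import pandas as pd
import xarray as xr  # DataFrame.to_xarray() needs xarray installed

# Made-up dataframe whose index is named "time".
times = pd.Index(
    pd.to_datetime(["2023-01-01 00:00", "2023-01-01 01:00"]),
    name="time",
)
df_out = pd.DataFrame({"t_u": [1.0, 2.0]}, index=times)

ds_out = df_out.to_xarray()

# The index becomes the "time" dimension coordinate of every data variable.
time_is_dim = ds_out["t_u"].dims == ("time",)
time_is_coord = "time" in ds_out.coords
```

If the index were left unnamed, the resulting dimension would be called `index` instead, so naming it is the one thing to watch with the one-liner.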
@PennyHow see the latest commit, with the following changes:

- Added an `instantaneous_hourly` boolean column to `variables.csv`. It is the next-to-last field, so that `comments` remains the last field. I added an entry for every variable, where `True` indicates the variable is instantaneous at hourly sampling, and `False` indicates it is an hourly average. This specifically refers to hourly sampling (i.e. `t_u` is considered instantaneous at 10-minute samples, but is an average in the hourly STM or TX files). I got copies of the logger data table code for both v2 and v3 stations from Jakob in the workshop to determine whether each variable uses the CRBasic `Average()` or `Sample()` to insert into the `DataTable60min`.
- In `_addTimeShift`, `df_u` is renamed to `df_a`, to better represent that this is the dataframe for averaged variables.
- Added a `for` loop to add variable-specific attributes to the final `ds_out` xarray Dataset. I compared with your method from above for converting the pandas dataframe back to xarray, and this was the only difference. After adding the variable-specific attributes, I tested that the two methods produce equivalent xarray datasets using `xarray.Dataset.equals()`. I think my new method using `df_out.to_xarray()` is a bit cleaner and probably more performant, even with this added `for` loop. The alternative method is included, commented out.

Also, it turns out `batt_v_ini` is instantaneous, so dropping it from `df_i` would have been wrong in the first place! Any batt voltage or fan current is an instantaneous sample. Regardless, we should now be treating each variable as defined in `variables.csv`.
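For anyone wanting to repeat the comparison of the two conversion methods, a minimal sketch follows. The data, the attribute dictionaries (`var_attrs`, `global_attrs`), and the variable names are made up for illustration; in the real code the attributes come from the existing datasets. Note that `Dataset.equals()` compares dimensions, coordinates, and values but not attributes, while `Dataset.identical()` also compares attributes.

```python
import pandas as pd
import xarray as xr

# Made-up data and attributes (placeholders, not the real station metadata).
times = pd.Index(
    pd.to_datetime(["2023-01-01 00:00", "2023-01-01 01:00"]),
    name="time",
)
df_out = pd.DataFrame({"t_u": [1.0, 2.0], "batt_v": [12.1, 12.2]}, index=times)
var_attrs = {"t_u": {"units": "degC"}, "batt_v": {"units": "V"}}
global_attrs = {"station_id": "EXAMPLE"}

# Method 1: one-liner conversion, then a loop to attach per-variable attributes.
ds_a = df_out.to_xarray()
for name in ds_a.data_vars:
    ds_a[name].attrs.update(var_attrs[name])
ds_a.attrs.update(global_attrs)

# Method 2: build one DataArray per column, then assemble the Dataset.
vals = [
    xr.DataArray(
        data=df_out[c].values,
        dims=["time"],
        coords={"time": df_out.index},
        attrs=var_attrs[c],
    )
    for c in df_out.columns
]
ds_b = xr.Dataset(dict(zip(df_out.columns, vals)), attrs=global_attrs)

same_values = ds_a.equals(ds_b)        # ignores attributes
same_everything = ds_a.identical(ds_b) # also compares attributes
```

Since `equals()` alone would pass even without the attribute loop, checking `identical()` or inspecting a few `attrs` directly is a useful extra step when the attributes are the point of the comparison.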
This addresses #68 and also comprehensively addresses the application of time shifting for any situation of file format and logger type. See the docstrings in `_addTimeShift` for further details.

I am not super familiar with going back and forth between xarray and pandas. You will see that I chose to do the work in `_addTimeShift` using pandas. Please double check my methods of returning back to xarray and re-assigning the attributes. It is pretty simple, but just not something I have done in the past.

Also, a general check of the logic depending on file format and logger type would be good. I embedded in the code with both tx v2 and tx v3 sample stations to confirm the use of `concat` and to confirm that the time-shifting is performing as intended.