COMPASS-DOE / sensor-data-pipeline

Sensor data workflows and processing scripts
MIT License

Incorporating non-datalogger data into processing pipeline #213

Open bpbond opened 1 month ago

bpbond commented 1 month ago

@Fausto2504 @nickdward request:

Ben and Steph, for the AquaTROLLs specifically, we have had issues in the past with data not migrating from the troll to the Campbell logger. This is also the case for some of our YSI EXO sondes in the surface water. In these cases, the data is generally stored internally on the sondes and we can recover it.

So in some cases we have a complete dataset downloaded from the sonde, but the L1 automated pipeline would not be retrieving this data. Do we want to address this at some stage in the pipeline, or would that perhaps be considered L2 data? In other words, at what stage of the pipeline would it be desirable to say “looks like loggernet missed all of 2022, let’s replace it with the data downloaded directly from the sonde”?

I want to think through in detail how this would work. As an example, here are the fields in a datalogger "WaterLevel600A" file versus the sample raw file:

| Datalogger file | AquaTROLL raw file |
| --- | --- |
| TIMESTAMP | Date Time |
| Aquatroll_IDA(1) | Device Id (in file header) |
| Barometric_Pressure600A | 16: Barometer (16) mmHg (22) |
| Temperature600A | 13: Temperature (1) °C (1) |
| Actual_Conductivity600A | 7: Actual Conductivity (9) µS/cm (65) |
| Specific_Conductivity600A | 8: Specific Conductivity (10) µS/cm (65) |
| Salinity600A | 9: Salinity (12) PSU (97) |
| TDS600A | 12: Total Dissolved Solids (13) ppt (114) |
| Water_Density600A | 11: Water Density (14) g/cm³ (129) |
| Resistivity600A | 10: Resistivity (11) ohm-cm (81) |
| pH600A | 1: pH (17) pH (145) |
| pH_mV600A | 2: pH mV (18) mV (162) |
| pH_ORP600A | 3: ORP mV (19) mV (162) |
| RDO_concen600A | 4: Dissolved Oxygen (20) mg/L (117) |
| RDO_perc_sat600A | 5: Dissolved Oxygen (21) %sat (177) |
| RDO_part_Pressure600A | 6: Partial Pressure Oxygen (30) Torr (26) |
| Pressure600A | 17: Pressure (2) psi (17) |
| Depth600A | 18: Depth (3) cm (34) |
| Voltage_Ext600A | 14: External (32) V (163) |
| Battery_Int600A | 15: Battery (33) % (241) |

bpbond commented 1 month ago

Minor problem: timestamps

The datalogger (TIMESTAMP) and instrument (Date Time) timestamps won't be the same. We can't do anything about that, though.

Minor problem: flagging origin

Do we want a new F_DLG L1 flag indicating the origin of the data?

Major problem: ID matching

We need to match Device S/N = xxxxxxx in the Aquatroll600 file header to Compass_CRC_UP_303 or whatever, i.e. the datalogger site and plot code. This is easy to do but fragile: any change on either side breaks the match. [edit: see @Fausto2504's comment below re: the Google Sheet.]

Major problem: name matching

We need to match column names from the raw instrument files to those used by @roylrich's logger code. I can think of two ways to do this:

[image: unnamed]
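For discussion, here is a minimal sketch of one possible approach to the name matching: an explicit lookup table keyed on parameter name and unit. It assumes the raw-file headers look exactly like the entries in the table above (`<n>: <name> (<id>) <unit> (<id>)`), which should be checked against real files. `HEADER_RE`, `RAW_TO_LOGGER`, and `rename_columns` are hypothetical names, not part of the existing logger code, and only a subset of columns is shown.

```python
import re

# Assumed header layout, taken from the table above:
#   "<n>: <name> (<id>) <unit> (<id>)", e.g. "16: Barometer (16) mmHg (22)"
HEADER_RE = re.compile(
    r"^\d+:\s*(?P<name>.+?)\s*\(\d+\)\s*(?P<unit>\S+)\s*\(\d+\)$"
)

# (parameter name, unit) -> datalogger column name.
# The unit is part of the key because "Dissolved Oxygen" appears twice.
# Subset only, for illustration.
RAW_TO_LOGGER = {
    ("Barometer", "mmHg"): "Barometric_Pressure600A",
    ("Temperature", "°C"): "Temperature600A",
    ("Dissolved Oxygen", "mg/L"): "RDO_concen600A",
    ("Dissolved Oxygen", "%sat"): "RDO_perc_sat600A",
    ("Depth", "cm"): "Depth600A",
}


def rename_columns(raw_headers):
    """Return (renamed, unmatched) for a list of raw-file column headers.

    renamed maps each recognized raw header to its datalogger name;
    unmatched lists headers that need a human decision.
    """
    renamed, unmatched = {}, []
    for header in raw_headers:
        match = HEADER_RE.match(header.strip())
        key = (match.group("name"), match.group("unit")) if match else None
        if key in RAW_TO_LOGGER:
            renamed[header] = RAW_TO_LOGGER[key]
        else:
            unmatched.append(header)
    return renamed, unmatched
```

Returning the unmatched headers instead of silently dropping them means a firmware or export-format change shows up as a visible gap rather than as missing data. "Date Time" is deliberately left unmatched here, since the timestamp question is separate.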

roylrich commented 1 month ago

@bpbond, I would like to talk about this. The date/time for AQ units should come in as a separate variable from the timestamp when connected. The issue, as I understand it, is backfilling when we only have AQ data. Maybe we can calculate an offset between the two that creates the definitive timestamp (or a new definitive timestamp) for the dataset: fill from the Loggernet timestamp unless it is missing, then use the other source plus the offset? We have this issue for EXO sondes and other instruments, so it would be worth getting a common strategy. My hunch is that we want to do it in the L1 processing pipeline, but not having a Loggernet timestamp will be an issue for naming and checks.
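A rough sketch of the offset idea, to make the discussion concrete. It assumes the logger/instrument offset is roughly constant and estimates it as the median difference over records where both timestamps exist; that assumption is untested. The function name, the record layout, and the `"logger"`/`"instrument"` origin labels are all hypothetical.

```python
from datetime import datetime, timedelta
from statistics import median


def definitive_timestamps(records):
    """Fill a definitive timestamp for each record.

    records: list of dicts with keys
      'logger_ts'     -> datetime, or None when the logger missed the record
      'instrument_ts' -> datetime from the instrument's internal log
    Returns a list of (timestamp, origin) tuples, one per record.
    """
    # Offset (in seconds) wherever both clocks recorded the same sample.
    offsets = [
        (r["logger_ts"] - r["instrument_ts"]).total_seconds()
        for r in records
        if r["logger_ts"] is not None
    ]
    if not offsets:
        # No overlap at all: the offset cannot be estimated from this file.
        raise ValueError("no records with both timestamps; cannot estimate offset")
    offset = timedelta(seconds=median(offsets))

    filled = []
    for r in records:
        if r["logger_ts"] is not None:
            filled.append((r["logger_ts"], "logger"))
        else:
            # Backfill: instrument clock shifted onto the logger clock.
            filled.append((r["instrument_ts"] + offset, "instrument"))
    return filled
```

The origin label returned alongside each timestamp could feed a data-origin flag. The case with no overlapping records raises an error on purpose, since that is exactly the situation where naming and checks become a problem.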

bpbond commented 1 month ago

Thanks, Roy, and I agree it would be good for you and me and @stephpenn1 to discuss.

Fausto2504 commented 1 month ago

@bpbond Are we "pre-processing" the internal data into the format we want, or gathering ideas on how to code the matching of columns that have different names?

I think a flag of file origin is a good idea!

Not sure yet how to do the ID matching. One idea is to use the device serial number. In row 6 of the internal data we find "Device S/N = 848067" (note for coding: we have a unit with a 7-digit number). We have a spreadsheet, used for flagging when a unit is out of water, that relates serial number to site and zone (location): https://docs.google.com/spreadsheets/d/1XbIt8gsOWaLBmpzUmnupKzt92GocHBMooD-MNyvijIY/edit?gid=0#gid=0

bpbond commented 1 month ago

@Fausto2504 ah THANK YOU, I had forgotten about that spreadsheet.

That is exactly what we will need! Serial number -> Site, plot, and troll A/B/C.
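A minimal sketch of how that lookup might work, assuming the header line has the form `Device S/N = <digits>` as quoted above. The serial is searched for across all header lines rather than hard-coded to row 6, and it is kept as a string so 6- and 7-digit units are handled alike. `find_serial`, `locate`, and the lookup layout are hypothetical; the real mapping would be read from the spreadsheet.

```python
import re

# Matches e.g. "Device S/N = 848067"; any number of digits is accepted.
SERIAL_RE = re.compile(r"Device S/N\s*=\s*(\d+)")


def find_serial(header_lines):
    """Return the serial number (as a string) from the file header, or None."""
    for line in header_lines:
        match = SERIAL_RE.search(line)
        if match:
            return match.group(1)
    return None


def locate(header_lines, lookup):
    """Map a raw file's header to its site, plot, and troll entry.

    lookup: dict of serial number (str) -> location record, e.g. built
    from the serial-number spreadsheet. Fails loudly on unknown serials
    so that a new or swapped instrument is not silently misassigned.
    """
    serial = find_serial(header_lines)
    if serial is None:
        raise ValueError("no 'Device S/N' line found in file header")
    if serial not in lookup:
        raise KeyError(f"serial {serial} not in lookup table")
    return lookup[serial]
```

Usage, with placeholder location values rather than real ones:

```python
lookup = {"848067": {"site": "SITE", "plot": "PLOT", "troll": "A"}}
locate(["Device S/N = 848067"], lookup)
```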

bpbond commented 1 month ago

Summary: