aodn / imos-toolbox

Graphical tool for QC'ing and NetCDF'ing oceanographic datasets
GNU General Public License v3.0

ADCP data handling - faster visualisation or ADCP ensemble averaging for single ping datasets #654

Open BecCowley opened 4 years ago

BecCowley commented 4 years ago

After some research, I think we need a pre-processing tool to ensemble average single-ping data on import to the toolbox. Below is documentation of what I discovered (with some help from @sspagnol and David Williams). I am happy to contribute to the code development, but it would probably be best written by @ocehugo & team. And I need it soon if possible!


Investigation of using RDI tools to average: From David Williams (edited by me to remove irrelevant bits):

Hi Bec and Simon, attached is an attempt to do what you wanted. It was somewhat automated, but Velocity doesn't allow batch processing, so that killed that part.

Anyway this is what I ended up doing:

  • Trimmed off the bad data ensembles at the beginning and end of the file.
  • Made a batch file that ran the file through BBSUB (no filter) and chopped it up into sequential files of 25000 ensembles each (so that ended up being 24 files). 25000 ensembles was the number I computed a Velocity screen could hold.

That wasn't so onerous and reasonably fast.

Then I had to load all the files into Velocity. Now this is the crazy bit. Velocity allows you to load all the files, you see the tabs for the files across the top of the screen. You can then select the cog wheels on the top right of the screen and select averaging and then 60 mins and apply all. It then averages ALL the loaded files. The onerous part is that you then have to export them one at a time. However that didn't take that long.

Then I did a copy /b filename??.pd0 csiro_file.pd0 (where ?? was 1-24), then loaded the result into Velocity and it seems OK.

This file is therefore now in earth co-ordinates. Reading through the velocity manual it says it only applies cell (bin??) mapping to flat files, that is no pitch and roll. So it's pretty useless. <EDIT HERE: if we look at the RDI commands manual, the same phrase is in that book, but it refers to Cell 1. My interpretation is that the manual has been badly written and bin mapping does account for pitch and roll in every cell except bin 1>

I think you'd be better off getting your IMOS programmers to read the binary, average it and do the beam to ENU conversion and then the bin mapping. Those routines are around for matlab, just need to find them. Cheers David

PS BBSUB batch file

  set /a Last = 605583
  set /a StartEns = 1
  set /a EndEns = 25000
  set /a vers=1
  :sub
  BBSub.exe -in:22801001_sub0c.pd0 -out:bbcheck_sub%vers%.pd0 -start:%StartEns% -end:%EndEns%
  set /a StartEns = EndEns + 1
  set /a EndEns = EndEns + 25000
  set /a vers=vers+1
  if %EndEns% GTR %Last% goto end
  goto sub
  :end


Now looking at how other groups average single-ping data.

Probably, the simplest thing to do at this stage would be to use the UWA tools which are in matlab and quite straightforward and clearly described. They have tried to duplicate the on-board processing done by the RDI ADCPs when they ensemble average during deployment (including fish thresholds, EA thresholds etc). The code I have from UWA was sent to me via @sspagnol and possibly there are more recent versions we can ask them for. I also have the Scripps code, again in matlab.

ocehugo commented 4 years ago

Hi @BecCowley,

Could you please provide examples/cases of the functionality? For example, why do you need to average the pings before import!? Is this averaging required for the current pre-processing (conversions, adjustments, etc.) requirements?

My first impression is that this would be a post-processing step, no!?

PS: AFAIK, the ADCPy & dolfyn packages allow only simple binning. I don't know what UWA uses; it would be nice to have a peek at the code. I/we also need to understand the full requirements for other users, conventions, etc.

BecCowley commented 4 years ago

Hi @ocehugo, The toolbox is far too slow when single ping data is loaded. I can be waiting for up to an hour for a figure to re-plot when changing view in the interface. And that is on a linux box. Averaging is required simply for practicality. Also, historically, all the deep water moorings data has been supplied in hourly ensemble averages. We would like to provide the single-ping data as hourly averaged data to be consistent. If required, we can provide the single ping data in the FV00 files. The averaging has to be done prior to import for the toolbox to handle it. Once the QC thresholds have been obtained, we could retrospectively apply them to the single ping data.

How the data is supplied is up for discussion - maybe we provide two QC'd versions, one in single ping and one in hourly averaged.

I can send you the UWA code I have if you like?

ocehugo commented 4 years ago

Could you please share an example of a very slow file and describe the workflow/steps where it is slow!? Is it slow only for displaying, or for processing too?

IMO, we could try to optimise this first before changing the data workflow. I still think this should be a post-processing task, even if we have to create it/move it from pre-processing.

AFAIK we need to keep the raw pings in FV00 - that's why I consider the averaging a post-processing thing, even though some data shifting is done at import level.

If we can't optimise much, maybe we could average on the fly at plot time (via an option), via an exclusive button on the main toolbox window, or add something like a windowed plot around large variances (e.g. manual QC flags), or some similar heuristic. Another option is to load the slow thing and generate an averaged file that can be loaded (say FV00ba), which is similar to what the WQM people wish for.

Is the UWA code doing anything else besides binned averaging that we would need!?

BecCowley commented 4 years ago

Hi Hugo, happy to share. On some moorings, there are 3 single-ping ADCP instruments. Plus, all the other seabirds and aquadopps. Here is an example from the 3200m mooring:

Depth (m) Instrument Serial No
40 SBE37 9340
60 SBE39(T&P) 8436
80 SBE37 9897
100 SBE39 T 3916
120 BB150 (up) 17959
120 LR75(down) 14292
135 SBE37 9907
170 Star Oddi 4008
200 SBE37 11401
250 SBE39 T 4169
300 SBE37 9184
350 SBE39 T 4326
400 Star Oddi 4010
485 SBE37 15132
610 SBE37 15883
650 LR75(down) 24475
800 SBE39(T&P) 8516
1000 SBE39 T 4171
1190 SBE37 9187
1195 Aquadopp AQD-5938
1300 Star Oddi 4011
1500 Aquadopp AQD-5976
1500 SBE39(T&P) 8517
1750 Star Oddi 4012
2000 Aquadopp AQD-5928
2005 SBE37 13082
3080 Aquadopp AQD-5878
3080 SBE37 7897

If I load up the entire mooring, it is impossible to work the toolbox. I must admit, I haven't tried this again for a while as I was so discouraged the first time. My recollection is that when I tried to QC the ADCP data, I couldn't assess the toolbox adequately because of Issue #636 - If I zoom into an area of the ADCP data, then decide I wanted to see a plot that wasn't there, the toolbox slows as it has to re-draw the entire plot again. Then I have to re-zoom to where I was (again, taking forever).

I think the problem is the way the toolbox is rendering these huge files in the plots. I don't know if having two other ADCPs and all the other instruments makes a difference.

Would you like me to share the data from an entire mooring (and the associated csv files)?

As for UWA, I have only focussed on the binning step. Some other things to consider: Our data is currently collected in earth coordinates. I think they collect in beam coordinates and do the transformation. They also apply the fish threshold and a couple of other thresholds before coordinate transformation and binning. These steps are performed by RDI when binning is done by the instrument. We may want to consider allowing the user to set these on import.

ocehugo commented 4 years ago

> Hi Hugo, happy to share. On some moorings, there are 3 single-ping ADCP instruments. Plus, all the other seabirds and aquadopps. Here is an example from the 3200m mooring:

That's a long beast :)

> I couldn't assess the toolbox adequately because of Issue #636 - If I zoom into an area of the ADCP data, then decide I wanted to see a plot that wasn't there, the toolbox slows as it has to re-draw the entire plot again. Then I have to re-zoom to where I was (again, taking forever). I think the problem is the way the toolbox is rendering these huge files in the plots. I don't know if having two other ADCPs and all the other instruments makes a difference.

Yep, the repeated/slow drawing is a problem - so I assume it's only the display/plots that are a nuisance, correct!? I reckon we have two options here: (a) only render/draw what the user is seeing, and/or (b) cache the first draw and re-use it. But I haven't looked into it yet.
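Option (a) is essentially what min/max decimation gives: reduce each long series to roughly the number of on-screen pixels while keeping spikes visible, and re-decimate on zoom. A rough sketch of the idea (Python rather than the toolbox's MATLAB; function name and bucket count are illustrative, not toolbox code):

```python
import numpy as np

def decimate_minmax(t, v, max_points=2000):
    """Reduce a long series to at most ~2*max_points samples for plotting,
    keeping the min and max of each bucket so spikes stay visible."""
    n = len(v)
    if n <= 2 * max_points:
        return t, v
    # Split the series into max_points roughly equal buckets.
    edges = np.linspace(0, n, max_points + 1, dtype=int)
    tt, vv = [], []
    for a, b in zip(edges[:-1], edges[1:]):
        chunk = v[a:b]
        lo, hi = int(np.argmin(chunk)), int(np.argmax(chunk))
        # Keep the two extremes of each bucket, in time order.
        for i in sorted((lo, hi)):
            tt.append(t[a + i])
            vv.append(chunk[i])
    return np.array(tt), np.array(vv)
```

On zoom, the same function would be re-run over only the visible index range, so the plotted point count stays bounded regardless of file size.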

> Would you like me to share the data from an entire mooring (and the associated csv files)?

This would be perfect! I don't have such a dense example here and it would be very helpful - I'm already thinking it would be a great template to build on for even easier testing here.

> As for UWA, I have only focussed on the binning step. Some other things to consider: Our data is currently collected in earth coordinates. I think they collect in beam coordinates and do the transformation. They also apply the fish threshold and a couple of other thresholds before coordinate transformation and binning. These steps are performed by RDI when binning is done by the instrument. We may want to consider allowing the user to set these on import.

Hmm, so they are probably doing more checks, maybe because of internal-wave scenarios!? I think having a peek at the code would be nice, as well as a quick chat (or docs) about what is being done. Both would be a good start to check what we are skipping/missing (and a recap for me). I'm not well versed in what RDI does. The only thing that stays in the back of my mind is the possible downstream impact on the current processing workflow with different thresholding.

BecCowley commented 4 years ago

@ocehugo, I will send you a link to our data for you to test. And the UWA code. We can work through the pre-processing parts together - we turn off a lot of the RDI thresholds anyway, but they should be coded in somewhere. There is a definite process that the RDI's follow when in on-board averaging mode, and it's just a matter of making sure we get it all covered for the single-ping mode.

BecCowley commented 4 years ago

@ocehugo, I need to get our EAC ADCP data QC'd in the next week. So, I'll make my own efforts in matlab to do that. If you manage to get to this issue soon, let me know.

It seems that your thoughts on QC'ing first, then averaging, fit with what I've discovered below. I'm documenting the following information for anyone who is interested and so it isn't lost.

Reviewing the information available, here are my notes on how UWA perform QC of their single-ping data. The process matches with the RDI documentation 'ADCP Coordinate Transformation. Formulas and Calculations' (adcp+coordinate+transformation_Jan10.pdf).

The UWA code is located here: https://bitbucket.org/uwaoceandynamics/adcp-rdi/src/master/

UWA single ping ADCP data processing steps. Data collected in beam coordinates and single ping setup. All thresholds are turned off on the instrument during collection so the ADCP does no QC on board. For the CSIRO instruments, the only difference is we are collecting the data in ENU coordinates, so the bin mapping, 3-beam solutions and beam to ENU coordinates are done on board the ADCP.

adcprdi.m - first step, reading the *.000 files. QC steps completed:

  1. fish detection (adcpfishdetection.m)
  2. correlation qc < threshold
  3. percent good qc < threshold (only for data collected in beam coords). No percent good for ENU data
  4. At this stage, they ask if they should be doing a low echo intensity QC, but it’s not implemented

Error velocity: In beam coordinates, error velocity is set to zero as it hasn’t been calculated until the conversion to ENU

Perform bin mapping and 3-beam solutions. Then convert to ENU data (derives east, north, up and error velocities)
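The screening steps above (correlation and percent-good thresholds) amount to building a bad-sample mask per ping and bin. A minimal sketch, in Python rather than the UWA MATLAB; the threshold values are illustrative assumptions, not UWA or RDI defaults:

```python
import numpy as np

# Illustrative thresholds -- assumptions, not the UWA/RDI defaults.
CORR_MIN = 64     # correlation counts; below this the sample fails
PGOOD_MIN = 50.0  # percent good; below this fails (beam data only)

def single_ping_qc_mask(corr, pgood=None):
    """Return a boolean mask (True = bad) over a pings-x-bins array,
    combining correlation and optional percent-good screening."""
    bad = corr < CORR_MIN
    if pgood is not None:  # no percent-good test for ENU-collected data
        bad = bad | (pgood < PGOOD_MIN)
    return bad

# Failed samples are set to NaN so they drop out of later averaging.
velocity = np.array([[0.1, 0.2], [0.3, 0.4]])
corr = np.array([[120, 40], [90, 130]])
velocity[single_ping_qc_mask(corr)] = np.nan
```

Fish detection (adcpfishdetection.m) would add a further term to the same mask before the beam-to-ENU conversion.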

adcp_postclean.m - data is now in ENU coordinates (bin mapped) and single-ping format, equivalent to what CSIRO has collected. The two steps that haven't yet been performed for the CSIRO data, and that will need to be done to be equivalent to UWA at this stage, are:

  1. fish detection
  2. correlation qc < threshold

Tests/steps done now in UWA:

  1. Echo amplitude surface test OR use pressure to cut the surface bins
  2. Add the magnetic declination
  3. Error velocity > threshold
  4. Pitch & roll > threshold
adcp_avg.m - QC'd data is now averaged, into time bins and z bins if required.

Binning process is done over multiple deployments and time gaps are filled with NaNs to an even grid (fill_ts_adcp.m, filldate_adcp.m and join_ts_adcp.m)
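The time-bin averaging step can be sketched as below (Python rather than the UWA MATLAB adcp_avg.m; datenum-style time in days and the function name are assumptions). Samples that failed QC are NaN, so they drop out of the mean, and empty bins stay NaN, which is the gap-filling behaviour described above:

```python
import numpy as np

def ensemble_average(time, vel, bin_hours=1.0):
    """Average single-ping velocities (pings x bins, bad samples as NaN)
    into fixed time bins; time is in days (datenum-style)."""
    # Assign each ping to an integer time bin.
    t0 = np.floor(time.min() * 24 / bin_hours)
    idx = (np.floor(time * 24 / bin_hours) - t0).astype(int)
    nbins = idx.max() + 1
    # Bin-centre times, back in days.
    out_t = (t0 + np.arange(nbins) + 0.5) * bin_hours / 24
    out_v = np.full((nbins, vel.shape[1]), np.nan)
    for k in range(nbins):
        sel = vel[idx == k]
        if sel.size:
            out_v[k] = np.nanmean(sel, axis=0)  # NaNs (failed QC) ignored
    return out_t, out_v
```

Joining multiple deployments onto one even grid (fill_ts_adcp.m etc.) would then only be a matter of concatenating and reindexing these binned outputs.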

ocehugo commented 4 years ago

@BecCowley - thanks for the information. As you could see, we had prioritized other issues and this was postponed for later in the roadmap. The information above is super-useful and we should include it in the wiki after we can actually do the entire thing in the toolbox.

Thinking forward a bit, what is your opinion about what should be done? My guess is to follow the UWA steps, assuming the fish detection works as expected for beam data, as well as the other tests.

BecCowley commented 4 years ago

@ocehugo, I'm just working through what we should do next in my head. Here are my thoughts:

Map the process described above with what the toolbox already does. Plus add anything missing, which gets us to the stage of a mostly QCd and averaged dataset. I think it can be done prior to the visualisation in the toolbox GUI. I think the toolbox should only display the bin-averaged data upon which we can do further QC if required.

I'd like to make this a pre-processing step, with some plots available outside the toolbox (or during the import/pp step) to help make decisions about the thresholds to use. For example, plots I find useful are mean/min/max/std deviation plots of correlation, percent good, echo intensity, error velocity. There will of course be a fish threshold setting if the ADCP hasn't applied one already. I also like to see pcolor plots of the time series vs pressure/depth with the data points that have failed with each threshold. I can share examples, but these will be slow to produce & review in single-ping datasets. Perhaps they can be done one at a time.
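The mean/min/max/std summary plots mentioned above reduce to per-depth-bin statistics of each QC variable over all pings, which is cheap to compute even for single-ping files. A Python sketch (names illustrative, not toolbox code):

```python
import numpy as np

def bin_stats(x):
    """Per-depth-bin mean/min/max/std of a pings-x-bins QC variable
    (e.g. correlation, echo intensity, error velocity), ignoring NaNs."""
    return {
        "mean": np.nanmean(x, axis=0),
        "min": np.nanmin(x, axis=0),
        "max": np.nanmax(x, axis=0),
        "std": np.nanstd(x, axis=0),
    }
```

Plotting these four profiles against depth bin gives a quick view of where a candidate threshold would start cutting data, without rendering every ping.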

The user should be able to adjust the thresholds as many times as they need before continuing to the toolbox QC gui. I wonder if we can build a separate tool to just use on ADCP data for determining the thresholds for each instrument? It would be useful for all ADCP datasets.

The process above doesn't include the vertical velocity and horizontal velocity QC tests (as well as all the other standard QC tests). These need to be done after the averaging step.

The idea is to minimise the amount of data the toolbox displays to reduce the time it takes to QC the large dataset.

If all the thresholds are set well, the data shouldn't need any further QC. It's the setting of thresholds and trying to retain as much good data as possible that is time consuming with big datasets due to the size of the plots. Reviewing and adjusting them is slow.

ocehugo commented 4 years ago

> I'd like to make this a pre-processing step, with some plots available outside the toolbox (or during the import/pp step) to help make decisions about the thresholds to use. For example, plots I find useful are mean/min/max/std deviation plots of correlation, percent good, echo intensity, error velocity. There will of course be a fish threshold setting if the ADCP hasn't applied one already. I also like to see pcolor plots of the time series vs pressure/depth with the data points that have failed with each threshold. I can share examples, but these will be slow to produce & review in single-ping datasets. Perhaps they can be done one at a time.

Agree - we need a step that allows easy back/forth inspection/exploration - this is currently a weak spot.

By your description above, I think the best way to implement that is a new "button" or "view" in the main toolbox window (e.g. "run diagnostics/processing" ). It's more flexible such to avoid re-importing workflows (and slowness) and also allow saving the raw data.

Again, although one could argue that if the manufacturer software is doing averages we should do the same and follow on from that, I point to the actual UWA workflow above as a de facto example of why we shouldn't follow this line of thinking.

> The user should be able to adjust the thresholds as many times as they need before continuing to the toolbox QC gui. I wonder if we can build a separate tool to just use on ADCP data for determining the thresholds for each instrument? It would be useful for all ADCP datasets.

The "view/window" mentioned above would be useful for other things too (say QC exploration).

This is quite some work (apart from the workflow design and metadata registry), particularly with UI elements/interaction since every UI element/interaction takes some time to be fully tested.

However, if during your diagnostic/parameter exploration you could modularise the plots/analysis, we could probably reuse/adapt it here to start ticking boxes towards that. Also, the workflow design is pretty much what you'll be doing, so we just need to fence it correctly over here (ensuring order of processing, parameter ranges, etc.).

> The process above doesn't include the vertical velocity and horizontal velocity QC tests (as well as all the other standard QC tests). These need to be done after the averaging step. The idea is to minimise the amount of data the toolbox displays to reduce the time it takes to QC the large dataset.

I completely agree with reducing the space dimension for the end products, but I still stand by "allowing raw data to be stored/seen/used by whoever is interested".

Again, I have some options to try to speed things up, but I cannot commit to investigating right now.

BecCowley commented 4 years ago

@ocehugo, thanks for the feedback. I plan to dig into the current PP tools and see if I can use what is there to adapt. I will come up with a workflow and document it for future incorporation into the toolbox.

One issue I'm unsure of is the delivery of single ping data and hourly averaged data to the AODN. So far, I have thought of these options:

  1. Deliver the raw, un-QC'd single ping data as FV01 format files. Deliver the QC'd hourly averaged files as FV01 format.
  2. Deliver the single ping data in both FV00 and FV01 (qc'd). Deliver the hourly data as FV02 format.

I am not clear on how to deliver non-QC'd hourly data, simply because the process we are looking at does the QC before the averaging.

Any thoughts on this would be helpful.

ocehugo commented 4 years ago

@BecCowley,

> I plan to dig into the current PP tools and see if I can use what is there to adapt. I will come up with a workflow and document it for future incorporation into the toolbox.

This is great - keep me posted!

> One issue I'm unsure of is the delivery of single ping data and hourly averaged data to the AODN. So far, I have thought of these options:

>   1. Deliver the raw, un-QC'd single ping data as FV01 format files. Deliver the QC'd hourly averaged files as FV01 format.
>   2. Deliver the single ping data in both FV00 and FV01 (QC'd). Deliver the hourly data as FV02 format. I am not clear on how to deliver non-QC'd hourly data, simply because the process we are looking at does the QC before the averaging. Any thoughts on this would be helpful.

I'm not certain of your delivery requirements. AFAIK, you may just deliver the FV00 (raw, non-QC'd, single-ping) and the FV01 (QC'd). If you can do the QC over single-ping data, great - that is your FV01 file. However, if your workflow (for whatever reason) can only be done over hourly data, then that is it: your FV01 data is the hourly average with QC flags, and differs from FV00 in both QC content and processing/sampling length.

The AODN team has created some code to generate FV02 files, like aggregations and time averages. It may be worth checking whether you really need to create those.

BecCowley commented 4 years ago

@ocehugo, I have constructed a 'singlePing' branch of the toolbox to enable the QC of the data. How best to hand over what I've done? I can step through it with you first if you like, then maybe you can take the branch over?

For the files that need to be sent to IMOS (single ping and averaged), I will open the conversation with @mhidas to start. I guess @petejan would like to be involved too. Anyone else who should be involved that you can think of?

ocehugo commented 4 years ago

@BecCowley, just send us a pull request from your fork and we'll go from there.

I think this discussion belongs in the ANMN internal discussions, but it's better to start the discussion with some "draft proposal" so it's not left open for too long. @ggalibert !?

ocehugo commented 3 years ago

Just adding in a list of possible factors here that would eventually improve performance, so I don't need to go back to my notes: