aodn / public-schema

Shared schema specifications for data exchange (WFS, CSV, etc...)
GNU General Public License v3.0
1 stars 0 forks source link

SQL code for Phytoplankton products #5

Closed mhidas closed 3 years ago

mhidas commented 3 years ago

Here's a draft of what the product generation code will look like in SQL. The pipeline will read these .sql files from here and apply them in the database after the raw data have been harvested.

mhidas commented 3 years ago

@sedgar @leonardolaiolo FYI

mhidas commented 3 years ago

@clairedavies @sedgar @leonardolaiolo FYI, I have just rebased this branch on top of the changes recently merged to master (from #6). In case you have previously checked out the bgc_plankton_products branch, please delete it (your local version of it), pull down the new changes and check it out again. If you haven't checked it out yet, you can ignore this message.

mhidas commented 3 years ago

@clairedavies I have finished all 8 phytoplankton products (abundance & biovolume), so I think we can now review & merge this PR. @leonardolaiolo Just started working on the zooplankton products in a separate branch.

mhidas commented 3 years ago

In your case you are appending the metadata after the pivot and therefore you would want no value and not zeros for these instances.

Actually that depends on what you would like to see in the final CSV output for a trip that currently has no data associated with it (at a given taxonomic grouping level). Should it be

  1. Excluded from the output completely?
  2. A row with blank cells (=nulls) in all columns?
  3. A row with zeros in all columns? (this is the current behaviour)
  4. A row with null or zero depending on the changelog startdate for each taxon?
clairedavies commented 3 years ago
  1. please it is more correct, as we don't know what is 0 at that stage, this is an improvement on 3.
mhidas commented 3 years ago

:+1: That makes sense. Ok I'll remove the dummy values.

leonardolaiolo commented 3 years ago

I'll do the same for the zooplankton :+1:

mhidas commented 3 years ago

I have updated all the phytoplankton views. The resulting data is now available in my demo portal.

mhidas commented 3 years ago

Just noting that there is currently a small bug in the geoserver output format that converts the aggregated jsonb values into separate columns. When the entire jsonb value is NULL (for trips that have no data), that row will actually be missing the entire section where the abundance columns should be, including the comma delimiters. So instead of something like

TripCode,Taxon1,Taxon2,Taxon3
XYZ20210907,,,,

It will have

TripCode,Taxon1,Taxon2,Taxon3
XYZ20210907,

Technically this means it's an invalid CSV file, because that row has missing columns. However, I think most spreadsheet applications like Excel will still be able to open it as if those empty cells were there.

We'll fix the bug when our software engineers can get to it.

(If this doesn't make any sense, just ignore it - it will go away eventually)

clairedavies commented 3 years ago

One thing I have realised needs changing, and this is a change to the existing code as I had forgotten this until now, is that Ciliates, Silicoflagellates, Radiozoa and Foraminiferas were not counted before 2012-08-07 before my group took over the counting. So all values for these HTGs should be null and not 0 prior to this date. You don't have to use the change log for this as nothing else is likely to change at this level, so it can be coded. This does not apply to the CPR data where we counted these thorughout and is why we use different change logs.

clairedavies commented 3 years ago

Technically this means it's an invalid CSV file, because that row has missing columns. However, I think most spreadsheet applications like Excel will still be able to open it as if those empty cells were there.

Excel opens it fine, but it will matter in R, thanks for letting us know that is the reason.

mhidas commented 3 years ago

Ciliates, Silicoflagellates, Radiozoa and Foraminiferas were not counted before 2012-08-07 before my group took over the counting. So all values for these HTGs should be null and not 0 prior to this date.

Ok, I'll look into that soonish. Hard-coding something like that is not the cleanest solution, it's probably the easiest and quickest :)

mhidas commented 3 years ago

Ciliates, Silicoflagellates, Radiozoa and Foraminiferas were not counted before 2012-08-07

After talking to @leonardolaiolo about this, we're a bit unsure - is this for phytoplankton or zooplankton?

clairedavies commented 3 years ago

Phytoplankton

Ciliates, Silicoflagellates, Radiozoa and Foraminiferas were not counted before 2012-08-07

After talking to @leonardolaiolo about this, we're a bit unsure - is this for phytoplankton or zooplankton?

Just phytoplankton - our group did not initially count these samples and the person who did count them did not include these groups. We have always counted the zooplankton and so have included these groups from the beginning.

mhidas commented 3 years ago

@clairedavies I just updated the two phyto HTG products. These have also been updated in the demo portal.

If you're happy with these products for now, could you please submit another review and select "Approve"? Then we can merge this (which you're welcome to do as well).