NOAA-PMEL / LAS

Live Access Server
https://ferret.pmel.noaa.gov/LAS/
The Unlicense

SOCAT: deal with discovered problematic data #1435

Closed karlmsmith closed 6 years ago

karlmsmith commented 6 years ago

Reported by @karlmsmith on 22 Oct 2012 22:07 UTC

Communicate with Benjamin to resolve problematic data issues:

-- Many 49RY and 49UP cruises are duplicates and should have been a rename (from a change/correction of ship name)

-- Expocode 746P20031121 is given to both cruises PB07_FinalCO2_325_eq and PB07_FinalCO2_326_eq (both flag S due to questionable data). It appears the 326 cruise should be dated 20031122.

-- Mooring 316420040523 rename of 48MV20040523 (or vice versa?)

-- Mooring 20060608 rename to 35DR20060608

-- C49K619871028 and 49NB19981228 have a handful of invalid times due to pre-advancement of the month (11-31 found between 10-30 and 11-01; 2-31 found between 1-30 and 2-1). Need to check if this pre-advancement is elsewhere but not detected because the date is valid. (There are some quirky looking points in time vs lat or lon.)

-- Cruise (Expocode 49NZ20090424; Cruise Name 49NZ20090410) has many data points with seconds above 60 (hundredths of a minute?).

-- 77FF20020226 has seconds of exactly 60.000 (round up of 59.9999?)

-- Cruise Expocode 33RO20090511 has one data point with a seconds value of 39.88 which in the database is 39. With a change to float for seconds, correct this value. However, this point had already been given a WOCE flag of 4 (stands out so something else wrong with it).
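
The invalid times listed above (day 31 in November or February, seconds at or above 60) can be caught with a component-wise check before a timestamp is built. A minimal Python sketch, assuming each data row exposes year, month, day, hour, minute, and second fields; the function name `time_problems` is hypothetical and not part of LAS or SOCAT:

```python
import calendar


def time_problems(year, month, day, hour, minute, second):
    """Return a list of reasons why the given date/time fields are invalid.

    Hypothetical helper for illustration; seconds may be fractional.
    """
    problems = []
    if not 1 <= month <= 12:
        problems.append("month out of range")
    elif not 1 <= day <= calendar.monthrange(year, month)[1]:
        # monthrange returns (weekday of day 1, number of days in the month)
        problems.append("day out of range for month")
    if not 0 <= hour <= 23:
        problems.append("hour out of range")
    if not 0 <= minute <= 59:
        problems.append("minute out of range")
    if not 0 <= second < 60:
        problems.append("seconds out of range")
    return problems
```

A check like this only catches a pre-advanced month when it produces an impossible date; shifted dates that remain valid would still need a separate look at time vs lat or lon.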

Migrated-From: http://dunkel.pmel.noaa.gov/trac/las/ticket/1429

karlmsmith commented 6 years ago

Modified by @karlmsmith on 22 Oct 2012 22:26 UTC

karlmsmith commented 6 years ago

Comment by @karlmsmith on 23 Oct 2012 17:13 UTC

Also 06AQ20100131 (ANT26-3):

From: <Kristina.Paterson@csiro.au>
Date: Thu, Oct 18, 2012 at 8:23 PM

I need to apply a large number of flags to a cruise: ANT26_3.

Delta T is often >1 C so QC protocol requires a 3 flag, even 
though the data is good and the warming is due to southern 
latitude.  Using the LAS server is going to take a very long 
time. Is it possible for you or another SOCAT administrator 
to apply these flags please (3 flag to any data point with 
Delta T > 1 C) ?
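
A bulk operation like the one requested could be sketched as a filter over the cruise's data points. This is only an illustration: the row layout (a dict with `delta_t` and `woce_flag` keys) and the function name `flag_large_delta_t` are assumptions, not the SOCAT database schema, and whether the flag should be applied at all remains a QC decision.

```python
def flag_large_delta_t(rows, threshold=1.0, flag=3):
    """Return copies of rows with woce_flag set where delta_t > threshold.

    Hypothetical sketch: uses the signed delta_t exactly as stated in the
    request, and leaves rows with a missing (None) delta_t untouched.
    """
    flagged = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are not modified
        delta_t = new_row.get("delta_t")
        if delta_t is not None and delta_t > threshold:
            new_row["woce_flag"] = flag
        flagged.append(new_row)
    return flagged
```

A real version would also need a rule for data points that already carry a WOCE flag, which this sketch simply overwrites.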
karlmsmith commented 6 years ago

Comment by @karlmsmith on 23 Oct 2012 20:21 UTC

E-mail sent to Benjamin, cc'd Steve and Dorothee:

Hi Benjamin,

I have compiled a list of issues that have been reported to us or we have discovered. 
(Apologies that this has become a long list and thus a long e-mail.)
I have them listed here and need a decision on what to do about these issues.

-- Karl

(1) Cruise ID PX1120E (Expocode 49P120080624) should be PX120E as reported by Sumiko.

(2) From Kristina Paterson:

    I need to apply a large number of flags to a cruise: ANT26_3.

    Delta T is often >1 C so QC protocol requires a 3 flag, even
    though the data is good and the warming is due to southern
    latitude.  Using the LAS server is going to take a very long
    time. Is it possible for you or another SOCAT administrator
    to apply these flags please (3 flag to any data point with
    Delta T > 1 C) ?

Should we suspend the cruise, apply the WOCE flag of 3 to all the 
appropriate data, or does her comment convince you that the data 
should not be given the WOCE flag of 3 and just left as-is?

(3a) Many duplicated cruises 49RY... and 49UP... (see attached 
OverlapSummary.txt).  From the Coastal QC meeting we got confirmation 
that the ship name was Ryofu Maru III (the 49UP... expocodes) instead 
of Ryofu Maru (the 49RY... expocodes) and so apparently these are 
duplicates.  Which Expocode should we keep? 
I noticed Expocode prefix 49RY is for ship Ryofu Maru and for Ryofu Maru II.
(3b) The ship name Ryofu Maru is used in time ranges where there are 
also Ryofu Maru II (Nov 1989 - Jan 1995) and Ryofu Maru III (July 1995 
- Mar 2011) ships.  Are there three separate ships or do ship names 
(and possibly expocodes) need to be updated?

(4) Other duplicated time-locations between cruises are listed in 
OverlapSummary.txt.  Most of these are the tail of one cruise (leg) 
repeated in the subsequent cruise (leg).  Should anything be done to 
mark these cruises or cruise data points as having this issue?

(5) Apparently a duplicate mooring 20060608 and 35DR20060608, although 
the data is slightly different.  Should this have been a rename?  If 
not, is there an updated expocode for the 20060608 mooring?

(6) Expocode 746P20031121 is given to cruises PB07_FinalCO2_325_eq and 
PB07_FinalCO2_326_eq (both flag S due to questionable CO2 data).  The 
326_eq cruise starts on 20031122.  Can you give me an updated 
expocode for one (or both) of these cruises?

(7) C49K619871028 and 49NB19981228 have a handful of invalid times due
to advancement of the month value one data line too early (11-31 
found between 10-30 and 11-01; 2-31 found between 1-30 and 2-1). 
There are some quirky looking points in time vs lat or time vs lon 
(single points sticking out from an otherwise smooth line), suggesting
there may be other pre-advancements of the month which did not result 
in invalid dates.  I am not sure how the month field is getting 
assigned in your data files.  Should these cruises be suspended 
(thinking many dates may be shifted) or just WOCE-flag out the invalid
looking points?

(8) (Expocode 49NZ20090424; Cruise Name 49NZ20090410) has many data 
points with seconds above 60 seconds.  (Maybe hundredths of a 
minute?)  Should this cruise be suspended waiting for corrected 
seconds?  I am not sure how the seconds are getting assigned in your 
data files.

(9) 77FF20020226 has seconds of exactly 60.000.  Are these round-up 
from something like 59.9999 and so should be treated as a minute later 
with seconds of zero?  Or should the cruise be suspended waiting for 
corrected seconds?

(10) 33RO20090511 has one data point with a fractional seconds value 
of 39.88.  This point had been WOCE-flagged 4, presumably because it 
stands apart from the rest of nearby data.  (We will be adding WOCE 
flagging comments to SOCAT3.)  Just pointing this out in case there 
was a mistake in creating the data file.

(11) Should there be expocodes with prefix 3206 as well as 32O6 (zero 
versus oh), or should these prefixes have been the same (if so, which)?

(12) The following expocodes do not fit the pattern of 4-character 
prefix, year, month, day.  I just wanted to verify these were correct 
and that I have not missed a rename or that a mistake was made:
19960803
19970619
06AQ1995112
3250TN19931005
48AY200700510
746P20030222_1
746P20030301_1
746P20030315_1
746P20030322_1
746P20030715_2
There are also 32 cruises with Expocode prefix "XXXX".  See attached 
ExpocodesX.txt, in case any of these should have been renamed.

(13) Any decision yet on what to do about the WOCE-flag files that 
were discovered in our SOCAT documents directories?

(14) What should be done to make sure Dorothee's (and any other) 
metadata files added in our SOCAT documents are archived and available 
to the general public?  Or should they be sending these documents to 
you directly?

(15) How do the WOCE flags set in our database get back to you?  Do 
you download the data from the SOCAT database or do we send you 
something?  
I discovered the "Data" button from the table of cruises just pulls 
data from an svn archive and as such will always be out-of-date.  I 
will be changing this at some point to pull data directly from the 
database, as is done with "Download Data..." -> "All Variables", so 
the information (the WOCE flags in particular) is always up-to-date.

(16) What should be done with the duplicated time-locations within a 
cruise?  Should these cruises or cruise points somehow be marked?  I 
have uploaded documents into SOCAT that give details about these 
duplicated time-locations.  Some of these appear to be as Rik 
described: five repeated measurements all reported.  But others are 
like Dorothee's mysterious additional data values, and some appear as 
if they are corrections adding additional data.  I have verified that 
the data in the database exactly match all the data in the archived 
*.mat.txt file that came from you.  (The only exception is the one 
fractional seconds data point, and I will fix that in case future 
cruises have fractional seconds.)  There are currently 3103 of these 
files, which matches the number of cruises in the database.
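
The pattern check behind item (12) can be sketched in Python. This assumes the prefix is four uppercase letters or digits followed by a valid YYYYMMDD date; the regular expression and the function name `fits_expocode_pattern` are illustrative, not the check actually used for SOCAT.

```python
import re
from datetime import date

# Four-character prefix (uppercase letters or digits), then YYYYMMDD.
EXPOCODE_RE = re.compile(r"([0-9A-Z]{4})(\d{4})(\d{2})(\d{2})")


def fits_expocode_pattern(expocode):
    """Return True if expocode is a 4-character prefix plus a real date."""
    match = EXPOCODE_RE.fullmatch(expocode)
    if match is None:
        return False
    year, month, day = (int(group) for group in match.groups()[1:])
    try:
        date(year, month, day)  # raises ValueError for dates like 11-31
    except ValueError:
        return False
    return True
```

Note that a purely syntactic check like this accepts both 3206 and 32O6 as prefixes, so the zero-versus-oh question in (11) still has to be settled by hand.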
karlmsmith commented 6 years ago

Comment by @karlmsmith on 24 Oct 2012 18:05 UTC

Dorothee's response:

Thank you for this extensive list.

Below a quick response for some items.

On (1) For you and Benjamin.

On (2) see separate email. Data with dT>1  in cold waters should not 
be given flag 3, unless there are additional concerns.

On (3) for you and Benjamin, relevant QCers and PIs. The final 
product should NOT contain duplicate cruises.

On (4) for you and Benjamin, relevant QCers and PIs. The final product 
should NOT contain duplicate data / overlapping data. Otherwise we can 
hardly expect all QCers to work through the long list of duplicate 
cruises/data lines. Ideally a warning should show up when one is doing 
QC of a specific cruise / PER REGION

On (5) as (3). Are these duplicate data? If so, at most one 
cruise/data set should be retained.

On (6)  If this is one data set (duplicate, overlapping data) they 
will have the same expocode.  In Tsukuba we agreed that wherever 
possible the Expocode will have the date the ship leaves port/ a 
research cruise officially starts (even if the CO2 instrument was 
switched on a few days later).

On (7) The cruise should be suspended if there are more than 50 bad 
data points (bad month/date/position/salinity/SST/Tequ/Patm/..). Bad 
data should be flagged 4 for a cruise with fewer than 50 bad data 
points. There seem to be a lot of such cruises in version 2. Yesterday 
I suspended many cruises (~15) for this reason.  Such cruises will be 
updated (either by new submission, new ingestion into SOCAT, new 
recalculation with checking for bad data) in the next version of SOCAT.

On (8). For you and Benjamin. Strange. Matlab code might need 
correction for future versions. If this only affects one cruise, 
suspension might be a good idea. If this affects many cruises, a fix 
might be appropriate.

On (9). As (8)

On (10). For you and Benjamin.  Is it necessary to have decimal 
seconds? Or can seconds be rounded to whole seconds? Otherwise 
let's not worry about one data point.

On (11) For you and Benjamin.  Confusing indeed.

On (12) For you and Benjamin

On (13) For you and Benjamin

On (14). Since the option to upload metadata exists, a QC-er would 
logically expect these metadata to be added to the cruise for version 2.

If you do not want this to happen, then the upload option should be 
disabled. Not sure how useful the googledocs is. Are there any 
advantages? I preferred the previous setup with access to data from a 
specific cruise only. Now I need to search for each cruise.

On (15) For you and Benjamin. Yesterday the system allowed me to add 
WOCE flags for cruises in version  1.4 (bad salinities)???? Surely 
this should not be possible or be treated with great care.

On (16) For you and Benjamin. Potentially this is a serious issue. 
E.g. the 1993 Polarstern cruise you asked me about should not have any 
duplicate times with data points about 9 minutes apart. The cause of 
the duplicate times needs assessing before a decision is taken on how 
to deal with them.

A final question

Is there a mechanism for flagging bad cruises / data in version 1.4 
(bad salinities of -999 and -9.95 and 0 far from land)?
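
The rule of thumb given under (7) can be written as a small decision function. This is a sketch only: the function name `qc_action` and the returned labels are made up for illustration, and the treatment of exactly 50 bad points is an assumption, since the reply only covers "more than 50" and "fewer than 50".

```python
def qc_action(bad_point_count, limit=50):
    """Suggest a QC action for a cruise from its number of bad data points.

    Assumption: a count equal to the limit is treated like "fewer than",
    i.e. the points are flagged rather than the cruise suspended.
    """
    if bad_point_count > limit:
        return "suspend"
    if bad_point_count > 0:
        return "flag-4"  # give each bad data point a WOCE flag of 4
    return "none"
```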
karlmsmith commented 6 years ago

Comment by @karlmsmith on 24 Oct 2012 18:33 UTC

Benjamin's response:

Thanks for the mail. I answer in the mail below - might be easier that way.
Concerning expocodes: we agreed to use the NODC codes, but there is one 
major issue. NODC did not require as much information as is needed for 
assigning them (so duplicates occurred), the codes were changed a lot, 
and there is no version history available at NODC. One example: the 
vessel Ryofu Maru was first called just Ryofu Maru, with the expocode 
49RY, and afterwards changed its name to Ryofu Maru II. And Ryofu Maru III 
was registered. Another issue was that research vessels, e.g. Discovery, 
had two NODC codes, 74E3 and 74DI.

NODC does not assign codes anymore. ICES (the International Council for 
the Exploration of the Sea, in Copenhagen) now assigns them and is 
trying to sort out the mess. NODC gets its code updates from ICES. This 
gets even more complicated since the internationally agreed code for 
Discovery is 74E3, which is not used in our community - we use 74DI. 
I spent some time this summer sorting those issues out with ICES.  Now 
ICES only assigns new codes after contacting a steering group and after 
receiving detailed information about the vessel. They also keep a 
history of former names, previous expocodes...

The main issue is that we have to change some expocodes for some 
vessels in order to be consistent in the future. If you have better 
ideas please let me know.

See my other comments below.

best
Benjamin

> (1) Cruise ID PX1120E (Expocode 49P120080624) should be PX120E ....
No problem at all - the cruise was named PX1120E in the file I got - 
will be changed when the data are archived.

> (2) From Kristina Paterson: ....
Dorothee answered

> (3a) Many duplicated cruises 49RY... and 49UP... ....
There were different vessels according to the metadata.

> (3b) The ship name Ryofu Maru is used in time ranges where ....
There are at least two vessels

> (4) Other duplicated time-location between cruises listed ....
I have to take a detailed look

> (5) Apparently a duplicate mooring 20060608 and 35DR20060608 ....
Those data were reported three times and every file had a different duration. 
Previously there was no expocode for moorings, and we use 35DR for that mooring. 
I will talk back to the PI about which file should be used.

> (6) Expocode 746P20031121 is given to cruises PB07_FinalCO2_325_eq ....
Use 746P20031122 for PB07_FinalCO2_326_eq - the file was updated but 
not relevant since those data will not be included in SOCAT.

> (7) C49K619871028 and 49NB19981228 have a handful of invalid times ....
This happens a lot when some PIs calculate the date from julian day. 
I use the date as stated in the file and correct it when I encounter 
those issues. Often it is not clear how julian day was defined in the 
data (January 1 = 1 or 0). I will take a look at those cruises.

> (8) (Expocode 49NZ20090424 ... has many data points with seconds above 60 seconds....
I made a mistake while transforming the data; they were reported in an 
uncommon way (0 to 235959) and the software I use takes decimal 
seconds as well. Those cruises will be suspended and added in another version. 

> (9) 77FF20020226 has seconds of exactly 60.000.  ....
The PI reported both; it seems his scripts had rounding issues. 
He reports 0 and 60. I will talk back to the PI.

> (10) 33RO20090511 has one data point with a fractional seconds ....
ok

> (11) Should there be expocodes with prefix 3206 as well as 32O6 ....
No, they are one. Both were used by the PI - looks like it was a 
reading/seeing issue. 3206 (zero) is the correct one. 

> (12) The following expocodes do not fit the pattern of ....
> 19960803
> 19970619
no expocodes assigned in V1.5 since there was no NODC code available for moorings
> 06AQ1995112
no idea where this one comes from - it was corrected a long time ago to 06AQ19951112
> 3250TN19931005
Version 1.5 should be named 325019931005
> 48AY200700510
corrected to 48AY20070510
> 746P20030222_1
> 746P20030301_1
> 746P20030315_1
> 746P20030322_1
> 746P20030715_2
746PXXX: I think all of them were in version 1.5 and were suspended. 
Those cruises were obtained on a ferry that operates frequently. When 
some bad data was deleted on the first day of a cruise, two cruises 
had the same expocode. 
> There are also 32 cruises with Expocode prefix "XXXX"....
XXXX was used if no NODC code could be assigned. Back then they just 
wanted ships registered, and those were e.g. buoys and moorings.

> (13) Any decision yet on what to do about the WOCE-flag files ...
It looks like Anna Lourantou created them. Her latest comment was: 
ok if I say I am the one who did this, am I to be punished???

> (14) What should be done to make sure Dorothee's (and any other)....
Whatever is easiest - I can update them.

> (15) How do the WOCE flags set in our database get back to you? ....
Heather sent me all data last time with all WOCE flags.

> (16) What should be done with the duplicated time-locations ....
I take a detailed look and will answer.
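
The day-numbering ambiguity mentioned in the answer to (7) is easy to see in a short Python sketch. It assumes "julian day" here means day-of-year within the cruise year; the function name `day_of_year_to_date` and the `jan1_is` parameter are hypothetical.

```python
from datetime import date, timedelta


def day_of_year_to_date(year, day_number, jan1_is=1):
    """Convert a day-of-year number to a calendar date.

    jan1_is states which number the data provider used for January 1
    (1 or 0); getting it wrong shifts every date by one day.
    """
    if jan1_is not in (0, 1):
        raise ValueError("jan1_is must be 0 or 1")
    return date(year, 1, 1) + timedelta(days=day_number - jan1_is)
```

For example, day 304 of 1987 is 31 October under the "January 1 = 1" convention but 1 November under "January 1 = 0", which is exactly the kind of one-day shift at a month boundary described in this ticket.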

And my reply:

Thanks for the responses.  I will make appropriate changes in the 
database where you have indicated what to do.

I guess we need to contact the PI regarding the 49RY/49UP cruise 
duplications to resolve the situation.

And tell Anna Lourantou that she is to be thanked for the QC work, not 
punished.  I just need to get this information into the database (for 
data points that do not already have a WOCE flag).  That was all I was 
trying to find out.

If you can deal with Dorothee's metadata files this time, that would 
be wonderful.  I need to incorporate some sort of alerting mechanism 
so file uploads such as this and Anna's are always seen and 
appropriate follow-up actions are taken.

I will assume that you will let me know when you want a listing of the 
flags (QC and WOCE) and in what format.  Maybe these files that look 
like your mat.txt files, but have some additional columns, are what 
you will want.

Also responded to Dorothee; she was able to flag cruises as needed. She did think that maybe some warning would be appropriate if flagging a cruise that was not updated or new to the latest version.

karlmsmith commented 6 years ago

Comment by @karlmsmith on 16 Nov 2012 17:55 UTC

(1) Change cruise_name to PX120E for expocode 49P120080624

(2) nothing to do

(3a) 49RY -> 49UP : Dorothee has excluded all the duplicated 49RY cruises, and the 49UP cruises are being QC'd as this was also a data update, not just a rename.
So this works around the problem, although ideally:
(i) the X flag should be changed to an R, and the comment changed to indicate the rename;
(ii) an entry should be put into the cruise_identification_log indicating the rename;
(iii) flags for the duplicate 49RY cruises should be duplicated (with an old date) to the 49UP cruises and then modified to indicate the rename/copy;
(iv) all WOCE_flags, data_bak, and cruise_regions for the duplicate 49RY cruises should be deleted from the database;
(v) any doi table entries should be updated.

(3b) nothing to do

(4) nothing to do at this time. Some more duplicates/overlaps reported in newly ingested data.

(5) nothing to do at this time. expocode/cruise ID 20060608 changed to XXXX20060608 for consistency

(6) PB07_FinalCO2_325_eq cruiseID changed to 746P20031121; PB07_FinalCO2_326_eq changed to 746P20031122. Note: PB07_FinalCO2_326_eq had cruise_ID 746P20031121 at one time; this instance changed to 746P20031121_orig in all the tables.

(7) Presumably the update was to fix these issues, but need to retest.

(8) nothing to do at this time - still in there after the update, but 49NZ20090424 (with sec > 60) was suspended.

(9) nothing to do at this time - 77FF20020226 (with sec = 60) is flag C.

(10) seconds field in the data was changed to float

(11) 32O6 (oh-6) expocodes renamed to 3206 (zero-6), with cruise_ID changed to match expocode. The TOW5-1, -2, -3, -4, -5 cruises given expocode/cruise ID 320619971205-1, -2, -3, -4, -5.

(12) Renames:
19960803 -> XXXX19960803
19970619 -> XXXX19970619
20060608 -> XXXX20060608
06AQ1995112 -> 06AQ19951112
3250TN19931005 -> 325019931005
48AY200700510 -> 48AY20070510
746P20030222_1 -> 746P20030222
746P20030301_1 -> 746P20030301
746P20030315_1 -> 746P20030315
746P20030322_1 -> 746P20030322
48MV19991018 -> 35MF19991018

(13) still to do, but first need to get expocodes from v1.4. They might have already been set in there and then got lost.

(14), (15), (16) - nothing to do.
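
The renames listed under (12) amount to a lookup table. A minimal Python sketch of applying such a table to a list of expocodes, refusing any rename that would collide with a code already present; it works on an in-memory list only and does not touch the database tables, and the names `RENAMES` and `apply_renames` are hypothetical:

```python
# Old expocode -> new expocode, as listed under (12).
RENAMES = {
    "19960803": "XXXX19960803",
    "19970619": "XXXX19970619",
    "20060608": "XXXX20060608",
    "06AQ1995112": "06AQ19951112",
    "3250TN19931005": "325019931005",
    "48AY200700510": "48AY20070510",
    "746P20030222_1": "746P20030222",
    "746P20030301_1": "746P20030301",
    "746P20030315_1": "746P20030315",
    "746P20030322_1": "746P20030322",
    "48MV19991018": "35MF19991018",
}


def apply_renames(expocodes, renames=RENAMES):
    """Return the renamed expocodes, raising if a rename creates a duplicate."""
    result = []
    seen = set()
    for code in expocodes:
        new_code = renames.get(code, code)  # unchanged if not in the table
        if new_code in seen:
            raise ValueError(f"rename of {code} collides with {new_code}")
        seen.add(new_code)
        result.append(new_code)
    return result
```

The collision check matters because several of the problems in this ticket were two cruises ending up under one expocode.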

I am going to close out this ticket and create new tickets with the individual issues still needing to be resolved.