aodn / nrmn-application

A web application for collation, validation, and storage of all data obtained during surveys conducted by the NRMN
GNU General Public License v3.0

Data errors #1354

Open bpasquer opened 5 months ago

bpasquer commented 5 months ago

From Lizzi's email 2024-04-10:

Hello NRMN team,

We have been alerted to some major data discrepancies for data ingested in the period of January 2023. We have not yet investigated the full extent of this issue (i.e. all the affected jobs), but here is an example for Job ID 147. We have compared the staged sheet with the ingested sheet and found the following:

Out of the 123 rows staged, only 52 rows were ingested. 79 rows are missing which equates to 13,718 individuals/sightings. We have attached the ingested data file (RLS CANADA_2021_missingdata_ingested.xlsx) and the original staged data (RLS CANADA_2021_missingdata_stagedhighlighted.xlsx) for you to look at.

- Data (method 1 and 2) was not ingested where the total column did not equal the sum of columns T-AV (non-sized inverts and sized columns). The usual blocking validation error for this did not work – e.g. rows highlighted yellow in attached.
- Some sized method 2 and 0 invertebrate data was also not ingested where total does equal columns T-AV (non-sized inverts and sized columns) – e.g. lobster highlighted orange in attached. Note that the job was staged under the RLS program without extended invert sizing, so we cannot understand why lobsters were affected in particular.
- Some sized method 1 fish data was also not ingested where total does equal columns T-AV (non-sized inverts and sized columns) – e.g. the lump fish in NFLD4 highlighted red in attached.
- Debris-zero was ingested but all other debris records were not (none of these had a value in the unsized-inverts column T) – no warning error given.

In addition when we tried to check Job 141, we could not view the ingested sheet through the UI, it comes up with no data: https://nrmn.aodn.org.au/data/job/141/edit

For Job 138, 20 rows were omitted from ingest. Many were for the species Austrolabrus maculatus but we are unsure why. Some of the “omitted” rows are actually true duplicates that are supposed to be summed on ingest. This still needs to be checked though, as it looks like the totals of these duplicates are not being summed on ingest. The user should receive a warning error for duplicate observations for the same site/date/depth/method/block/species combos, but if the user chooses to ignore this warning, the data are to be summed. The original staged data and the ingested data are attached, with the rows highlighted in yellow needing investigation (green highlights are expected omissions).

This is really concerning as international partners and PhD students are noticing these errors as they are currently trying to analyse their data, but we don’t know when this glitch began or ended – so we are not sure where to start investigating how many surveys are affected.

It is really important to know how far these issues extend, as we need to know whether to send out an email to all known data users if these errors can’t be resolved soon.

Cheers, Lizzi and Toni

bpasquer commented 5 months ago

Response from Bene 2024-04-19:

Hi all,

We have looked at JOBID147 to investigate the raised issues:

"Out of the 123 rows staged, only 52 rows were ingested. 79 rows are missing which equates to 13,718 individuals/sightings. We have attached the ingested data file (RLS CANADA_2021_missingdata_ingested.xlsx) and the original staged data (RLS CANADA_2021_missingdata_stagedhighlighted.xlsx) for you to look at. Data (method 1 and 2) was not ingested where the total column did not equal the sum of columns T-AV (non-sized inverts and sized columns). The usual blocking validation error for this did not work – eg. rows highlighted yellow in attached"

We confirm that a significant portion of the data was not ingested. The missing rows are all Inverts (M2), plus one M1-recorded species (Bolinopsis infundibulum, Northern comb jelly), and they are rows with no measurement recorded in the Inverts column (T) or in any of the size classes. The blocking validation check on the "Total" failed to raise the issue. Note that the production UI currently raises a "Row contains no measurements" blocking check.

"Some sized method 2 and 0 invertebrate data was also not ingested where total does equal columns T-AV (non-sized inverts and sized columns) – eg. lobster highlighted orange in attached. Note that the job was staged under the RLS program without extended invert sizing, so we cannot understand why lobsters were affected in particular."

Attached (RLS-CANADA-JOB147_ingested) is the extract of the observations ingested in the database. The extract contains the M0 and M2 lobster records.

"Some sized method 1 fish data was also not ingested where total does equal columns T-AV (non-sized inverts and sized columns) – eg. the lump fish in NFLD4 highlighted red in attached."

The ingested data (RLS-CANADA-JOB147_ingested) contains records for both M1 species highlighted in red, under their valid names: Cyclopterdae lumpus and Macrozoarces americanus were ingested as Cyclopterus lumpus and Zoarces americanus respectively.

"Debris-zero was ingested but all other debris records were not (none of these had a value in the unsized-inverts column T) – no warning error given."

The ingested data does not contain Debris data of any type, contrary to what the ingested sheet suggests. This is consistent with the issue mentioned above, where records with no measurements are missing. The Debris-zero data shown in the ingested sheet results in no record in the DB because a value of '0' is recorded in the Inverts column. To be ingested, the value should be set to '1'.
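To make the two row-level checks discussed in this comment concrete, here is a minimal Python sketch. It is not the application's actual validation code. The function name `check_row`, its parameters, and the wording of the total-mismatch message are hypothetical; only the "Row contains no measurements" message comes from this thread. The sketch also assumes that a blank or zero cell counts as "no measurement", which is a guess based on the Debris-zero behaviour described above.

```python
from typing import List, Mapping, Optional


def check_row(
    total: Optional[int],
    inverts: Optional[int],
    size_classes: Mapping[str, Optional[int]],
) -> List[str]:
    """Return blocking error messages for one staged row (empty list = row passes).

    `inverts` stands for the unsized Inverts column (T) and `size_classes`
    for the sized columns up to AV; blank cells are passed as None.
    """
    values = [inverts, *size_classes.values()]
    # Assumption: blank (None) and zero cells both count as "not recorded".
    recorded = [v for v in values if v is not None and v > 0]

    errors: List[str] = []
    if not recorded:
        # Message text quoted in the thread for the production UI check.
        errors.append("Row contains no measurements")
    elif total != sum(recorded):
        # Hypothetical message for the Total vs. sum(T..AV) check.
        errors.append(
            f"Total {total} does not equal sum of measurements {sum(recorded)}"
        )
    return errors
```

A row with a Total but blank Inverts and size-class cells would be blocked by the first branch, which is the case that appears to have slipped through for Job 147.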

"In addition when we tried to check Job 141, we could not view the ingested sheet through the UI, it comes up with no data" I confirm there is an issue with the retrieval of the ingested sheet.

However, the uploaded sheet contains 70 data rows, which is the number of rows that appear to have been ingested according to the ingest summary. Furthermore, the database stores 96 records, which is again the same number of records as in the uploaded sheet. This sheet has been correctly ingested.

Note that, unlike RLS_Canada_2021, this sheet had records in the Inverts column for all unsized observations, including Debris data. As a result, all the data was correctly ingested, confirming that the problem with the RLS-Canada sheet is due to the absence of a warning or blocking message to flag missing entries in the Inverts column.

To assess the impact of the absence of the blocking message, we are reviewing all raw datasheets ingested to date to pinpoint any sheets with missing measurements that might not have been processed correctly.

"For Job 138. 20 rows were omitted from ingest. Many were for the species Austrolabrus maculatus but we are unsure why. Some of the “omitted” rows are actually true duplicates that are supposed to be summed on ingest. This still needs to be checked though as it looks like the totals of these duplicates are not being summed on ingest. User should receive a warning error for duplicate observations for same site/date/depth/method/block/species combos but if the user chooses to ignore this warning, data are to be summed. The original staged data and the ingested data are attached, with the highlighted rows in yellow needing investigating (green highlights are expected omissions)."

The Job138 issue requires further investigation, as we haven't yet identified clear patterns to explain why the highlighted rows weren't ingested. Regarding your assessment of duplicate processing, I can confirm that the current software version does:

- flag duplicate rows
- sum "true" duplicates in the endpoints
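The expected duplicate handling (warn on repeated site/date/depth/method/block/species combinations, then sum the totals if the warning is ignored) can be sketched as follows. This is an illustrative Python sketch only, not the application's code; the `Observation` type, its field names, and `sum_duplicates` are hypothetical names introduced for the example.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Observation(NamedTuple):
    # Hypothetical field names; the first six form the duplicate key.
    site: str
    date: str
    depth: str
    method: int
    block: int
    species: str
    total: int


def sum_duplicates(
    rows: List[Observation],
) -> Tuple[List[Observation], List[tuple]]:
    """Merge rows sharing the same key by summing `total`.

    Returns the merged rows (in first-seen order) and the list of keys
    that occurred more than once, i.e. the rows to warn the user about.
    """
    totals: Dict[tuple, int] = defaultdict(int)
    counts: Dict[tuple, int] = defaultdict(int)
    for row in rows:
        key = tuple(row[:6])  # site, date, depth, method, block, species
        totals[key] += row.total
        counts[key] += 1

    merged = [Observation(*key, total) for key, total in totals.items()]
    duplicate_keys = [key for key, n in counts.items() if n > 1]
    return merged, duplicate_keys
```

Comparing the output of such a function against the ingested totals for Job 138 would be one way to check whether true duplicates were summed or dropped.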

bpasquer commented 5 months ago

JOB138 issue recorded in https://github.com/aodn/nrmn-application/issues/1350

JOB 141 issue recorded in https://github.com/aodn/nrmn-application/issues/1355

bpasquer commented 5 months ago

Following up on JOBID147: see comment above: "To assess the impact of the absence of the blocking message, we are reviewing all raw datasheets ingested to date to pinpoint any sheets with missing measurements that might not have been processed correctly."

The list of files containing rows with no measurements has been compiled and shared with the facility.

bpasquer commented 5 months ago

Further comments regarding JOB ID 147 that need further investigation:

Lizzi: Some sized method 2 and 0 invertebrate data was also not ingested where total does equal columns T-AV (non-sized inverts and sized columns) – eg. lobster highlighted orange in attached. Note that the job was staged under the RLS program without extended invert sizing, so we cannot understand why lobsters were affected in particular.

Bene: Attached (RLS-CANADA-JOB147_ingested) is the extract of the observations ingested in the database. The extract contains the M0 and M2 lobster records.

Lizzi: I’ve looked at this again and can confirm that the observations are present in the database (looking at the UI corrections tool for survey 923401379). However, the ‘view ingested sheet’ feature does not stage or export these records, which is strange. Is this feature supposed to reflect the data that got ingested, or the last save of the data in the staging platform? Either way, it seems broken. Perhaps this feature needs to be reviewed in light of this and the next point below.

I noticed that the debris-zero records for this survey were actually not ingested (looking at the UI corrections tool for survey 923401379), even though they appear in the ‘ingested sheet’. I double-checked the debris extract and they are not present. The reason they were not ingested may be the same as for the point above (missing 0 in the inverts column), as you suggest in the last point. So perhaps it is an issue of the ‘view ingested sheet’ feature being very inconsistent and not reflecting what was actually ingested: sometimes including things not ingested and sometimes excluding things that were!?
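One way to quantify the inconsistency described here is to compare the records shown by ‘view ingested sheet’ with those actually in the database, in both directions. The sketch below is a hedged Python illustration: `reconcile` and the record keys (survey id, method, species label) are hypothetical, and how each side would be exported is not covered.

```python
from typing import Dict, Hashable, Iterable, List


def reconcile(
    sheet_keys: Iterable[Hashable],
    db_keys: Iterable[Hashable],
) -> Dict[str, List[Hashable]]:
    """Report records present on only one side.

    `sheet_keys` are record keys taken from the 'ingested sheet' view and
    `db_keys` are keys taken from a database extract (both hypothetical).
    """
    sheet, db = set(sheet_keys), set(db_keys)
    return {
        "only_in_sheet": sorted(sheet - db),  # shown in the sheet, not ingested
        "only_in_db": sorted(db - sheet),     # ingested, missing from the sheet
    }
```

For survey 923401379, the debris-zero rows would be expected under `only_in_sheet` and the lobster records under `only_in_db`, if the behaviour is as described in this comment.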