AtlasOfLivingAustralia / data-management

Data management issue tracking

Duplicate records in AVH #419

Closed. M-Nicholls closed this issue 5 years ago.

M-Nicholls commented 5 years ago

Duplicate records were added instead of being updated during the last set of loads to the AVH.

e.g.

| institutionCode | catalogNumber |
| --------------- | ------------- |
| CANB | CANB 1.1 |
| CANB | CANB 101059.1 |
| CANB | CANB 101303.1 |
| CANB | CANB 101945.1 |
| CANB | CANB 102152.1 |
| CANB | CANB 102405.1 |
| CANB | CANB 10258.1 |

ansell commented 5 years ago

Downloaded the complete rowkey-uuid set from cassandra and will reduce that down to just https://collections.ala.org.au/public/showDataResource/dr2287 and https://collections.ala.org.au/public/showDataResource/dr376 and do some duplicates analysis on it.

The dump is currently on ala-cave at /data2/avh-duplicates-analysis/export.csv_merged
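
For reference, a minimal sketch of the kind of duplicates analysis intended here, under assumptions not confirmed above: that the export parses as plain CSV, that column 1 is a rowkey of the form drXXX|<provider key> (as suggested by the grep pipeline further down this thread), and that column 2 is the ALA uuid.

```python
# Minimal sketch (not the actual analysis) of reducing the cassandra
# rowkey-uuid export to dr2287/dr376 and looking for near-duplicate rowkeys.
import csv
from collections import defaultdict

EXPORT = "/data2/avh-duplicates-analysis/export.csv_merged"
TARGETS = ("dr2287|", "dr376|")

uuids_by_key = defaultdict(set)
with open(EXPORT, newline="") as f:
    for row in csv.reader(f):
        if len(row) < 2 or not row[0].startswith(TARGETS):
            continue
        # Collapse whitespace in the rowkey so keys differing only by spaces
        # (e.g. "CANB 1.1" vs "CANB1.1") group together.
        uuids_by_key["".join(row[0].split())].add(row[1])

suspects = {k: v for k, v in uuids_by_key.items() if len(v) > 1}
print(f"{len(suspects)} collapsed rowkeys map to more than one uuid")
```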

ansell commented 5 years ago

Attempting (slowly) to download the complete uuid set (named occurrence_id in solr) for data_hub_uid:dh9 from solr, along with the last_load_date and last_processed_date, to cross-check the cassandra set. https://biocache.ala.org.au/ws/occurrences/facets/download?q=data_hub_uid:dh9&facets=occurrence_id,last_load_date,last_processed_date

The list from cassandra was obtained from the occ_uuid table, which contains deleted records, and can be used to cross-check the solr result, which should not contain any deleted records.
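
A sketch of that cross-check, assuming one-uuid-per-line extracts have been produced from each source (both file names below are placeholders):

```python
# Hypothetical cross-check of the two uuid sets. uuids only in cassandra may
# simply be deleted records, since occ_uuid keeps them; uuids only in solr
# would be unexpected.
def load_uuids(path):
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

cassandra = load_uuids("cassandra_occ_uuid.txt")
solr = load_uuids("solr_occurrence_id.txt")

print("in solr but not in cassandra:", len(solr - cassandra))
print("in cassandra but not in solr (possibly deleted):", len(cassandra - solr))
```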

ansell commented 5 years ago

The occurrences/facets query took too long and cancelled itself after 1 hour, so I will look into another approach to get those three facets for the entire AVH.

ansell commented 5 years ago

Used biocache offline downloads to download the basic fields for each of the three data resources that make up AVH based on the collectory data_hub_uid:dh9 metadata:

ansell commented 5 years ago

These are the statistics derived from the solr offline downloads for catalogNumber, collectionCode, and institutionCode:

| data resource | institutionCode | collectionCode | catalogNumber |
| --- | --- | --- | --- |
| dr2287 | 32 empty, 2 unique, 533886 records with values | 97 empty, 2 unique (the same 2 unique values as institutionCode), 533821 records with values | 25 empty, 533886 unique, 533893 records with values |
| dr376 | 1 empty, 15 unique, 5462039 records with values | 748 empty, 19 unique, 5461292 records with values | 1 empty, 4979401 unique, 5462039 records with values |
| dr2153 | 0 empty, 9 unique, 847142 records | 0 empty, 9 unique, 847142 records (the same 9 unique values as institutionCode) | 0 empty, 847142 unique, 847142 records |

This analysis shows another anomaly: records with an empty institutionCode or catalogNumber are not fit for use as a primary key, as required by the collectory metadata.
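
For transparency, the per-column counts above can be reproduced with something like the following sketch against one of the offline download CSVs (the file name and exact header spellings are assumptions):

```python
# Sketch reproducing the per-column counts: empty, distinct values, and records
# with a value.
import csv

FIELDS = ["institutionCode", "collectionCode", "catalogNumber"]

empty = {c: 0 for c in FIELDS}
populated = {c: 0 for c in FIELDS}
distinct = {c: set() for c in FIELDS}

with open("dr376_offline_download.csv", newline="") as f:
    for row in csv.DictReader(f):
        for c in FIELDS:
            v = (row.get(c) or "").strip()
            if v:
                populated[c] += 1
                distinct[c].add(v)
            else:
                empty[c] += 1

for c in FIELDS:
    print(f"{c}: {empty[c]} empty, {len(distinct[c])} unique, {populated[c]} records with values")
```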

ansell commented 5 years ago

Unfortunately, biocache does not have the row key in Solr due to an intentional design decision to hide the original row key, so some more analysis will be required to match the Solr results back against the cassandra dump to find the original row keys that match each of the worrying UUIDs.

nielsklazenga commented 5 years ago

@ansell thanks for all your work on this! I think the records with no value for collectionCode in dr376 (that's me) might be the result of an upload of PERTH type images outside the normal AVH aggregation process.

ansell commented 5 years ago

@nielsklazenga collectionCode doesn't appear to be part of the primary key based on looking at the metadata in collections.ala.org.au, so that should be okay in terms of record management. I added collectionCode to the analysis here to contrast it with institutionCode in the cases where it was different, just in case that adds some information for someone else reviewing these results.

For dr376, the 1 record with a missing institutionCode and the 1 with a missing catalogNumber could be bigger issues, as those fields are used for the primary key: multiple records missing those values could have been silently merged/overwritten into a single record, and we currently have no provenance or analysis to tell us when that happens.

I check uniqueness of primary keys myself for datasets that I load manually, but we rely very heavily on the original accuracy of input data for automated loads--particularly delta loads--and the primary key being incorrect can silently introduce data corruption.
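
As an illustration, a pre-load uniqueness check of the kind described could look like the following sketch (the input file name is a placeholder; the key fields are the ones nominated in the collectory metadata for AVH):

```python
# Sketch of a pre-load primary key check on a source file.
import csv
from collections import Counter

KEY_FIELDS = ("institutionCode", "catalogNumber")

keys = Counter()
with open("incoming_delta.csv", newline="") as f:
    for row in csv.DictReader(f):
        keys[tuple((row.get(k) or "").strip() for k in KEY_FIELDS)] += 1

# Flag keys that occur more than once, or that have an empty component.
problems = {k: n for k, n in keys.items() if n > 1 or not all(k)}
for key, n in sorted(problems.items()):
    print("duplicate or incomplete key:", key, "count:", n)
```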

nielsklazenga commented 5 years ago

@ansell that's what I mean. The records I deliver always have institutionCode, collectionCode and catalogNumber. The records without values for these fields did not come from me and have been there for more than a year. See #326.

The data (at least the three columns we are talking about) in the deltas that were uploaded last Friday look okay to me. I will be very embarrassed if the problem is in the data I delivered.

ansell commented 5 years ago

Thanks, I hadn't noticed that issue; I will investigate it at the same time. The issue is likely in biocache-store, and it is made worse by the system architecture choice made by a software developer to hide the original row keys, which makes it difficult to identify these cases in any interface or tool.

ansell commented 5 years ago

I reran the analysis done for #379 on the current occ_uuid export, and it returned the same numbers, so it doesn't appear that new uuids have been added for the corrupted row keys from the Darwin Core Archives issue. By the software developers' design, biocache-store never deletes anything from occ_uuid, so they will continue to exist forever, possibly confusing future analysis.

$ grep '|Some(' /data2/avh-duplicates-analysis/originals/export.csv_merged | cut -d , -f 1 | cut -d \" -f 2 | cut -d "|" -f 1 | sort | uniq -c
  14962 dr2287
 430906 dr341
  49865 dr342
 404323 dr376
   5403 dr4218
    242 dr7965
     92 dr8216
  27968 dr836

Note that there was no numeric verification done on dr376 or dr2287 (or the unrelated dr342) after the reloads for #379, because of the choice made by Miles not to ask for a full dump and to continue with delta loads only, in order to save time.

The other affected data resources were all my loads and were all fully verified, based on record numbers, to be consistent after the change. However, because the rowkeys have been made very difficult to verify and are not present in solr or the main cassandra column family, it is possible they are also still broken and only appear to work now because I manually deleted the other records.

ansell commented 5 years ago

biocache-store is also built to re-persist errors in cassandra if they still exist in the Solr index. Not sure if that happened in this case, but it adds an extra difficulty to diagnosing how this issue was either not solved previously or reoccurred. In future, we should do complete reindexes immediately after fixing any row key issues, although that may not be enough if the following line of code is called during the load/sample/process sequence before indexing has a chance to occur:

https://github.com/AtlasOfLivingAustralia/biocache-store/blob/40a6ddf6fe518238df5a913071edc99a04e5555e/src/main/scala/au/org/ala/biocache/dao/OccurrenceDAOImpl.scala#L1205

ansell commented 5 years ago

Logging of the number of new records was hidden behind test mode. Will fix this so that it emits the number of new records every time, so we can identify these cases from the load log file when the data analyst reviews it:

https://github.com/AtlasOfLivingAustralia/biocache-store/blob/40a6ddf6fe518238df5a913071edc99a04e5555e/src/main/scala/au/org/ala/biocache/load/DwCALoader.scala#L342

ansell commented 5 years ago

Deleted the records with Some( in their row key again, and now reprocessing/resampling/reindexing. Had to resample to fix the Australian Capital Territory name in a spatial layer anyway. The deletion said it was deleting records from the index, but the counts for AVH didn't change. The manual analysis shows the high count should clearly be caused by the records that were sent through the deletion process, so I have no other ideas on how to fix the mess at this point if it isn't fixed by the time that process ends tomorrow.

ansell commented 5 years ago

@nielsklazenga Can you generate a full dump so that we can load from it and see what the issue might be?

ansell commented 5 years ago

I figured out the underlying issue. In the recent past there was a change to the "strip spaces in rowkey" setting for dr376. Because rowkeys are thoroughly hidden by the developers' design, this wasn't immediately visible except when dumping an internal table and comparing the keys.

Once we get a full dump, we can easily fix the issue by deleting any records that are not in the latest load and we can move back to deltas again after that.
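
To make the failure mode concrete, here is a small illustration of how toggling a space-stripping setting produces a second rowkey, and therefore a second ALA UUID, for the same specimen; the rowkey layout shown is an assumption based on the export analysed earlier:

```python
# Illustration only: two distinct rowkeys, and hence two uuids, are generated
# for the same record depending on the "strip spaces in rowkey" setting.
# The rowkey layout (drXXX|catalogNumber) is an assumption.
def make_rowkey(dr_uid, catalog_number, strip_spaces):
    key = catalog_number.replace(" ", "") if strip_spaces else catalog_number
    return f"{dr_uid}|{key}"

print(make_rowkey("dr376", "CANB 101059.1", strip_spaces=True))   # dr376|CANB101059.1
print(make_rowkey("dr376", "CANB 101059.1", strip_spaces=False))  # dr376|CANB 101059.1 -> treated as a new record
```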

ansell commented 5 years ago

@nielsklazenga Do you know who would be best to contact about getting a full dump for dr2287 (BRI AVH) which should also fix its issues in the same way as dr376?

elywallis commented 5 years ago

@ansell the contact in Brisbane is Gill Brown. Gillian.Brown@des.qld.gov.au

nielsklazenga commented 5 years ago

@ansell I can easily create a data dump of the cache I have here, but that might be somewhat out-of-sync with the herbarium databases too. Is this easy for you to do? In other words, if we do it now with what I've got in the cache, can we do it early next year again with updated data?

ansell commented 5 years ago

@nielsklazenga Yes, a dump now and then another next year will be fine.

@elywallis Thanks!

nielsklazenga commented 5 years ago

@ansell I will have the data dump ready by the end of today. Has the automatic upload been turned off, or do you want to?

ansell commented 5 years ago

I will temporarily switch off the automatic job and manually do it this time. Thanks!

ansell commented 5 years ago

The process of getting a BRI AVH (dr2287) full data dump is being tracked in the ALA helpdesk under ticket 25327. Currently planning to do it in the new year, possibly January.

nielsklazenga commented 5 years ago

@ansell MEL AVH data dump is on the upload server now.

ansell commented 5 years ago

While verifying the dump, I ran into a CSV syntax issue on line 3191507:

Line and header sizes were different: expected 73, found 74 headers=[ID, modified, institutionCode, collectionCode, basisOfRecord, occurrenceID, catalogNumber, recordNumber, recordedBy, lifeStage, reproductiveCondition, establishmentMeans, preparations, associatedSequences, associatedTaxa, occurrenceRemarks, previousIdentifications, eventDate, verbatimEventDate, habitat, continent, waterBody, islandGroup, island, country, countryCode, stateProvince, county, locality, verbatimLocality, minimumElevationInMeters, maximumElevationInMeters, verbatimElevation, minimumDepthInMeters, maximumDepthInMeters, verbatimDepth, locationRemarks, decimalLatitude, decimalLongitude, geodeticDatum, coordinateUncertaintyInMeters, coordinatePrecision, verbatimCoordinates, verbatimLatitude, verbatimLongitude, verbatimCoordinateSystem, verbatimSRS, georeferencedBy, georeferencedDate, georeferenceProtocol, georeferenceSources, georeferenceVerificationStatus, georeferenceRemarks, identificationID, identificationQualifier, typeStatus, identifiedBy, dateIdentified, identificationRemarks, scientificName, kingdom, phylum, class, order, family, genus, specificEpithet, infraspecificEpithet, taxonRank, scientificNameAuthorship, nomenclaturalStatus, eventRemarks, bushBlitzExpedition] line=[10067742, 2014-09-04 16:09:07, HO, HO, PreservedSpecimen, , HO 316811, , 881, , , , , , , , , , , Brown, M.J., , , , , Australia, , , , Little Mt Michael.\, 740""", , 41, , , , , , , , , , , , , , , , , , , not available, , , , , , , , , , Asplenium flabellifolium Cav., , , , , Aspleniaceae, Asplenium, flabellifolium, , species, Cav., , , ]

That line appears as the following:

$ sed -n 3191507p /data2/avh-duplicates-analysis/processed/avh_all_2018-11-21/unit.csv 
10067742,"2014-09-04 16:09:07",HO,HO,PreservedSpecimen,,"HO 316811",,881,,,,,,,,,,,"Brown, M.J.",,,,,Australia,,,,"Little Mt Michael.\",740""",,41,,,,,,,,,,,,,,,,,,,"not available",,,,,,,,,,"Asplenium flabellifolium Cav.",,,,,Aspleniaceae,Asplenium,flabellifolium,,species,Cav.,,,

That seemed to imply that the backslash (\) character was being used as an escape character, even though the standard double-quote escaping is also being used on the same value.

Attempting again using \ as the (non-standard) escape character failed earlier, on line 943820, where the escape character is used just before an end quote for a field:

$ sed -n 943820p /data2/avh-duplicates-analysis/processed/avh_all_2018-11-21/unit.csv 
10545048,"2018-06-28 22:15:43",CANB,CBG,PreservedSpecimen,,"CBG 7904033.1",(108)77,"Chapman, C.",,,,sheet,,,,,1977-10-21,,"Grey sand over gravel.",,,,,Australia,AU,"Western Australia",,"5.7 km E of W Road end.","5.7 km E of W Road end.",,,,,,,,-29.8217,115.2694,,50000,,"29,49,18,S,115,16,10,E",,,,,,,,,,,,,,"Crisp, M.D. & Chandler, G.T.\",1996-11-01,,"Daviesia incrassata subsp. teres Crisp",,,,,Fabaceae,Daviesia,incrassata,,subsp.,Crisp,,,

@nielsklazenga Can you look into what the escape character setting should be in this case? It would be ideal if the " could be used consistently, as it is used in some cases but not in others.

nielsklazenga commented 5 years ago

@ansell The backslashes are in the source data that I harvested from other herbaria. They are not put in by my scripts, but they apparently do prevent the proper escape strings from being applied. Running some queries now to find out how many there are.
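
One quick way to locate such records directly in the dump, independent of the source database, would be to scan the raw lines for a backslash immediately before a closing double-quote (the case that breaks parsing); a minimal sketch:

```python
# Sketch: scan the raw dump for the pattern that breaks parsing, i.e. \"
# followed by a comma or end of line. Parsing as CSV would fail here, so this
# works on raw lines. The path is the dump already mentioned above.
import re

PATTERN = re.compile(r'\\"(,|$)')

dump = "/data2/avh-duplicates-analysis/processed/avh_all_2018-11-21/unit.csv"
with open(dump, errors="replace") as f:
    for lineno, line in enumerate(f, start=1):
        if PATTERN.search(line):
            print(lineno, line.rstrip()[:120])
```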

nielsklazenga commented 5 years ago

@ansell I found quite a few records with backslashes in one field or another, but I expect only backslashes before a quote (\") or at the end of the string are problematic. That leaves the following records:

| catalogNumber | field | value |
| --- | --- | --- |
| CANB 482516.1 | identifiedBy | Mast, A.R., Thiele, K.R.\ |
| CBG 55359.1 | recordNumber | 339\ |
| CBG 8503187.1 | recordNumber | 2731\ |
| CBG 8906116.1 | recordNumber | 960\ |
| CBG 8311322.1 | recordNumber | 12540A\ |
| CBG 7807595.1 | habitat | Ridge in saddle. Sandy peat. Heath and open forest of Eucalyptus sp. with understorey of Baeckea, Melaleuca and Banksia spp. and button grass. \ |
| CBG 8210554.1 | locality | 3.8 km N of Urunga on Pacific Highway.\ |
| CBG 22600.1 | locality | CULTIVATED: Canberra Botanic Gardens, A.C.T. Sect. 5.\ |
| CBG 7904033.1 | identifiedBy | Crisp, M.D. & Chandler, G.T.\ |
| CBG 7804986.1 | identifiedBy | Crisp, M.D. & Chandler, G.T.\ |
| CBG 7804982.1 | identifiedBy | Crisp, M.D. & Chandler, G.T.\ |
| CBG 7805109.1 | identifiedBy | Crisp, M.D. & Chandler, G.T.\ |
| CBG 7904028.1 | identifiedBy | Crisp, M.D. & Chandler, G.T.\ |
| CBG 7904022.1 | identifiedBy | Crisp, M.D. & Chandler, G.T.\ |
| CBG 7901897.1 | identifiedBy | Crisp, M.D. & Chandler, G.T.\ |
| CBG 7901897.2 | identifiedBy | Crisp, M.D. & Chandler, G.T.\ |
| CBG 7901899.1 | identifiedBy | Crisp, M.D. & Chandler, G.T.\ |
| CBG 21918.1 | identifiedBy | Crisp, M.D.\ |
| HO 316811 | locality | Little Mt Michael.\,740""" |

Do you want me to create another data dump without these records?

ansell commented 5 years ago

We support backslashes when they are used consistently as the escape character for an entire file.

If backslashes are used, double quotes need to be escaped using a backslash (\"), and backslashes on their own need to be escaped using another backslash (\\).

If double-quotes are used, only double-quotes need to be escaped, by doubling them (""), and nothing else needs escaping. Using double-quotes is the preferred option and matches the CSV specification (https://tools.ietf.org/html/rfc4180#section-2), but we can support either option if it is used consistently.

Changing how HO 316811 is serialised to properly escape the double quotes may make it possible to load without changing any other records, although I haven't tested the rest of the file after that point for other issues.

Broken (because there are both backslash and additional double-quote escaping in a single field):

"Little Mt Michael.\",740"""

Solution one with double-quote escaping (keeping the literal backslash):

"Little Mt Michael.\"",740"""

Solution two with double-quote escaping (dropping the backslash):

"Little Mt Michael."",740"""
ansell commented 5 years ago

The current plan negotiated with Niels is to patch the HO 316811 record inside of the ALA load sequence. This will require creation of a new jenkins job on ala-dury for preprocessing, followed by triggering a load of the patched file on aws-scjenkins. The patch is temporary while it is arranged for the original data provider to fix it, but the relocation of the main job to ala-dury may be permanent so that we can manage any future preprocessing there.
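
A sketch of what that preprocessing patch could look like, rewriting only the broken HO 316811 field before the load (it uses solution two from the previous comment; the file names and the actual jenkins wiring are not shown here):

```python
# Sketch of the preprocessing step: rewrite the single broken HO 316811
# locality field before loading. BROKEN is the serialisation currently in the
# dump; FIXED is "solution two" from the earlier comment. Paths are illustrative.
BROKEN = '"Little Mt Michael.\\",740"""'
FIXED = '"Little Mt Michael."",740"""'

with open("unit.csv", encoding="utf-8", errors="replace") as src, \
        open("unit.patched.csv", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(line.replace(BROKEN, FIXED))
```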

ansell commented 5 years ago

The new job on ala-dury is named MEL AVH

ansell commented 5 years ago

Loading failed with the following error (using a biocache-store version before the weekend update to using dwca-io:2.3):

aws-bstore-4b 2018-12-03 15:39:01,320 ERROR: [DataLoader] - The archive given is a folder with more or less than 1 data files having a csv, txt or tab suffix
org.gbif.dwca.io.UnsupportedArchiveException: The archive given is a folder with more or less than 1 data files having a csv, txt or tab suffix
    at org.gbif.dwca.io.ArchiveFactory.openArchive(ArchiveFactory.java:228)
    at au.org.ala.biocache.load.DwCALoader.loadArchive(DwCALoader.scala:183)
    at au.org.ala.biocache.load.DwCALoader$$anonfun$load$1.apply(DwCALoader.scala:126)
    at au.org.ala.biocache.load.DwCALoader$$anonfun$load$1.apply(DwCALoader.scala:117)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at au.org.ala.biocache.load.DwCALoader.load(DwCALoader.scala:117)
    at au.org.ala.biocache.load.Loader.load(Loader.scala:209)
    at au.org.ala.biocache.load.Loader$$anonfun$main$4.apply(Loader.scala:96)
    at au.org.ala.biocache.load.Loader$$anonfun$main$4.apply(Loader.scala:96)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at au.org.ala.biocache.load.Loader$.main(Loader.scala:95)
    at au.org.ala.biocache.cmd.CMD2$.main(CMD2.scala:133)
    at au.org.ala.biocache.cmd.CMD2.main(CMD2.scala)
Exception in thread "main" org.gbif.dwca.io.UnsupportedArchiveException: The archive given is a folder with more or less than 1 data files having a csv, txt or tab suffix
    at org.gbif.dwca.io.ArchiveFactory.openArchive(ArchiveFactory.java:228)
    at au.org.ala.biocache.load.DwCALoader.loadArchive(DwCALoader.scala:183)
    at au.org.ala.biocache.load.DwCALoader$$anonfun$load$1.apply(DwCALoader.scala:126)
    at au.org.ala.biocache.load.DwCALoader$$anonfun$load$1.apply(DwCALoader.scala:117)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at au.org.ala.biocache.load.DwCALoader.load(DwCALoader.scala:117)
    at au.org.ala.biocache.load.Loader.load(Loader.scala:209)
    at au.org.ala.biocache.load.Loader$$anonfun$main$4.apply(Loader.scala:96)
    at au.org.ala.biocache.load.Loader$$anonfun$main$4.apply(Loader.scala:96)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at au.org.ala.biocache.load.Loader$.main(Loader.scala:95)
    at au.org.ala.biocache.cmd.CMD2$.main(CMD2.scala:133)
    at au.org.ala.biocache.cmd.CMD2.main(CMD2.scala)

@sadeghim when you are back at work, can you look into why this occurred? It is possible I have broken the archive when reassembling the zip file after patching the record.

This is what unzip -l showed for me when debugging the issue, which looks the same as for other archives that I have used the same process for in other jobs, such as the ANWC job, which also uses a Darwin Core Archive compressed as a zip file.

$ unzip -l /mnt/data/alaadmin/ala/dr376/Mapped-MEL_AVH.zip
Archive:  /mnt/data/alaadmin/ala/dr376/Mapped-MEL_AVH.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
     4895  2018-09-05 11:42   srv/jobs/dr376/processed/extracted/eml.xml
    18260  2018-09-05 11:42   srv/jobs/dr376/processed/extracted/meta.xml
2479201976  2018-12-03 14:40   srv/jobs/dr376/processed/extracted/unit.csv
---------                     -------
2479225131                     3 files
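
One thing that could be ruled out cheaply is the nested srv/jobs/... paths, by repacking the three files at the top level of the zip; a hypothetical sketch (it reads each member fully into memory, which is acceptable for a one-off check):

```python
# Hypothetical check only: repack eml.xml, meta.xml and unit.csv at the top
# level of the archive, to rule out the nested paths as the cause of the
# "more or less than 1 data files" error. Source path is from the listing above.
import os
import zipfile

SRC = "/mnt/data/alaadmin/ala/dr376/Mapped-MEL_AVH.zip"
DST = "/mnt/data/alaadmin/ala/dr376/Mapped-MEL_AVH-flat.zip"

with zipfile.ZipFile(SRC) as src, zipfile.ZipFile(DST, "w", zipfile.ZIP_DEFLATED) as dst:
    for info in src.infolist():
        if info.is_dir():
            continue
        # Drop the directory prefix, keep only the file name at the zip root.
        with src.open(info) as member:
            dst.writestr(os.path.basename(info.filename), member.read())
```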
nielsklazenga commented 5 years ago

@ansell The thing you are 'Waiting for...', is that something I am supposed to provide?

ansell commented 5 years ago

Sorry, that is a fairly generic column on the Kanban board. In this case, it is because all of the three data resources (MEL_AVH, BRI_AVH, and NZVH) have been loaded from full dumps and are "waiting for" verification. That will be possible after the reindex tonight.

To do the verification, I will run the job to delete records that were not in the full dumps, and then verify the duplicate record analysis.

When I get to verifying it, it will move to the "QA" column on the Kanban board.

nielsklazenga commented 5 years ago

Thanks @ansell. Just happened to open the issue to see how it was tracking. Thanks for the information.

ansell commented 5 years ago

@nielsklazenga I tried to load MEL_AVH today from the full Darwin Core Archive and it is failing. We upgraded the GBIF dwca-io library in production this morning and it seems to be strict about the files referenced in meta.xml being present. Currently the MEL_AVH archive looks like it may be missing the extension file, identificationhistory.csv, that is defined in the meta.xml file:

Exception in thread "main" org.gbif.dwc.UnsupportedArchiveException: DwC-A data file »identificationhistory.csv« does not exist
    at org.gbif.dwc.ArchiveFile.validate(ArchiveFile.java:127)
    at org.gbif.dwc.ArchiveFile.validateAsExtension(ArchiveFile.java:116)
    at org.gbif.dwc.Archive.validate(Archive.java:150)
    at org.gbif.dwc.DwcFiles.fromLocation(DwcFiles.java:29)
    at au.org.ala.biocache.load.DwCALoader.loadArchive(DwCALoader.scala:185)
nielsklazenga commented 5 years ago

@ansell I have always delivered the identification history (except with this data dump), but I was under the impression that ALA doesn't do anything with it. I can get you a new archive that includes the file, but, if ALA still doesn't do anything with the identification history, you might as well just remove that bit from the schema.xml.

ansell commented 5 years ago

Manually removing that part from the schema would be fun (but probably possible). It wasn't an issue in the past with the very old version of dwca-io that we had been using previously; it just became an issue today with the upgrade. A new copy of the dump would probably be the most useful path if the file is going to be there again in the future.

nielsklazenga commented 5 years ago

@ansell Okay, coming your way later today.

nielsklazenga commented 5 years ago

Hi @ansell, you can find the new data dump on the upload server. When loading it, could you make sure the option to delete existing values for fields that are delivered as empty cells is turned on? That will get rid of a lot of fields that we don't use anymore and some very early aggregation errors (or just incorrect provided values) as well.

ansell commented 5 years ago

@nielsklazenga The filename you added, determinationhistory.csv didn't match the file name in the meta.xml file, identificationhistory.csv. I manually patched it on my end but you should also fix it on your end.

I am not familiar with an option to delete existing values for fields that are delivered as empty cells. I will look into that.

nielsklazenga commented 5 years ago

@ansell, sorry for not getting back to you earlier. I will change the name of the identification history file in the meta.xml (or the other way around). @M-Nicholls will know what I am talking about re the null values, but it probably doesn't matter so much for now, as we will be doing this again in some form or another anyway.

nielsklazenga commented 5 years ago

@ansell. I had reason to look at AVH just then and the numbers of records for each herbarium look good. Can you confirm that the upload was successful?

Also, the reason I was looking was that someone told me that the links to AVH records in VicFlora were broken (these links are only in the editing interface, so this is not as bad as it sounds). Was it expected that the UUIDs would change?

ansell commented 5 years ago

Yes, the ALA UUIDs are not stable if the loaded identifiers change. In this case, the change involving space characters occurred a while ago, and at that time a hack was put in to specifically support AVH. Both the first and second duplicate record bugs--the biocache-store Darwin Core Archive key bug, and not including the removeSpaces hack after that--mean that the original UUIDs are not likely to match the current records. Any references into the ALA should be done using searches rather than the record URLs.
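
For example, a lookup by catalogNumber through the biocache web service, rather than a stored record URL, might look like the following sketch (it assumes the standard /ws/occurrences/search endpoint and that catalogNumber within the AVH data hub is specific enough):

```python
# Sketch: resolve an AVH record by search rather than a stored ALA record URL.
# The example catalogNumber is taken from the table at the top of this issue.
import json
import urllib.parse
import urllib.request

query = 'catalogNumber:"CANB 101059.1"'
params = {"q": query, "fq": "data_hub_uid:dh9", "pageSize": 1}
url = "https://biocache.ala.org.au/ws/occurrences/search?" + urllib.parse.urlencode(params)

with urllib.request.urlopen(url) as resp:
    result = json.load(resp)

occurrences = result.get("occurrences", [])
print(occurrences[0]["uuid"] if occurrences else "no match")
```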

nielsklazenga commented 5 years ago

Thanks @ansell for all your work on this. I'll try a delta next week.

Putting in a search string is a good temporary solution for my issue in VicFlora. VicFlora is meant to reload data from AVH every day, so the UUIDs should be stable enough for that, especially since I know when they are likely to change, but I never got around to quite finishing that bit.