OccurrenceDAOImpl.getByRowKey and others actually require the ALA internal UUID

AtlasOfLivingAustralia / biocache-store

Occurrence processing, indexing and batch processing

OccurrenceDAOImpl.getByRowKey and others actually require the ALA internal UUID #295

Closed ansell closed 2 years ago

ansell commented 5 years ago

The published methods in OccurrenceDAO that include ByRowKey in their name actually require the ALA internal UUID, and fail if a public rowkey is given. This makes it impossible to debug rowkey issues using the OccurrenceDAO interface: since the switch to Cassandra 3, every attempt to debug has resorted to dumping the entire occ_uuid table and manually correlating. The previous behaviour in OccurrenceDAO should be restored, and only methods that claim to get by UUID should accept the ALA internal UUIDs. The ALA Data Analysts require the ability to debug issues involving the original row keys. The database design (where joins based on the original row keys are not efficient/possible) and biocache-store must be fixed to enable this.

https://github.com/AtlasOfLivingAustralia/biocache-store/blob/40a6ddf6fe518238df5a913071edc99a04e5555e/src/main/scala/au/org/ala/biocache/dao/OccurrenceDAOImpl.scala#L106

ansell commented 5 years ago

The natural behaviour was reversed when switching to occ_uuid: the getByUuid methods now call back to the getByRowKey methods. Once it was made impossible to fetch records by their public rowkey, the order should have been switched as well, so that getByUuid is the last method call in the chain.
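
A minimal sketch of the call order described above, assuming two in-memory maps stand in for the Cassandra tables. `Occurrence`, `OccurrenceLookupSketch` and both constructor parameters are hypothetical names for illustration; only `getByRowKey` and `getByUuid` come from the discussion, and this is not the actual OccurrenceDAOImpl code.

```scala
// Hypothetical stand-in for a stored record; not the real biocache-store type.
final case class Occurrence(uuid: String, rowKey: String, fields: Map[String, String])

class OccurrenceLookupSketch(
    rowKeyToUuid: Map[String, String],      // assumed to play the role of the occ_uuid table
    recordsByUuid: Map[String, Occurrence]  // assumed to play the role of the record table
) {
  // Accepts only the ALA internal UUID and is the last call in the chain.
  def getByUuid(uuid: String): Option[Occurrence] =
    recordsByUuid.get(uuid)

  // Accepts only the public rowkey: resolve it to a UUID first, then delegate.
  def getByRowKey(rowKey: String): Option[Occurrence] =
    rowKeyToUuid.get(rowKey).flatMap(getByUuid)
}
```

With this ordering, passing a UUID to `getByRowKey` (or a rowkey to `getByUuid`) returns `None` instead of silently succeeding, so each method name matches the identifier it accepts.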

ansell commented 5 years ago

When the row key comes in through the occurrenceID field, the ALA Internal UUID overwrites it when showing anything to users, including in the data resource exports that are sent to GBIF. We should be preserving access to everything the user sends us, particularly their primary key field, and the current system is designed to make that very difficult for those who have access to Cassandra (dumping and manually correlating a Cassandra table) and impossible for everyone else.

djtfmartin commented 5 years ago

For the occurrenceID, I think this is an indexing issue and a matter of how the downloads are configured. We should be indexing the raw occurrenceID and exposing it in downloads alongside our own UUID for the record, noting that not all datasets provide an occurrenceID.

But this is a separate discussion / issue from the content of the occ_uuid table and the synthetic key we create by combining one or more fields into something unique within the dataset. This synthetic key shouldn't be exposed via our API, as it is possible that it will change over time, while we can keep the UUID stable.
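
To illustrate why such a synthetic key is less stable than the UUID, here is a rough sketch of building one from configured fields. The `|` separator, the data resource prefix and every name in `SyntheticKeySketch` are assumptions made for illustration, not the format biocache-store actually uses.

```scala
object SyntheticKeySketch {
  // Combine the values of one or more configured fields into a key that is
  // unique within a dataset. Returns None when no fields are configured or
  // when the record lacks a value for any of them.
  def syntheticKey(
      dataResourceUid: String,      // hypothetical dataset identifier, e.g. "dr1"
      keyFields: Seq[String],       // hypothetical per-dataset configuration
      record: Map[String, String]   // raw field values supplied by the provider
  ): Option[String] = {
    val values = keyFields.map(record.get)
    if (keyFields.nonEmpty && values.forall(_.isDefined))
      Some((dataResourceUid +: values.flatten).mkString("|"))
    else
      None
  }
}
```

If the configured fields for a dataset change, or a provider edits one of the contributing values, the same record produces a different key, whereas a UUID assigned once does not depend on either.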

djtfmartin commented 5 years ago

The original occurrenceID is provided in Darwin Core downloads, is visible on occurrence pages, and is indexed and searchable via our API, so it isn't overwritten or hidden. We should probably add it to the legacy CSV downloads.

https://biocache.ala.org.au/ws/occurrences/search?q=occurrence_id:%22urn:catalog:NSW%20Office%20of%20Environment%20and%20Heritage:BioNet%20Atlas%20of%20NSW%20Wildlife:NSW38060%22

But there are still clearly stability issues with occurrenceID, particularly when data providers are still providing occurrenceIDs with whitespace like this:

urn:catalog:NSW Office of Environment and Heritage:BioNet Atlas of NSW Wildlife:ABBBS1410333

not to mention the use of government agency offices that have a habit of changing with governments or policies.
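
As a small illustration of the whitespace problem, an occurrenceID like the one above has to be quoted and percent-encoded before it can be used in a search URL. This is only a sketch: `OccurrenceIdQuerySketch` and its methods are hypothetical helpers, and the base URL is copied from the search link earlier in this comment.

```scala
import java.net.URLEncoder

object OccurrenceIdQuerySketch {
  // Base of the search URL quoted earlier in this thread.
  private val searchBase =
    "https://biocache.ala.org.au/ws/occurrences/search?q=occurrence_id:"

  // URLEncoder targets form encoding and turns spaces into '+', so swap
  // those for %20. A literal '+' in the input is already encoded as %2B,
  // which makes the replacement safe.
  def percentEncode(value: String): String =
    URLEncoder.encode(value, "UTF-8").replace("+", "%20")

  // Wrap the ID in double quotes so the whole value is matched as one term.
  def searchUrl(occurrenceId: String): String =
    searchBase + percentEncode("\"" + occurrenceId + "\"")
}
```

Note that `URLEncoder` also encodes the colons as `%3A`, so the result differs textually from the hand-written link above while referring to the same ID.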

brucehyslop commented 2 years ago

biocache-store has been replaced by pipelines.