Change harvester approach to update instead of delete and reinsert

WingLongitude / lontra-harvester

Lontra is a tool used as a Harvester to ingest biodiversity data

MIT License

Change harvester approach to update instead of delete and reinsert #9

Closed tigreped closed 9 years ago

tigreped commented 9 years ago

The lontra harvester currently deletes occurrence data and inserts it again upon re-indexing. It should support update operations instead, so that each record's auto_id is maintained.

cgendreau commented 9 years ago

Yes, as designed. There is more than one reason for that:

- How do you retrieve the old record? Which identifier will you use?
- Performance will also be an issue, considering you need to find the previous version first.
- You also need to consider that the identifier itself can change.

I don't see "keep the auto_id" as a good reason for implementing updates, but keeping the history per record might be.

tigreped commented 9 years ago

Identifier: doesn't sourcefileid/resource_uuid + dwcaid suffice? Anyway, another option for history keeping is to add a previous-occurrence auto_id field that links the new record to its older version and, instead of updating, keep inserting the new one. What do you think?

cgendreau commented 9 years ago

resource-uuid+dwcaid is the best we can do. This whole thing requires proper db and software design. The history will require unstructured storage (e.g. jsonb) to ensure future schema changes won't require updating the history.
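
A minimal sketch of the schema-free history idea, in Python for brevity rather than lontra's Java code. Everything here is an assumption made for illustration: the function names `record_history` and `read_history` are hypothetical, the in-memory dict stands in for a history table, and the JSON text stands in for a jsonb column. The point it shows is that a snapshot serialized as JSON carries its own field names, so older snapshots stay readable when the occurrence schema later gains or loses columns.

```python
import json


def record_history(history, resource_uuid, dwcaid, fields, version):
    """Append a schema-free snapshot of one occurrence to its history.

    history -- dict keyed by (resource_uuid, dwcaid); hypothetical stand-in
               for a history table with a jsonb column
    fields  -- the occurrence's columns at harvest time, as a dict
    """
    # Serialize the row as JSON text so the snapshot does not depend on
    # the current table schema.
    snapshot = json.dumps(fields, sort_keys=True)
    history.setdefault((resource_uuid, dwcaid), []).append((version, snapshot))


def read_history(history, resource_uuid, dwcaid):
    """Return the stored snapshots as (version, dict) pairs, oldest first."""
    stored = history.get((resource_uuid, dwcaid), [])
    return [(version, json.loads(snapshot)) for version, snapshot in stored]
```

Snapshots recorded before and after a schema change can sit side by side, because nothing in the history refers to the occurrence table's column list.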

cgendreau commented 9 years ago

One idea:

In liger

Change occurrence PK to a composed key (resource-uuid+dwcaid)
Add a field to occurrence to identify removed records (e.g. last_modified)

In lontra

Don't remove previous records
We already use Hibernate's saveOrUpdate so updates will already work
Delete records that are not in the archive anymore using the new last_modified.
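
The steps above can be sketched as follows, in Python for brevity rather than lontra's Java/Hibernate code. This is an illustration under stated assumptions, not the project's implementation: the `harvest` function, the dict standing in for the occurrence table, and the integer `run` marker written to `last_modified` are all hypothetical. Rows are keyed by (resource_uuid, dwcaid), updated in place when they already exist, and deleted afterwards if the current run did not touch them.

```python
from itertools import count

_auto_ids = count(1)  # hypothetical stand-in for the sequence behind auto_id


def harvest(table, resource_uuid, archive, run):
    """Upsert one archive into `table`, then delete rows the archive no longer has.

    table   -- dict keyed by (resource_uuid, dwcaid), the proposed composite PK
    archive -- dict mapping dwcaid to the record's fields
    run     -- increasing harvest marker, written to last_modified
    """
    for dwcaid, fields in archive.items():
        key = (resource_uuid, dwcaid)
        row = table.get(key)
        if row is None:
            # New record: allocate an auto_id once.
            table[key] = {"auto_id": next(_auto_ids),
                          "fields": fields,
                          "last_modified": run}
        else:
            # Existing record: update in place, auto_id is left untouched.
            row["fields"] = fields
            row["last_modified"] = run

    # Rows of this resource that this run did not touch are no longer
    # in the archive, so they are removed.
    stale = [key for key, row in table.items()
             if key[0] == resource_uuid and row["last_modified"] < run]
    for key in stale:
        del table[key]
    return stale
```

In this sketch auto_id stays on the row as an ordinary column, so it survives updates without being the primary key.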

tigreped commented 9 years ago

That could work, but I guess some won't be happy with a URL such as /occurrences/388f0f20-0d2a-4b81-b91f-6a3ce9ac6e52urn:catalog:inpa:herbarium:805. Having said that, I think we should schedule a hangout to discuss the possibilities further. Another concern is that keeping the entire history of the records in postgres may not be the best approach right now, also because we can already access this information through the versions of the resource in the IPT link. The problem for me is simple: maintain the first auto_id generated for an occurrence as it is updated (that is, as new versions of the occurrence are inserted). That way I can preserve the URL. In that case, the best immediate solution would be either 1) delete the record but keep its auto_id in a new field in the occurrence table (e.g. "parent_auto_id"), or 2) update the entire row, which will reduce performance, but only in update cases, so I guess it is probably tolerable.

cgendreau commented 9 years ago

You misunderstand the point. URL, update and history SHALL be 3 different issues. I never said auto_id should be removed, simply it should not be the PK anymore.

tigreped commented 9 years ago

Me neither. I understand the idea of changing the PK to the compound key resource_uuid + dwcaid and keeping auto_id, but I guess I don't understand which of the 3 issues it is supposed to solve.

tigreped commented 9 years ago

Issue solved, as SiBBr will revert its behavior to be the same as Canadensys.

cgendreau commented 9 years ago

We won't update records for now, but we will keep the id between harvests when possible. See Issue #14
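
One way to keep the id across a delete-and-reinsert harvest can be sketched as below. This is not taken from Issue #14 or from lontra's code; it is an assumption-laden illustration in Python, where the `reharvest` function, the dict keyed by auto_id, and the `id_sequence` iterator are hypothetical. Before deleting a resource's rows, the sketch remembers which auto_id belonged to which dwcaid, then reuses that auto_id for any reinserted record with the same dwcaid. Records whose identifier changed get a fresh auto_id, which is the "when possible" limit.

```python
from itertools import count


def reharvest(table, resource_uuid, archive, id_sequence):
    """Delete and reinsert one resource, reusing auto_id where a match exists.

    table       -- dict keyed by auto_id (auto_id stays the primary key here)
    archive     -- dict mapping dwcaid to the record's fields
    id_sequence -- iterator yielding fresh auto_id values
    """
    # Remember the auto_id of every row that is about to be deleted.
    previous = {row["dwcaid"]: auto_id
                for auto_id, row in table.items()
                if row["resource_uuid"] == resource_uuid}
    for auto_id in previous.values():
        del table[auto_id]

    for dwcaid, fields in archive.items():
        # Reuse the old auto_id when the same dwcaid was present before.
        auto_id = previous.get(dwcaid)
        if auto_id is None:
            auto_id = next(id_sequence)
        table[auto_id] = {"resource_uuid": resource_uuid,
                          "dwcaid": dwcaid,
                          "fields": fields}
```

A caller would pass one shared sequence across harvests, for example `count(1)`, so that fresh ids never collide with reused ones.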