BiologicalRecordsCentre / iRecord

Repository to store and track enhancements, issues and tasks regarding the iRecord website.
http://irecord.org.uk

UKSI incremental update ordering problem #1712

Open burkmarr opened 1 month ago

burkmarr commented 1 month ago

I am opening this issue so that this doesn't fall off the radar. In the run-up to the recent UKSI incremental update (run on 12/06/2024) we discovered an issue with the ordering of a couple of operations. The incremental update code has logic for sequencing operations (described below) that works in almost all cases, but this case was an exception.

For the update we worked around it by changing the value of 'Processed_Date' for the extract name operation (156970) in the imported CSV to 29/03/2024 (from 28/03/2023), so that the promote name (op 155972) runs before the extract name (op 156970). This was a temporary workaround only.
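For anyone repeating the workaround, the edit amounts to changing one cell in the imported CSV. A minimal Python sketch follows, assuming the file has an operation identifier column (called `Operation_Key` here, which is a hypothetical name) alongside `Processed_Date`. The sample rows and dates are placeholders, not real extract data.

```python
import csv
import io


def shift_processed_date(csv_text, op_key, new_date,
                         key_column="Operation_Key",  # hypothetical column name
                         date_column="Processed_Date"):
    """Return CSV text with the Processed_Date of one operation replaced."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames,
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Only the row for the chosen operation is modified.
        if row[key_column] == op_key:
            row[date_column] = new_date
        writer.writerow(row)
    return out.getvalue()


# Placeholder rows: both operations start in the same batch.
sample = (
    "Operation_Key,Operation,Processed_Date\n"
    "155972,promote name,01/01/2000\n"
    "156970,extract name,01/01/2000\n"
)
# Move the extract name operation into a later batch.
patched = shift_processed_date(sample, "156970", "02/01/2000")
print(patched)
```

Moving the extract into a later batch works because batch date is the first sort criterion, so it takes precedence over the operation-type ordering.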

Explanation of the problem from an email sent by @johnvanbreda:

155972 and 156970 – [...] these 2 operations do interact. The first promotes a name to be the accepted name for an organism and the second then extracts the synonym (formerly accepted name) into a new, separate organism. That’s the intention, but there is an issue with the sequencing so these operations are run the wrong way round. When we process, operations are run in order of their batch processing date (processed on column in the spreadsheet, batch_processed_on in our database), then within each batch operations are run in an order based on the operation type. Here’s the order of the tasks:

    'new taxon' => 1, 
    'extract name' => 2, 
    'amend taxon' => 3, 
    'promote name' => 4, 
    'rename taxon' => 5, 
    'merge taxa' => 6, 
    'add synonym' => 7, 
    'amend name' => 8, 
    'move name' => 9, 
    'deprecate name' => 10, 
    'remove deprecation' => 11, 

This means that for 2 operations in the same batch (as these are), the extract name will run before the promote name, irrespective of their order in the spreadsheet, so the extract fails as the name is still the accepted name.
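The sequencing rule described above can be illustrated with a short Python sketch. This is not the actual update code: the `Operation` record, its field names and the batch date are assumptions made for illustration, and only the priority table is taken from the email.

```python
from dataclasses import dataclass
from datetime import date

# Priority table copied from the email above (lower values run first).
OPERATION_PRIORITY = {
    "new taxon": 1,
    "extract name": 2,
    "amend taxon": 3,
    "promote name": 4,
    "rename taxon": 5,
    "merge taxa": 6,
    "add synonym": 7,
    "amend name": 8,
    "move name": 9,
    "deprecate name": 10,
    "remove deprecation": 11,
}


@dataclass(frozen=True)
class Operation:
    # Hypothetical record; field names are illustrative only.
    op_id: int
    op_type: str
    batch_processed_on: date


def run_order(operations):
    """Sort by batch date first, then by operation type priority."""
    return sorted(
        operations,
        key=lambda op: (op.batch_processed_on,
                        OPERATION_PRIORITY[op.op_type]),
    )


# Spreadsheet order: promote first, then extract, both in the same batch.
same_batch = date(2000, 1, 1)  # placeholder date
ops = [
    Operation(155972, "promote name", same_batch),
    Operation(156970, "extract name", same_batch),
]
order = [op.op_id for op in run_order(ops)]
print(order)  # [156970, 155972]: the extract sorts ahead of the promote
```

Because the spreadsheet row order is not part of the sort key, no reordering of rows within a batch can change the outcome.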

John asked Chris Raper to check if this description of the task running order is correct, or if this is a problem in the history data. Chris' reply was:

The problem of operation order is a bit of a knotty one because the precise order of the operations has varied over time. Mike would set them up and then we'd run the imports and find that the operations needed tweaking to put some things higher or lower. I've been wracking my brains to work out if there is a way for us to work out or replicate the sequence of operations at the time the import was run and the only way I can think is to have a running number in the log that records the sequence that each operation was executed. I don't think it would be possible to record the exact row order because Mike updates them in blocks using single, large UPDATE (etc) commands. But I'm very rusty on my VBA so it might take a while to work it all out and it would probably require a new field.
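If a running number were added to the log as Chris suggests, the import side could use it roughly as sketched below. This is a sketch only: the `sequence` field does not exist yet and its name is hypothetical, the dictionary layout is illustrative, and falling back to type priority for rows without a sequence number is an assumption rather than an agreed design.

```python
from datetime import date

# Abbreviated priority table for illustration; the full list is quoted
# earlier in this issue.
TYPE_PRIORITY = {"extract name": 2, "promote name": 4}


def sort_key(op):
    """Order by batch date, then by logged sequence if present,
    otherwise by operation type priority."""
    seq = op.get("sequence")  # hypothetical new field
    if seq is not None:
        # Sequenced rows sort ahead of unsequenced rows in the same batch
        # (an assumption made for this sketch).
        return (op["batch_processed_on"], 0, seq)
    return (op["batch_processed_on"], 1, TYPE_PRIORITY[op["type"]])


batch = date(2000, 1, 1)  # placeholder date
ops = [
    {"id": 156970, "type": "extract name",
     "batch_processed_on": batch, "sequence": 2},
    {"id": 155972, "type": "promote name",
     "batch_processed_on": batch, "sequence": 1},
]
ordered = [op["id"] for op in sorted(ops, key=sort_key)]
print(ordered)  # [155972, 156970]: the promote now runs before the extract
```

With a recorded sequence, the order within a batch follows what actually happened at the source rather than a fixed ranking of operation types.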

So this problem will require some work to overcome, first at Chris' end and then at ours. If it is not resolved by the time the next incremental update is run, we should bear in mind that the problem could arise again and that the workaround described above can be reused.

I suggest that this issue remain open until a full solution is implemented.