Clinical-Genomics / housekeeper

File data orchestrator
MIT License

Housekeeper bundle version naming #192

Open seallard opened 9 months ago

seallard commented 9 months ago

Bug

A bundle version consists of raw data (?) and data from an analysis of a case.

Currently, a bundle version is identified by a date in the paths of its files. If you try to create a new version for a bundle on the same day another version was created, nothing happens. The underlying assumption, that analyses for a case are only ever run on separate days, does not hold.

Steps to reproduce

  1. housekeeper add bundle sadicebear
  2. housekeeper add version sadicebear
  3. housekeeper add version sadicebear

Only one version exists after running these commands.
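A minimal sketch of why the second `add version` has no visible effect, assuming the version directory is named after the creation date only. The helper `version_dir` and the `housekeeper-bundles` root are illustrative and are not taken from housekeeper's code:

```python
import datetime
from pathlib import Path


def version_dir(root: Path, bundle: str, created_at: datetime.datetime) -> Path:
    """Hypothetical helper: derive the version directory from the date only."""
    return root / bundle / created_at.date().isoformat()


root = Path("housekeeper-bundles")
first = version_dir(root, "sadicebear", datetime.datetime(2024, 2, 5, 9, 0))
second = version_dir(root, "sadicebear", datetime.datetime(2024, 2, 5, 15, 58))

# Two versions created hours apart on the same day map to the same path,
# so the second one cannot be told apart from the first.
same_day_clash = first == second
```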

Suggested fix

Scheduled for technical refinement.

We could instead use a version number that starts over for each bundle:

sadicebear/1
sadicebear/2
happypenguin/1

Or we could use a GUID (globally unique identifier) for versions:

sadicebear/55b7209d-1c6c-450b-a0b9-e22aa01fb2ca
sadicebear/23b72404-efb0-4625-a5b6-a351762b7cb9
happypenguin/e8107896-d319-4ce4-92b2-30e95d882c6b

Or any other naming pattern which uniquely identifies a version for a bundle.
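The two naming schemes above could look roughly like this. This is a sketch under assumptions: `next_version_number` and `new_version_guid` are hypothetical names, and the counter is derived from directory names purely for illustration (in practice it would more likely come from the database, where it can be allocated safely when two stores run at once).

```python
import uuid
from pathlib import Path


def next_version_number(bundle_dir: Path) -> int:
    """Return 1 + the highest numeric version directory under bundle_dir."""
    if not bundle_dir.is_dir():
        return 1
    existing = [
        int(path.name)
        for path in bundle_dir.iterdir()
        if path.is_dir() and path.name.isdigit()  # skips old date-named dirs
    ]
    return max(existing, default=0) + 1


def new_version_guid() -> str:
    """Return a random GUID: unique, but it carries no ordering."""
    return str(uuid.uuid4())
```

With numbers, the most recent version is simply the highest one. With GUIDs, the creation timestamp would still have to be stored somewhere to find the latest version.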

Notes

The bug was noticed by @beatrizsavinhas when restarting a microsalt analysis.

Microsalt cases that have already been analyzed fail to store (and to report QC and upload) when they are re-run, because the bundle already contains files from the previous analysis made on the same day. This requires manual intervention: deleting the old bundle in housekeeper.

2024-02-06 08:45:38 hasta.scilifelab.se housekeeper.include[116038] INFO created new bundle version dir: /home/proj/production/housekeeper-bundles/bigkid/2024-02-05
2024-02-06 08:45:38 hasta.scilifelab.se cg.cli.workflow.commands[116038] ERROR Error storing deliverables for case bigkid - [Errno 17] File exists: '/home/proj/production/microbial/results/ACC13796_2024.2.5_15.58.22/sampleinfo.json' -> '/home/proj/production/housekeeper-bundles/bigkid/2024-02-05/sampleinfo.json'
2024-02-06 08:45:38 hasta.scilifelab.se cg.cli.workflow.commands[116038] ERROR Error storing bigkid: [Errno 17] File exists: '/home/proj/production/microbial/results/ACC13796_2024.2.5_15.58.22/sampleinfo.json' -> '/home/proj/production/housekeeper-bundles/bigkid/2024-02-05/sampleinfo.json'

seallard commented 9 months ago

This is a general bug affecting all bundles in housekeeper. The version name (which needs to be unique within a bundle) is a date, so if another version is created for the bundle on the same day, there will be a clash.

We are seeing this for microsalt cases because they are re-run on the same day more often than other cases.
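For illustration, the `[Errno 17] File exists: 'src' -> 'dst'` error from the log can be reproduced in isolation. The sketch below assumes files are hard-linked into the version directory with `os.link`, which is a guess (the log only shows a two-path Errno 17), and `include_twice` is a hypothetical name:

```python
import errno
import os
import tempfile
from pathlib import Path
from typing import Optional


def include_twice() -> Optional[OSError]:
    """Link the same result file into a date-named version dir twice."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "results" / "sampleinfo.json"
        source.parent.mkdir()
        source.write_text("{}")
        version = Path(tmp) / "bundles" / "bigkid" / "2024-02-05"
        version.mkdir(parents=True)
        target = version / source.name
        os.link(source, target)  # first store of the day succeeds
        try:
            os.link(source, target)  # same-day re-run targets the same path
        except FileExistsError as error:
            return error
    return None


error = include_twice()
```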

diitaz93 commented 9 months ago

A first thought: version numbers would be better, because we can easily identify which version is the most recent.

beatrizsavinhas commented 9 months ago

Another question here: would we like to keep all the files for all the runs? For example, in microSALT we often correct the reference organism for some samples and rerun the analysis with exactly the same data. In this case we are only interested in keeping the results of the corrected run and replacing the files in housekeeper, so there is no need to keep both versions. We might want to leave the decision to delete the previous version to the user, though.

seallard commented 9 months ago

Check with production. Alternatives:

Production (represented by @karlnyr): Use a version number.

Decision regarding patching old paths

Leave as is, would need to patch scout as well.

Logic to check and patch

Notes

When storing available cases, make sure to patch the logic so that we do not duplicate data if, for whatever reason, we attempt to store the same analysis twice.
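One possible way to guard against storing the same analysis twice is to compare a content fingerprint of the deliverables with that of the latest stored version. This is only a sketch: `fingerprint` and `should_store` are hypothetical names, and where the previous fingerprint would be kept (for example a column on the version) is left open.

```python
import hashlib
from pathlib import Path
from typing import Iterable, Optional


def fingerprint(files: Iterable[Path]) -> str:
    """Hash file names and contents in a stable, name-sorted order."""
    digest = hashlib.sha256()
    for path in sorted(files, key=lambda p: p.name):
        digest.update(path.name.encode())
        digest.update(b"\0")
        # Reads each file fully; large files would need chunked reads.
        digest.update(path.read_bytes())
    return digest.hexdigest()


def should_store(new_files: Iterable[Path], latest_fingerprint: Optional[str]) -> bool:
    """Store only when the deliverables differ from the latest stored version."""
    return latest_fingerprint is None or fingerprint(new_files) != latest_fingerprint
```

A corrected rerun produces different content and would therefore still get a new version, while an accidental second store of the same results would be skipped.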

Vince-janv commented 1 month ago

Closing due to inactivity. Reopen and answer the question below if you want this prioritised.

Concerning the proposed feature:

karlnyr commented 4 weeks ago