Clinical-Genomics / housekeeper

File data orchestrator
MIT License

Adding files to an included bundle that are already in the target. #129

Open karlnyr opened 1 year ago

karlnyr commented 1 year ago

I attempted to add a file to a bundle but the file already existed in the path. A user should be able to add the file if it already exists within the bundle path.

For example, take file_1 in bundle_1, which has the root /home/housekeeper-bundles and a version from June 2nd, 2023:

ls -l /home/housekeeper-bundles/bundle_1/2023-06-02/
file_1

When adding the file to the already included bundle, should it not just add the file link to the database?

henrikstranneheim commented 1 year ago

What behavior do we want?

karlnyr commented 1 year ago

Keep the original and include it in the database. I don't mind it being a force flag really - I believe that this situation only happens for manual stuff - so a force might be useful :)

ChrOertlin commented 1 year ago

Intuitively I would think that there should not be any files present in the housekeeper directories if they have not been added through the API.

Can we clarify the manual stuff this happens with? @karlnyr

ChrOertlin commented 1 year ago

Moving the description over from a duplicate issue:

Description: housekeeper add file fails, stating that the file already exists. However, when the bundle is retrieved with housekeeper get bundle, it is shown to be empty. The file is present in a version directory of the bundle, so it should be listed for the bundle. It cannot be retrieved with housekeeper get file either.

The command below was run in the /home/proj/production/housekeeper-bundles/ADM1091A3/2018-06-05 directory:

for f in ; do housekeeper add file -t fastq -t H9GA6ADXX -b ADM1091A3 ./${f}; done
2023-06-13 09:47:51 hasta.scilifelab.se housekeeper.cli.core[37109] INFO Use root path /home/proj/production/housekeeper-bundles
2023-06-13 09:47:51 hasta.scilifelab.se housekeeper.cli.add[37109] INFO Running add file
2023-06-13 09:47:51 hasta.scilifelab.se housekeeper.store.api.handlers.read[37109] INFO Fetching bundle with name: ADM1091A3
Traceback (most recent call last):
  File "/home/proj/production/bin/miniconda3/envs/P_main/bin/housekeeper", line 8, in <module>
    sys.exit(base())
  File "/home/proj/production/bin/miniconda3/envs/P_main/lib/python3.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/proj/production/bin/miniconda3/envs/P_main/lib/python3.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/proj/production/bin/miniconda3/envs/P_main/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/proj/production/bin/miniconda3/envs/P_main/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/proj/production/bin/miniconda3/envs/P_main/lib/python3.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/proj/production/bin/miniconda3/envs/P_main/lib/python3.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/proj/production/bin/miniconda3/envs/P_main/lib/python3.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/proj/production/bin/miniconda3/envs/P_main/lib/python3.7/site-packages/housekeeper/cli/add.py", line 124, in file_cmd
    link_to_relative_path(version=version, file_path=file_path, root_path=context.obj[ROOT])
  File "/home/proj/production/bin/miniconda3/envs/P_main/lib/python3.7/site-packages/housekeeper/include.py", line 63, in link_to_relative_path
    link_file(file_path=file_path, new_path=housekeeper_path, hardlink=True)
  File "/home/proj/production/bin/miniconda3/envs/P_main/lib/python3.7/site-packages/housekeeper/include.py", line 19, in link_file
    os.link(file_path.resolve(), new_path)
FileExistsError: [Errno 17] File exists: '/home/proj/production/housekeeper-bundles/ADM1091A3/2018-06-05/ADM1091A3_L001_R1_001.fastq.gz' -> '/home/proj/production/housekeeper-bundles/ADM1091A3/2018-06-05/ADM1091A3_L001_R1_001.fastq.gz'

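The final line of the traceback shows the same path on both sides of the arrow, which suggests the failure comes from hard linking a file onto itself. A minimal sketch that reproduces this outside housekeeper, assuming only the standard library; the function name reproduce_link_error and the temporary file are hypothetical stand-ins, not housekeeper code:

```python
import os
import tempfile
from pathlib import Path


def reproduce_link_error() -> str:
    """Hard link a file onto its own path and return the exception class name."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical stand-in for a file already sitting in the version directory.
        existing = Path(tmp) / "file_1"
        existing.write_text("data")
        try:
            # Source and destination resolve to the same path, as in the traceback.
            os.link(existing.resolve(), existing)
        except FileExistsError as error:
            return type(error).__name__
    return "no error"
```

os.link refuses to create a link when the destination name already exists, regardless of whether it is the same file as the source.
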
beatrizsavinhas commented 1 year ago

Though I agree that ideally we should avoid manually modifying the database, I have also run into this issue and wondered whether a --force or --skip-hard-linking flag would be useful for manual work. The problem arose in two situations: when manually processing old flow cells that are stored on disk but have files missing from the housekeeper bundle, and when filtering vcf files from balsamic cases that fail for having too many variants. In both, it is often more straightforward to find the necessary input files already in the housekeeper bundle. I found a workaround: move or generate the files in a different directory and then add them to housekeeper.

Vince-janv commented 1 year ago

Suggested solution: before hard linking the file, check whether a file is already present at the target path. If so, use Path.samefile() to compare the two. If it returns True, only add the file to the database. This might be a bit cumbersome, but it avoids overwriting anything already in the bundle directory.
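
A minimal sketch of that check, assuming only the standard library. The function name link_unless_present and its return convention are hypothetical; link_file in housekeeper/include.py appears in the traceback, but its body is not shown in this thread, so this is not its implementation:

```python
import os
from pathlib import Path


def link_unless_present(file_path: Path, new_path: Path) -> bool:
    """Hard link file_path to new_path unless new_path already is that file.

    Returns True if a new link was created and False if new_path already
    refers to the same file. Raises FileExistsError if a different file
    occupies new_path, so nothing in the bundle directory is overwritten.
    """
    if new_path.exists():
        if new_path.samefile(file_path):
            # Same file on disk: skip linking; the caller only records it in the database.
            return False
        raise FileExistsError(f"{new_path} exists and differs from {file_path}")
    os.link(file_path.resolve(), new_path)
    return True
```

With a helper like this, the caller could add the database entry in both the True and False cases and only fail when a different file is in the way.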