HumanCellAtlas / dcp-cli

DEPRECATED - HCA Data Coordination Platform Command Line Interface
https://hca.readthedocs.io/
MIT License

dss download-manifest throws exceptions on duplicate files #485

Closed chmreid closed 4 years ago

chmreid commented 4 years ago

I am running the hca dss download-manifest command with a manifest and the --no-data flag. Early in the process, the hca utility begins to throw many FileExistsError exceptions, apparently because it is attempting to write the same JSON metadata files over and over.

Here is the command I am using to download the manifest:

hca dss download-manifest --manifest pancreas-female-short.tsv --replica aws --layout bundle --no-data

using pancreas-female.tsv.zip and shortening it via

head -n 151 pancreas-female.tsv > pancreas-female-short.tsv

This begins the download process, but at some point begins raising many FileExistsErrors like so:

INFO:hca:Skipping download of 'project_0.json' because it already exists at '.hca/v2/files_2_4/62/474d/62474de4b9fea0f4e87cc3e5bfe5fab858c3465d44c45f46568ae63524afdb4d'.
WARNING:hca:Download task failed: DSSFile(name='project_0.json', uuid='cddab57b-6868-4be4-806f-395ed9dd635a', version='2019-05-10T142452.095000Z', sha256='62474de4b9fea0f4e87cc3e5bfe5fab858c3465d44c45f46568ae63524afdb4d', size=6050, indexed=True, replica='aws')
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/Users/charles/Downloads/test/vp/lib/python3.7/site-packages/hca/dss/__init__.py", line 581, in _download_and_link_to_filestore
    hardlink(file_store_path, file_path)
  File "/Users/charles/Downloads/test/vp/lib/python3.7/site-packages/hca/dss/util/__init__.py", line 44, in hardlink
    os.link(source, link_name)
FileExistsError: [Errno 17] File exists: '.hca/v2/files_2_4/62/474d/62474de4b9fea0f4e87cc3e5bfe5fab858c3465d44c45f46568ae63524afdb4d' -> '0d0d4aa1-7e35-44bd-8949-fcc6bae92dfd.2019-05-14T083819.435000Z/project_0.json'

INFO:hca:Skipping download of 'library_preparation_protocol_0.json' because it already exists at '.hca/v2/files_2_4/01/250f/01250ff5b0fcda00e8bc203e9dae7942b456f8c23bb8af4215c353367d1ad15a'.
WARNING:hca:Download task failed: DSSFile(name='library_preparation_protocol_0.json', uuid='3ab6b486-f900-4f70-ab34-98859ac5f77a', version='2019-05-10T142253.537000Z', sha256='01250ff5b0fcda00e8bc203e9dae7942b456f8c23bb8af4215c353367d1ad15a', size=928, indexed=True, replica='aws')
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/Users/charles/Downloads/test/vp/lib/python3.7/site-packages/hca/dss/__init__.py", line 581, in _download_and_link_to_filestore
    hardlink(file_store_path, file_path)
  File "/Users/charles/Downloads/test/vp/lib/python3.7/site-packages/hca/dss/util/__init__.py", line 44, in hardlink
    os.link(source, link_name)
FileExistsError: [Errno 17] File exists: '.hca/v2/files_2_4/01/250f/01250ff5b0fcda00e8bc203e9dae7942b456f8c23bb8af4215c353367d1ad15a' -> '0d0d4aa1-7e35-44bd-8949-fcc6bae92dfd.2019-05-14T083819.435000Z/library_preparation_protocol_0.json'

This happens with files with names like sequencing_protocol_0.json, library_preparation_protocol_0.json, project_19.json, analysis_file_30.json, etc.

While I haven't narrowed down the cause of the issue, I believe it is because the hard links that the dcp-cli is creating are links to files with the same names, and the links are all being put into the same folder, so there are naming conflicts, causing the FileExistsError exceptions.
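The naming-conflict hypothesis above can be illustrated in isolation. This is only a minimal sketch, not the dcp-cli code: the file names and the helper `link_twice` are made up for the example. It shows that `os.link` raises FileExistsError (errno 17) as soon as the destination name is already taken, whichever source it was linked from.

```python
import os
import tempfile


def link_twice(workdir):
    """Hard-link two different sources to the same destination name.

    Hypothetical helper for illustration; returns the errno of the
    second os.link call, or None if it unexpectedly succeeds.
    """
    src_a = os.path.join(workdir, "store_a")
    src_b = os.path.join(workdir, "store_b")
    dest = os.path.join(workdir, "project_0.json")
    for path in (src_a, src_b):
        with open(path, "w") as f:
            f.write(path)
    # The first link succeeds because the destination name is free.
    os.link(src_a, dest)
    try:
        # The second link targets the same name and is rejected.
        os.link(src_b, dest)
    except FileExistsError as e:
        return e.errno
    return None


with tempfile.TemporaryDirectory() as d:
    result = link_twice(d)
```

`os.link` never overwrites an existing destination, so any second link to an occupied name fails this way.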


To reproduce:

mkdir temp && cd temp
virtualenv vp -p python3.7 && source vp/bin/activate
pip install hca
wget https://github.com/HumanCellAtlas/dcp-cli/files/3611654/pancreas-female.tsv.zip && unzip pancreas-female.tsv.zip
head -n 151 pancreas-female.tsv > pancreas-female-short.tsv
hca dss download-manifest --manifest pancreas-female-short.tsv --replica aws --layout bundle --no-data

jessebrennan commented 4 years ago

The fix here is to not raise this exception if the file is already linked in the filestore.
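As a rough sketch of what "not raising if the file is already linked" could look like (this is not the actual dcp-cli implementation; `hardlink_if_needed` is a hypothetical name), one option is to catch the FileExistsError and re-raise only when the existing file is not the same inode as the filestore entry:

```python
import os
import tempfile


def hardlink_if_needed(source, link_name):
    """Link source to link_name, tolerating an existing link to source.

    Hypothetical helper for illustration only.
    """
    try:
        os.link(source, link_name)
    except FileExistsError:
        # samefile compares device and inode, so it is True only when
        # link_name is already a hard link to source.
        if not os.path.samefile(source, link_name):
            raise


with tempfile.TemporaryDirectory() as d:
    store = os.path.join(d, "store_entry")
    dest = os.path.join(d, "project_0.json")
    with open(store, "w") as f:
        f.write("{}")
    hardlink_if_needed(store, dest)
    hardlink_if_needed(store, dest)  # second call is a no-op
    link_count = os.stat(store).st_nlink
```

A file that exists at the destination but is not linked to the filestore entry still raises, which keeps the protection against overwriting user-created files.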

jessebrennan commented 4 years ago

My previous comment https://github.com/humancellatlas/dcp-cli/issues/485#issuecomment-555750738 is not actually relevant. This issue has the same underlying cause as #450. Therefore it is being resolved in PR #477.

@chmreid your steps to reproduce are not completely correct. Running the script once is not sufficient to trigger this bug. Instead, you have to run the script twice in a row. The second time, you should expect to see these errors.

The cause is that if multiple threads are downloading the same file at the same time, the last thread to finish will overwrite the filestore entry, thus orphaning the links made to the previous entry by the threads that finished first. When the script is run a second time, it encounters these orphaned files, sees that they are not linked in the filestore, assumes that the user created them, and fails so as to avoid overwriting a user-created file.
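The orphaning described above can be reproduced sequentially, without real threads. This is a simplified sketch under assumptions: the paths and the helper `simulate_overwrite` are hypothetical, and it assumes the overwrite happens by renaming a freshly written file over the filestore entry, which may differ from what dcp-cli actually does.

```python
import os
import tempfile


def simulate_overwrite(workdir):
    """Show how replacing a filestore entry orphans earlier hard links.

    Hypothetical helper; the two "threads" are simulated one after
    the other.
    """
    store = os.path.join(workdir, "store_entry")
    dest = os.path.join(workdir, "project_0.json")

    # First "thread" finishes: writes the entry and links it out.
    with open(store, "w") as f:
        f.write("{}")
    os.link(store, dest)
    linked_before = os.path.samefile(store, dest)

    # Second "thread" finishes later: replaces the entry with a new
    # file, which gives the filestore path a new inode.
    tmp = store + ".tmp"
    with open(tmp, "w") as f:
        f.write("{}")
    os.replace(tmp, store)

    # dest still points at the old inode, so it is no longer linked.
    linked_after = os.path.samefile(store, dest)
    return linked_before, linked_after


with tempfile.TemporaryDirectory() as d:
    linked_before, linked_after = simulate_overwrite(d)
```

After the replacement, the destination file has the same content as the filestore entry but a different inode, so a later run cannot tell it apart from a file the user created.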

tl;dr: closing in favor of #450

jessebrennan commented 4 years ago

@hannes-ucsc asked that I reopen this so that you can track the progress of your issue.

chmreid commented 4 years ago

was this fixed? there is no mention of a fix in #450

it appears to be fixed, when I try to reproduce the error I only see info messages like INFO:hca:Skipping download of 'process_2.json' because it already exists at '.hca/v2/files_2_4/27/86bd/2786bdd0fa9ff3607184dae6e1340cb6073dd10aa599455791e5e752a8221506'. without the corresponding warning + traceback.

jessebrennan commented 4 years ago

Yes. I should have mentioned this ticket in the PR, but I forgot. My comment above still stands though, #450 still fixes the issue. The fix is in the 7.0.0 release which is why we closed the issue.