datalad / datalad-dataverse

A DataLad (www.datalad.org) extension to work with Dataverse

"Verification of content failed" for tsv&csv files #307

Closed behinger closed 19 hours ago

behinger commented 2 months ago

Hi! We are trying to use this bridge. Pushing and cloning work fine, but I get a strange error that I cannot pinpoint.

I already followed some debugging instructions, and instead of datalad get I now run:

❯ git annex get sub-001_ses-001_scans.tsv -d

This ultimately leads to the following error:


  Verification of content failed

  Unable to access these remotes: dataverse-storage

  Maybe add some of these git remotes (git remote add ...):
        b038e39d-5bac-487c-89db-503eacf4ce7b -- recorder@CCS-RecordingLaptop:~/bids_data/2024FreeViewingMSCOCO

  (Note that these git remotes have annex-ignore set: origin)
failed

The full output is below.

I manually checked the md5sum, because I initially assumed the files would surely differ due to tabular ingest or something similar. But no: the md5 shown online, the md5 of the downloaded file, and the hash in the filename are all the same.
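
For reference, such a check can be scripted. This is a minimal sketch, assuming the usual git-annex MD5E key layout (MD5E-s<size>--<md5><.ext>) as it appears in the log below; the helper names parse_md5e_key and matches_key are hypothetical and not part of git-annex or datalad:

```python
import hashlib
import re
from pathlib import Path

# Assumed key layout for the MD5E backend:
# MD5E-s<size in bytes>--<md5 hex digest><optional file extension>
KEY_PATTERN = re.compile(r"^MD5E-s(?P<size>\d+)--(?P<md5>[0-9a-f]{32})(?P<ext>\..*)?$")


def parse_md5e_key(key: str) -> tuple[int, str]:
    """Return (size in bytes, md5 hex digest) encoded in an MD5E key."""
    match = KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"not an MD5E key: {key!r}")
    return int(match.group("size")), match.group("md5")


def matches_key(path: Path, key: str) -> bool:
    """Check that a file has both the size and the md5 recorded in the key."""
    size, md5 = parse_md5e_key(key)
    data = path.read_bytes()
    return len(data) == size and hashlib.md5(data).hexdigest() == md5
```

Calling matches_key(Path("sub-001_ses-001_scans.tsv"), "MD5E-s80--3ad5959c47a71181519fccddc7379467.tsv") on a downloaded copy compares both the size (80 bytes here) and the checksum against the key.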

Now I'm out of ideas, and maybe someone can help me figure this out.

Best, Bene

PS: I updated datalad and the plugin today.

EDIT: Only CSV and TSV files are affected, e.g. JSON works fine.

Full log:

❯ git annex get sub-001_ses-001_scans.tsv -d
[2024-05-17 13:06:27.96725] (Utility.Process) process [4067583] read: git ["--git-dir=../../.git","--work-tree=../..","--literal-pathspecs","-c","annex.debug=true","ls-files","--stage","-z","--error-unmatch","--","sub-001_ses-001_scans.tsv"]
[2024-05-17 13:06:27.968104] (Utility.Process) process [4067584] chat: git ["--git-dir=../../.git","--work-tree=../..","--literal-pathspecs","-c","annex.debug=true","cat-file","--batch-check=%(objectname) %(objecttype) %(objectsize)","--buffer"]
[2024-05-17 13:06:27.968681] (Utility.Process) process [4067585] chat: git ["--git-dir=../../.git","--work-tree=../..","--literal-pathspecs","-c","annex.debug=true","cat-file","--batch=%(objectname) %(objecttype) %(objectsize)","--buffer"]
[2024-05-17 13:06:27.969331] (Utility.Process) process [4067586] read: git ["--git-dir=../../.git","--work-tree=../..","--literal-pathspecs","-c","annex.debug=true","show-ref","git-annex"]
[2024-05-17 13:06:27.972217] (Utility.Process) process [4067586] done ExitSuccess
[2024-05-17 13:06:27.972673] (Utility.Process) process [4067587] read: git ["--git-dir=../../.git","--work-tree=../..","--literal-pathspecs","-c","annex.debug=true","show-ref","--hash","refs/heads/git-annex"]
[2024-05-17 13:06:27.975754] (Utility.Process) process [4067587] done ExitSuccess
[2024-05-17 13:06:27.976209] (Utility.Process) process [4067588] read: git ["--git-dir=../../.git","--work-tree=../..","--literal-pathspecs","-c","annex.debug=true","log","refs/heads/git-annex..cfabc61c3267fca15212d082b5e1e8d7a2b8a179","--pretty=%H","-n1"]
[2024-05-17 13:06:27.978769] (Utility.Process) process [4067588] done ExitSuccess
[2024-05-17 13:06:27.979181] (Utility.Process) process [4067589] read: git ["--git-dir=../../.git","--work-tree=../..","--literal-pathspecs","-c","annex.debug=true","log","refs/heads/git-annex..a153a0b7ae2b68ffd713d5d075c5d0535318c063","--pretty=%H","-n1"]
[2024-05-17 13:06:27.981642] (Utility.Process) process [4067589] done ExitSuccess
[2024-05-17 13:06:27.982455] (Utility.Process) process [4067591] chat: git ["--git-dir=../../.git","--work-tree=../..","--literal-pathspecs","-c","annex.debug=true","cat-file","--batch=%(objectname) %(objecttype) %(objectsize)","--buffer"]
get sub-001_ses-001_scans.tsv [2024-05-17 13:06:27.993057] (Utility.Process) process [4067593] chat: git ["--git-dir=../../.git","--work-tree=../..","--literal-pathspecs","-c","annex.debug=true","cat-file","--batch"]
(from dataverse-storage...) 
[2024-05-17 13:06:27.99924] (Utility.Process) process [4067594] chat: /home/ehinger/micromamba/envs/dataverse/bin/git-annex-remote-dataverse []
[2024-05-17 13:06:28.242246] (Annex.ExternalAddonProcess) /home/ehinger/micromamba/envs/dataverse/bin/git-annex-remote-dataverse[1] --> VERSION 1
[2024-05-17 13:06:28.242493] (Annex.ExternalAddonProcess) /home/ehinger/micromamba/envs/dataverse/bin/git-annex-remote-dataverse[1] <-- EXTENSIONS INFO GETGITREMOTENAME ASYNC
[2024-05-17 13:06:28.242818] (Annex.ExternalAddonProcess) /home/ehinger/micromamba/envs/dataverse/bin/git-annex-remote-dataverse[1] --> EXTENSIONS
[2024-05-17 13:06:28.242897] (Annex.ExternalAddonProcess) /home/ehinger/micromamba/envs/dataverse/bin/git-annex-remote-dataverse[1] <-- PREPARE
[2024-05-17 13:06:28.243105] (Annex.ExternalAddonProcess) /home/ehinger/micromamba/envs/dataverse/bin/git-annex-remote-dataverse[1] --> GETCONFIG url
[2024-05-17 13:06:28.243171] (Annex.ExternalAddonProcess) /home/ehinger/micromamba/envs/dataverse/bin/git-annex-remote-dataverse[1] <-- VALUE https://darus.uni-stuttgart.de
[2024-05-17 13:06:28.243332] (Annex.ExternalAddonProcess) /home/ehinger/micromamba/envs/dataverse/bin/git-annex-remote-dataverse[1] --> GETCONFIG doi
[2024-05-17 13:06:28.243397] (Annex.ExternalAddonProcess) /home/ehinger/micromamba/envs/dataverse/bin/git-annex-remote-dataverse[1] <-- VALUE doi:10.18419/darus-4220
[2024-05-17 13:06:28.243525] (Annex.ExternalAddonProcess) /home/ehinger/micromamba/envs/dataverse/bin/git-annex-remote-dataverse[1] --> GETCONFIG rootpath
[2024-05-17 13:06:28.243592] (Annex.ExternalAddonProcess) /home/ehinger/micromamba/envs/dataverse/bin/git-annex-remote-dataverse[1] <-- VALUE 
[2024-05-17 13:06:28.243809] (Annex.ExternalAddonProcess) /home/ehinger/micromamba/envs/dataverse/bin/git-annex-remote-dataverse[1] --> GETGITDIR
[2024-05-17 13:06:28.243863] (Annex.ExternalAddonProcess) /home/ehinger/micromamba/envs/dataverse/bin/git-annex-remote-dataverse[1] <-- VALUE ../../.git
[2024-05-17 13:06:28.288472] (Annex.ExternalAddonProcess) /home/ehinger/micromamba/envs/dataverse/bin/git-annex-remote-dataverse[1] --> GETCONFIG credential
[2024-05-17 13:06:28.288616] (Annex.ExternalAddonProcess) /home/ehinger/micromamba/envs/dataverse/bin/git-annex-remote-dataverse[1] <-- VALUE 
[2024-05-17 13:06:28.830019] (Annex.ExternalAddonProcess) /home/ehinger/micromamba/envs/dataverse/bin/git-annex-remote-dataverse[1] --> PREPARE-SUCCESS
[2024-05-17 13:06:28.830171] (Annex.ExternalAddonProcess) /home/ehinger/micromamba/envs/dataverse/bin/git-annex-remote-dataverse[1] <-- TRANSFER RETRIEVE MD5E-s80--3ad5959c47a71181519fccddc7379467.tsv ../../.git/annex/tmp/MD5E-s80--3ad5959c47a71181519fccddc7379467.tsv
[2024-05-17 13:06:28.830445] (Annex.ExternalAddonProcess) /home/ehinger/micromamba/envs/dataverse/bin/git-annex-remote-dataverse[1] --> GETSTATE MD5E-s80--3ad5959c47a71181519fccddc7379467.tsv
[2024-05-17 13:06:28.831084] (Annex.ExternalAddonProcess) /home/ehinger/micromamba/envs/dataverse/bin/git-annex-remote-dataverse[1] <-- VALUE 294785
[2024-05-17 13:06:28.896249] (Annex.ExternalAddonProcess) /home/ehinger/micromamba/envs/dataverse/bin/git-annex-remote-dataverse[1] --> TRANSFER-SUCCESS RETRIEVE MD5E-s80--3ad5959c47a71181519fccddc7379467.tsv

  Verification of content failed

  Unable to access these remotes: dataverse-storage

  Maybe add some of these git remotes (git remote add ...):
        b038e39d-5bac-487c-89db-503eacf4ce7b -- recorder@CCS-RecordingLaptop:~/bids_data/2024FreeViewingMSCOCO

  (Note that these git remotes have annex-ignore set: origin)
failed
[2024-05-17 13:06:28.89784] (Utility.Process) process [4067591] done ExitSuccess
[2024-05-17 13:06:28.897892] (Utility.Process) process [4067585] done ExitSuccess
[2024-05-17 13:06:28.89793] (Utility.Process) process [4067584] done ExitSuccess
[2024-05-17 13:06:28.89797] (Utility.Process) process [4067583] done ExitSuccess
[2024-05-17 13:06:28.962896] (Utility.Process) process [4067594] done ExitSuccess
[2024-05-17 13:06:28.963696] (Utility.Process) process [4067593] done ExitSuccess
get: 1 failed

behinger commented 2 weeks ago

Any update on this? I'm happy to spend resources on debugging this, but some tips on what to look for would be nice.

shoeffner commented 4 days ago

I was able to reproduce this issue, find its root cause, and patch around it.

Root cause: During upload, Dataverse converts various tabular data formats (R, CSV, and more) to its own tab-separated .tab file. See also https://guides.dataverse.org/en/6.2/api/dataaccess.html#all-formats-bundled-download-for-tabular-files, and the very beginning of the guides:

By default, tabular files are downloaded in their “archival” form (tab-separated values). To download the original files (Stata, for example), add format=original as a query parameter.
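
To illustrate the query parameter, here is a minimal sketch that only builds the download URL. It assumes the documented Data Access API endpoint /api/access/datafile/<id>; the base URL is taken from the debug log above, and I am assuming that the GETSTATE value 294785 from the log is the Dataverse file id. The helper name datafile_url is hypothetical:

```python
from urllib.parse import urlencode


def datafile_url(base_url: str, file_id: int, original: bool = True) -> str:
    """Build a Data Access API download URL for a file, addressed by its database id."""
    url = f"{base_url.rstrip('/')}/api/access/datafile/{file_id}"
    if original:
        # format=original requests the uploaded bytes instead of the archival tab-separated version
        url += "?" + urlencode({"format": "original"})
    return url


# Base URL from the log; the file id is an assumption (GETSTATE value in the log).
print(datafile_url("https://darus.uni-stuttgart.de", 294785))
# prints https://darus.uni-stuttgart.de/api/access/datafile/294785?format=original
```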

In the case of your tsv files, the download will probably differ only slightly, but for csv files it differs greatly. Either way, the hashes of the uploaded and the downloaded file will not match.
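
A small sketch of why the checksum breaks. The conversion here is only a stand-in (re-serialising the same rows with tabs), not Dataverse's actual ingest, and the sample data is made up:

```python
import csv
import hashlib
import io

# Made-up sample file, as it would be uploaded
original = b"subject,age\nsub-001,23\nsub-002,31\n"

# Stand-in for a tabular conversion: write the same rows tab-separated
rows = list(csv.reader(io.StringIO(original.decode("utf-8"))))
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
converted = buffer.getvalue().encode("utf-8")

# Same table content, different bytes, hence different md5
print(hashlib.md5(original).hexdigest() == hashlib.md5(converted).hexdigest())  # False
```

The table content is identical in both versions, but git-annex verifies the bytes, so any re-serialisation on the server side makes the verification fail.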

So the solution is to set the format parameter. I believe for datalad-dataverse it should almost always be original, so we can change the call in dataset.py:

From e611697f466e605d8b21505a085de18380f109e6 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Sebastian=20H=C3=B6ffner?= <info@sebastian-hoeffner.de>
Date: Tue, 16 Jul 2024 03:46:22 +0200
Subject: [PATCH] Always use the original file format when downloading files.

Otherwise, the files will not have the correct hashes, as dataverse replaces, e.g., csv or R table files with its own tabular data format.
---
 datalad_dataverse/dataset.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/datalad_dataverse/dataset.py b/datalad_dataverse/dataset.py
index 7fb8f52..94d7d74 100644
--- a/datalad_dataverse/dataset.py
+++ b/datalad_dataverse/dataset.py
@@ -171,7 +171,7 @@ class OnlineDataverseDataset:
         # https://github.com/gdcc/pyDataverse/issues/49
         # the code below is nevertheless readied for such a
         # scenario
-        response = self.data_access_api.get_datafile(fid)
+        response = self.data_access_api.get_datafile(fid, data_format="original")
         # http error handling
         response.raise_for_status()
         with path.open("wb") as f:
-- 
2.45.2

I can create a proper PR and the reproduction steps tomorrow, if you want. Until then, @behinger, maybe you want to fix your local copy with that patch and see if it solves your problem?

While I was working on this issue, I noticed a few other minor issues: the demo.dataverse.org instance did not work with is_pid for me, so I had to set it to False manually; httpx responses do not provide iter_content(...) but iter_bytes(...) instead; and the ok property known from requests is missing on the response object. I haven't looked at the history, but I guess this is because I used the main branch and not a release. Should I create follow-up issues?

adswa commented 4 days ago

Sorry all for the lack of responses and time, and many, many thanks for the analysis and patch! We've been struggling with how to handle Dataverse API changes in recent versions while maintaining compatibility with previous APIs (e.g., our own institution's Dataverse is still on v4.x, and the format parameter seems to have been introduced in v5.x). I'm adding this issue to our weekly meeting for a renewed discussion of how best to approach this. Thanks both for creating momentum here. A PR and issues would be most welcome, though; the current state is clearly not ideal.

shoeffner commented 3 days ago

I have now created a couple of PRs to fix some of the issues. However, some of the changes might no longer be compatible with Dataverse 4.x, maybe not even with 5.x. I am happy to change the PRs to try to be compatible with all versions: 4, 5, and 6. Unfortunately, the demo instance runs Dataverse 6, and the Docker containers seem to offer only version 6 as well, so it would require quite some setup to test against older instances. It's unfortunate that, even though Dataverse offers a versioned API, they do not seem to have made much use of it, not even for the documented breaking changes.

I am looking forward to the outcome of your weekly!