Open mgamble opened 1 year ago
Hey, thanks for raising this issue! I've never tried to create a RIA store on a crippled filesystem and access it from a non-crippled filesystem, but I know that cloning dataset checkouts in adjusted mode (which they are in if created on a crippled filesystem) to a non-crippled filesystem requires tweaks (https://knowledge-base.psychoinformatics.de/kbi/0010/index.html).
I don't currently have access to a file system like the one on your HPC, so I'm just dumping a few speculations and observations here so others can join the brainstorming until I or someone else can try it out. My initial suspicion was that (similar to the problem with dataset checkouts) the datasets in the RIA store reference the adjusted branch, which wouldn't get created on the non-crippled file system, but I see that your non-HPC clone references the master branch correctly.
I see that the git-annex branch on the HPC system is on a different commit than the git-annex branch on your local machine.
I would be curious to learn what git-annex knows about the dataset on the crippled file system. On your local machine, in the clone, could you run the following and report back?
git-annex whereis lqad051copy.pdf
git annex info <name-of-the-remote>     (for the storage remote)
git cat-file -p git-annex:remote.log     (maybe, for completeness)
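To collect all three outputs in one go, a small shell helper along these lines might be convenient. It is only a sketch: the function name collect_annex_diagnostics is made up for illustration, and ria-storage is merely an assumed default for the storage remote's name, so substitute whatever `git remote` lists in your clone.

```shell
#!/bin/sh
# Sketch: gather what git-annex knows about one file and one remote.
# collect_annex_diagnostics is a hypothetical helper, not a DataLad or git-annex command.
# Usage: collect_annex_diagnostics <annexed-file> [storage-remote]
collect_annex_diagnostics() {
    file=$1
    remote=${2:-ria-storage}   # assumption: name of the storage remote

    # Each command's stderr is folded into the report so failures are visible too.
    echo "## git annex whereis $file"
    git annex whereis "$file" 2>&1 || echo "(command failed)"

    echo "## git annex info $remote"
    git annex info "$remote" 2>&1 || echo "(command failed)"

    echo "## git cat-file -p git-annex:remote.log"
    git cat-file -p git-annex:remote.log 2>&1 || echo "(command failed)"
}

# Example call, run from inside the clone:
# collect_annex_diagnostics lqad051copy.pdf ria-storage > annex-report.txt
```

The resulting annex-report.txt can be pasted into a comment as is.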
Thanks in advance!
The combination of crippled FS to non-crippled FS and RIA is certainly... challenging. The former is already suboptimal, and RIA adds a few unknowns for me on top. RIA was developed more or less Unix-first, and there is no test case like the one you describe. Originally, we had scheduled work for this month to make the feature more robust, with better crippled-filesystem tests and support in mind, but academic duties threw our schedules off. Still, getting to the bottom of this issue could be really useful to avoid repeating past mistakes with that code base. I'll see if I can find myself a setup that mirrors yours.
I sadly wasn't able to reproduce this by simply mounting a local drive with (a random) crippled filesystem.
Creating a RIA store on a crippled FS (an external hard drive with exFAT):
(fdm-werkstatt) adina@muninn in /media/adina/exfat
❱ datalad create my-dataset-test
[INFO ] Detected a filesystem without fifo support.
| Disabling ssh connection caching.
[INFO ] Detected a crippled filesystem.
[INFO ] Entering an adjusted branch where files are unlocked as this filesystem does not support locked files.
[INFO ] Switched to branch 'adjusted/master(unlocked)'
create(ok): /media/adina/exfat/my-dataset-test (dataset)
(fdm-werkstatt) adina@muninn in /media/adina/exfat
❱ cd my-dataset-test
(fdm-werkstatt) adina@muninn in /media/adina/exfat/my-dataset-test on git:adjusted/master(unlocked)
❱ echo 12345 > file
(fdm-werkstatt) adina@muninn in /media/adina/exfat/my-dataset-test on git:adjusted/master(unlocked)
❱ datalad save
add(ok): file (file)
save(ok): . (dataset)
action summary:
add (ok: 1)
save (ok: 1)
(fdm-werkstatt) adina@muninn in /media/adina/exfat/my-dataset-test on git:adjusted/master(unlocked)
❱ datalad create-sibling-ria -s ria --new-store-ok 'ria+file:///media/adina/exfat/my-ria-test'
[INFO ] create siblings 'ria' and 'ria-storage' ...
[INFO ] Fetching updates for Dataset(/media/adina/exfat/my-dataset-test)
update(ok): . (dataset)
update(ok): . (dataset)
[INFO ] Configure additional publication dependency on "ria-storage"
configure-sibling(ok): . (sibling)
create-sibling-ria(ok): /media/adina/exfat/my-dataset-test (dataset)
action summary:
configure-sibling (ok: 1)
create-sibling-ria (ok: 1)
update (ok: 1)
(fdm-werkstatt) adina@muninn in /media/adina/exfat/my-dataset-test on git:adjusted/master(unlocked)
❱ datalad push --to ria
copy(ok): file (file) [to ria-storage...]
publish(ok): . (dataset) [refs/heads/master->ria:refs/heads/master [new branch]]
publish(ok): . (dataset) [refs/heads/git-annex->ria:refs/heads/git-annex [new branch]]
action summary:
copy (ok: 1)
publish (ok: 2)
Cloning from exfat to ext4:
(fdm-werkstatt) adina@muninn in /tmp
❱ datalad clone 'ria+file:///media/adina/exfat/my-ria-test#82330f7e-140c-40e4-b5dc-febdb6098c85' from-exfat
[INFO ] Configure additional publication dependency on "ria-storage"
configure-sibling(ok): . (sibling)
install(ok): /tmp/from-exfat (dataset)
action summary:
configure-sibling (ok: 1)
install (ok: 1)
(fdm-werkstatt) adina@muninn in /tmp
❱ cd from-exfat
(fdm-werkstatt) adina@muninn in /tmp/from-exfat on git:master
❱ ls
file
(fdm-werkstatt) adina@muninn in /tmp/from-exfat on git:master
❱ datalad get file
get(ok): file (file) [from ria-storage...]
So it's not the crippled file system per se that creates the problem. It might be the specific one you're running, or maybe an interaction with the protocol through which it is cloned...
Can you let me know which filesystem your HPC system runs? datalad wtf reports this when you install psutil.
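If datalad wtf is not an option, the filesystem type and a few of its limitations can also be checked by hand. The following is a rough sketch only: probe_fs is a hypothetical helper, its three checks approximate but are not git-annex's own probe, and the stat flags assume GNU coreutils on Linux.

```shell
#!/bin/sh
# Sketch: report the filesystem type of a directory and run rough checks for
# features a "crippled" filesystem tends to lack (permissions, symlinks, FIFOs).
# probe_fs is a hypothetical name; this is NOT the probe git-annex itself runs.
# Usage: probe_fs [directory]
probe_fs() {
    dir=${1:-.}
    # GNU coreutils stat; the flags differ on macOS/BSD.
    echo "type: $(stat -f -c %T "$dir")"

    tmp=$(mktemp -d "$dir/fsprobe.XXXXXX") || return 1

    # 1) Do permission bits stick? Drop write permission and read the mode back.
    : > "$tmp/f"
    chmod 444 "$tmp/f" 2>/dev/null
    if [ "$(stat -c %a "$tmp/f")" = "444" ]; then
        echo "chmod: ok"
    else
        echo "chmod: not honored"
    fi

    # 2) Can symlinks be created?
    if ln -s f "$tmp/link" 2>/dev/null; then
        echo "symlink: ok"
    else
        echo "symlink: unsupported"
    fi

    # 3) Can FIFOs be created?
    if mkfifo "$tmp/fifo" 2>/dev/null; then
        echo "fifo: ok"
    else
        echo "fifo: unsupported"
    fi

    # Clean up the scratch directory.
    chmod 644 "$tmp/f" 2>/dev/null
    rm -rf "$tmp"
}

# Example: probe_fs /path/to/dataset/on/hpc
```

git-annex also records its own verdict in the repository: running `git config annex.crippledfilesystem` inside the dataset on the HPC prints true if it considered the filesystem crippled at initialization.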
And I assume that you clone the datasets via ssh?
Hey @mgamble, just a quick ping: do you have any additional info (see comment above)? I'm still struggling to reproduce this.
What is the problem?
Our HPC uses a "crippled" Linux filesystem (read/write permissions are not honored). I have no idea why, but I hear this is common for academic HPCs. If you create a DataLad dataset with annexed content on this HPC, attach and push to a RIA sibling that is also on the HPC, and then clone it to a local Linux system (not crippled), you cannot 'get' the annexed content on the local system, even though the dataset seems to clone just fine.
This is the error: "get(error): lqad051copy.pdf (file) [not available; (Note that these git remotes have annex-ignore set: origin)]"
What steps will reproduce the problem?
1) Create a dataset on a crippled filesystem and add an annexed file.
2) Save the dataset.
3) Attach and push to a RIA sibling (on the crippled filesystem).
4) Clone the dataset onto a local Linux machine (non-crippled filesystem).
5) Attempt to 'get' the annexed content (won't work).
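Spelled out as commands, the steps above might look roughly like the sketch below. The function names, the clone directory name, the paths, and the ria+ssh URL in the comments are placeholders rather than the exact values used on the HPC. It is also an assumption that the clone goes through ssh. By default the script only prints the commands (DRY_RUN=1); set DRY_RUN=0 to actually run them.

```shell
#!/bin/sh
# Sketch of the five reproduction steps. All paths, the clone directory name
# and the example URL are placeholders (assumptions), not values from the report.
# With DRY_RUN=1 (the default here) commands are printed instead of executed.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# Steps 1-3, to be run on the HPC (crippled filesystem).
reproduce_on_hpc() {
    ds=$1      # dataset path on the crippled filesystem
    store=$2   # RIA store path on the same filesystem
    file=$3    # some existing file to add to the dataset
    run datalad create "$ds"                              # step 1: create dataset
    run cp "$file" "$ds/"                                 # step 1: add a file
    run datalad save -d "$ds"                             # step 2: save
    run datalad create-sibling-ria -d "$ds" -s ria --new-store-ok "ria+file://$store"  # step 3
    run datalad push -d "$ds" --to ria                    # step 3: push
}

# Steps 4-5, to be run on the local machine (non-crippled filesystem).
reproduce_locally() {
    url=$1     # e.g. ria+ssh://user@hpc.example.org/path/to/store#<dataset-id>
    name=$2    # file name inside the dataset
    run datalad clone "$url" clone-from-hpc               # step 4: clone
    run datalad get -d clone-from-hpc "clone-from-hpc/$name"   # step 5: get (fails in the report)
}
```

The dataset ID needed after the # in the clone URL is stored in the dataset itself and can be read with `git -C <dataset-path> config -f .datalad/config datalad.dataset.id`.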
DataLad information
HPC specs: (local machine specs below)
datalad 0.18.4
WTF
configuration <SENSITIVE, report disabled by configuration>
credentials
datalad
dataset
dependencies
environment
extensions
git-annex
location
python
system
local machine specs
datalad 0.18.3
WTF
configuration <SENSITIVE, report disabled by configuration>
credentials
datalad
dataset
dependencies
environment
extensions
git-annex
location
python
system
Additional context
I love datalad but this is a big problem for my group, as we use both HPC and local computation and need a system that can work in both environments.
Thanks for working on this project and making such important contributions to reproducibility in academic science.
Have you had any success using DataLad before?
Yes, I've been using DataLad for about a year. It has been a steep learning curve, but I have been able to make it work for me in most contexts. For example, any other usage of 'datalad get' works fine for me. I can create a dataset on my local machine and push it to a RIA sibling on the HPC. Then I can clone it from the RIA store to my HPC home directory and it works just fine. I can also clone it from the RIA store to another local machine, and it works fine there as well. In addition, I can add content to a dataset that was cloned onto the HPC but originally made on a local machine, push it to the RIA store, and update the dataset on the local machine, and 'get' works just fine in that context.