datalad / datalad-ebrains

DataLad extension to interface with the neuroinformatics platform of the Human Brain Project

Unrecognized file repository pointer for private dataset in ebrains #58

Open alexisthual opened 1 year ago

alexisthual commented 1 year ago

Hi!

First thank you for the nice extension :blush: We (@bthirion @ferponcem @man-shu @ymzayek) are interested in downloading this dataset from ebrains: https://search.kg.ebrains.eu/instances/07ab1665-73b0-40c5-800e-557bc319109d

Although we authenticated with export KG_AUTH_TOKEN=`datalad ebrains-authenticate`, we still could not get the following command to work: datalad ebrains-clone 07ab1665-73b0-40c5-800e-557bc319109d ibc-test

The traceback is the following:

[INFO   ] scanning for unlocked files (this may take some time)
ebrains-clone(impossible): [Unrecognized file repository pointer https://data-proxy.ebrains.eu/api/v1/buckets/d-07ab1665-73b0-40c5-800e-557bc319109d]
save(ok): . (dataset)
action summary:
  ebrains-clone (impossible: 1)
  save (ok: 1)

Maybe we're missing something here. Happy to contribute to the docs if someone can help us find a solution to this! Thanks

alexisthual commented 1 year ago

We tried the same command today and got a code 500 error: [ERROR ] Error: code=500 message='Internal Server Error' uuid=None

mih commented 1 year ago

Hey, thanks for giving it a go!

re https://github.com/datalad/datalad-ebrains/issues/58#issue-1601394399: it looks a bit as if this was attempted with code prior to https://github.com/datalad/datalad-ebrains/commit/736f542386bf6e24f557cbd15ac18d6c3822b7b0 -- if that is true, then updating to the most recent dev-snapshot should fix this particular issue. Please let me know.

We tried the same command today and got a code 500 error:

Sadly, this looks like https://github.com/datalad/datalad-ebrains/issues/36 -- there is no fix that I am aware of other than time. This situation typically lasts for a few days, and then the query endpoint (I assume) comes back to life.

I can replicate the behavior you are seeing. The error is happening here:

dv = omcore.DatasetVersion.from_id(id, self.client)
target_version = dv.uuid
# determine the Dataset from the DatasetVersion we got
ds = omcore.Dataset.list(self.client, versions=dv)[0]  # <- fails with HTTP 500

where dv is a DatasetVersion() instance with dv.uuid='07ab1665-73b0-40c5-800e-557bc319109d'.

As you can see, both queries in the snippet run through fairgraph; the first one succeeds, the second one causes an HTTP 500.

However, as this is only happening occasionally, albeit still annoyingly frequently, neither datalad-ebrains nor fairgraph seems to be at fault here (at least given my superficial understanding).

Maybe you could consider bringing this up in https://github.com/HumanBrainProject/fairgraph/issues, or some ebrains support channel?

mih commented 1 year ago

Oh, looking at the test runs of https://github.com/datalad/datalad-ebrains/pull/59 from Mar 3, it seems that the outage is already a few days long. That is the longest observed so far.

alexisthual commented 1 year ago

Thank you for the pointers, I'll try using the latest version!

Just to let you know, we also tried accessing the aforementioned dataset today using siibra, and it worked well. My shallow understanding is that it also uses fairgraph under the hood, so it was hard for us to fully understand what the real problem was here.

apdavison commented 1 year ago

I've looked into this a bit with Oliver Schmid, the KG product owner. It seems likely that this problem originates because datalad is talking to the pre-production KG server (kg-ppd). This is the default for fairgraph (the motivation being that people should test their scripts against PPD before running against the production server), but this is not well documented, for which I apologise.

The fix would be here: https://github.com/datalad/datalad-ebrains/blob/main/datalad_ebrains/fairgraph_query.py#L34

self.client = KGClient(host="core.kg.ebrains.eu")
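
For context, a minimal sketch of what that change amounts to, assuming the fairgraph API (and assuming the token is picked up from the KG_AUTH_TOKEN environment variable set earlier in this thread):

from fairgraph import KGClient

# target the production Knowledge Graph instead of fairgraph's
# default pre-production server (kg-ppd)
client = KGClient(host="core.kg.ebrains.eu")
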
alexisthual commented 1 year ago

Oh great, thanks a lot for the investigation! Should someone open a PR with this fix?

mih commented 1 year ago

Thx @apdavison for determining the cause!

@alexisthual If you want to prep a quick PR that would be much appreciated. I have put it on my TODO otherwise. TIA!

alexisthual commented 1 year ago

Unfortunately, even when I use the latest commits pushed to main, the problem described in this issue is still there.

I tried 3 different commands, and all of them yield the same error as the one I reported in the first message of this issue:

[INFO   ] scanning for unlocked files (this may take some time) 
ebrains-clone(impossible): [Unrecognized file repository pointer https://data-proxy.ebrains.eu/api/v1/buckets/d-07ab1665-73b0-40c5-800e-557bc319109d]
save(ok): . (dataset)
action summary:
  ebrains-clone (impossible: 1)
  save (ok: 1)

Moreover, trying to access the link present in the error from my browser yields

{"status_code":401,"detail":"You are not authenticated. This resource might require authenticated access - please retry with providing an authentication header."}

which is probably normal since I didn't explicitly provide a token.

mih commented 1 year ago

Thanks for looking into it. I had a closer look, and the dataset's files are hosted "behind" the human data gateway. To my knowledge, there is no programmatic way to access such data directly. It involves requesting access by clicking a button on the web UI, receiving an email, and clicking a link in that email.

Because of these complications, I had not attempted to check whether programmatic access is possible afterwards (also because the access permissions only last for 24h, so testing such functionality on CI is not easily possible).

I have now requested and received access to this dataset, and will have a look.

mih commented 1 year ago

I have posted https://github.com/datalad/datalad-ebrains/pull/61 with a code sketch and my findings. After rectifying superficial issues in datalad-ebrains code, the next blocker is an empty fairgraph report on the files contained in this dataset.

If you happened to have any insight in this, please let me know. Thx!

alexisthual commented 1 year ago

Thanks for looking into this Michael! I was actually able to fetch the dataset through siibra so I thought there should be a way for us to do this with datalad-ebrains. In particular, I could get the list of available files and fetch one file. Should we try to get some inspiration from what they do? I'm happy to schedule a call if you think that'd be of any help!

mih commented 1 year ago

AFAIK siibra also uses fairgraph. So it should be possible. I am currently not able to commit to a call, but if you can point people here, we should be able to figure it out asynchronously. thx!

ymzayek commented 1 year ago

Hello, I'm not sure how siibra depends on fairgraph, but they use the method siibra.fetch_ebrains_token() to produce a link where you can authenticate, and then you can pass the dataset id to siibra.retrieval.repositories.EbrainsHdgConnector() and use another method to search the files. When I did this a few days ago, I remember that after the second step I got an email where I had to click a link to get access. I just tried again now and didn't have to do this step, so I'm not sure how long the access is provided for, but it seems to be more than 24 hours. I am looping @dickscheid in here because maybe he can clarify better how siibra does this and its relationship with fairgraph and the datalad extension.

Full code to reproduce the data fetching described above:

import siibra
from siibra.retrieval.repositories import EbrainsHdgConnector

# produces a link where you can authenticate with EBRAINS
siibra.fetch_ebrains_token()

# the ID is the last part of the KG search URL
dataset_id = "07ab1665-73b0-40c5-800e-557bc319109d"
conn = EbrainsHdgConnector(dataset_id)

# list the files available in the dataset
conn.search_files()

# fetch a single file
data_file = "resulting_smooth_maps/sub-01/ses-14/sub-01_ses-14_task-MTTWE_dir-ap_space-MNI152NLin2009cAsym_desc-preproc_ZMap-we_all_event_response.nii.gz"
img = conn.get(data_file)

mih commented 1 year ago

@ymzayek Thanks for the code snippet. That is very helpful. We should be able to reuse the auth-setup in datalad-ebrains also for calling out to siibra. I will check what it is doing and if that does not inform a code change in the fairgraph call, we could simply employ siibra for this case.

I am not sure whether a non-public data-proxy bucket link always implies the human data gateway, but until we discover counter-evidence, this may be good enough.

mih commented 1 year ago

So looking at https://github.com/FZJ-INM1-BDA/siibra-python/blob/908f118f87ec83def2970d9a526f29f49482e2bc/siibra/retrieval/repositories.py#L354-L449 I see that siibra queries the data-proxy directly, and does not go through the knowledge graph! It does go through the KG for public datasets.

Now I am wondering: we could do the same thing. Moreover, doing it not only for non-public datasets, like the example here, but for any data-proxy accessible dataset may actually solve #52. If that is true, it would boost overall performance by quite a bit!
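
To illustrate, a rough sketch of such a direct data-proxy query, using the bucket URL from the error message at the top of this issue (the response structure is an assumption on my part, not taken from the API docs):

import os
import requests

# bucket URL as reported in the ebrains-clone error above
bucket_url = "https://data-proxy.ebrains.eu/api/v1/buckets/d-07ab1665-73b0-40c5-800e-557bc319109d"

response = requests.get(
    bucket_url,
    # token obtained via `datalad ebrains-authenticate`
    headers={"Authorization": f"Bearer {os.environ['KG_AUTH_TOKEN']}"},
)
response.raise_for_status()
print(response.json())
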

alexisthual commented 1 year ago

@mih, @ymzayek and I are interested in looking more into this, but it feels a bit hard to dive into this codebase on our own. However, we'd be down to schedule a peer-coding session with you some day soon if you are interested in that too! Otherwise, we can try to deal with this asynchronously, but we'll need some of your guidance haha

mih commented 1 year ago

@alexisthual @ymzayek That would be wonderful. We have a regular Zoom call for such things on Tuesdays at 8:30 CET. If that works for you, it would be the easiest option, and @dickscheid would also be in that call.

alexisthual commented 1 year ago

Nice! 8:30 am might be a bit early (the office is rather far haha) but I think I can try and make it next Tuesday :slightly_smiling_face:

ymzayek commented 1 year ago

I think I should be able to make it for next Tuesday as well.

mih commented 1 year ago

Awesome! Apologies for the timing. This is pretty much 11am if-there-would-be-nothing-stupid-to-do o'clock. Please shoot me an email at michael.hanke@gmail.com, and I will send you a zoom link. Thx for your interest!

mih commented 1 year ago

#61 has progressed a bit with today's meeting, but is not yet in a functional state.

@dickscheid pointed out that the HDG documentation should have all the missing information: https://wiki.ebrains.eu/bin/view/Collabs/data-proxy/Human%20Data%20Gateway/

It might require a dedicated implementation of a downloader. This should be fairly straightforward with the UrlOperations framework from datalad-next.
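
To sketch that idea (nothing more than a sketch: the class name is made up, and the exact UrlOperations import path and method signature may differ between datalad-next versions):

import os
import requests

from datalad_next.url_operations import UrlOperations


class HdgUrlOperations(UrlOperations):
    """Hypothetical downloader for EBRAINS data-proxy URLs."""

    def download(self, from_url, to_path, *,
                 credential=None, hash=None, timeout=None):
        # token obtained via `datalad ebrains-authenticate`
        token = os.environ['KG_AUTH_TOKEN']
        with requests.get(
                from_url,
                headers={'Authorization': f'Bearer {token}'},
                stream=True,
                timeout=timeout) as response:
            response.raise_for_status()
            # stream the payload to the target file in chunks
            with open(to_path, 'wb') as dst:
                for chunk in response.iter_content(chunk_size=65536):
                    dst.write(chunk)
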

alexisthual commented 1 year ago

Hi @mih and @dickscheid ! Do you have some time in the coming days or weeks to work some more on this? We'd happily participate in a new peer-coding session if it can help.

alexisthual commented 1 year ago

Hi!

We (@man-shu @ferponce @bthirion) have tried using datalad-next directly with URLs from the EBRAINS data-proxy (which, from our understanding, allows users to directly use URLs), but did not succeed in getting the data. We have also tried to access a bucket directly from the data proxy, but it seems that our dataset does not have a bucket yet (and we don't know if it would be useful).

We did not try to integrate these changes in datalad-ebrains but would happily participate in a peer-coding session if that sounds useful!

alexisthual commented 1 year ago

Hi @mih! I see there has not been much movement on this repo for the past few weeks. We are still interested in this feature but could not get it to work on our end :blush: The HBP is coming to an end, and I don't know if you'll spend more time on this repo; in any case, let us know if we can help with anything!

mih commented 1 year ago

I had the chance to work on this again. #61 refactors the code to allow for interacting with the data proxy API directly. Moreover, it switches access for publicly hosted datasets that are accessible via the DP to use that API too.

I could not get the authentication flow for private dataset access via the HDG to work -- neither in code, nor with https://data-proxy.ebrains.eu/api/docs

I use my EBRAINS token to authenticate. When I POST to /datasets/{dataset_id} to initiate the HDG flow, as instructed at https://wiki.ebrains.eu/bin/view/Collabs/data-proxy/Human%20Data%20Gateway/, I get

{
  "status_code": 401,
  "detail": "User not authenticated properly. The token needs to access the 'roles', 'email', 'team' and 'profile' scope."
}

The corresponding GET request fails (as expected) with

{
  "status_code": 401,
  "detail": "Access has expired, please request access again",
  "can_request_access": true
}

This makes me think that either the EBRAINS session token is the wrong credential here, or that my particular account is insufficient, or I am missing a crucial step in the authorization flow.

@alexisthual if you can get a file listing of a HDG dataset via https://data-proxy.ebrains.eu/api/docs please let me know how, and I am confident that I can achieve the rest.
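
For reference, the failing POST described above corresponds to roughly the following, assuming the requests library (this just reproduces the 401, it is not a working solution):

import os
import requests

token = os.environ["KG_AUTH_TOKEN"]
dataset_id = "07ab1665-73b0-40c5-800e-557bc319109d"

# initiate the HDG access-request flow as described in the wiki
response = requests.post(
    f"https://data-proxy.ebrains.eu/api/v1/datasets/{dataset_id}",
    headers={"Authorization": f"Bearer {token}"},
)
print(response.status_code, response.json())
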

ymzayek commented 1 year ago

Not sure this is helpful but I also tried this. From the browser I can access this private dataset https://data-proxy.ebrains.eu/datasets/07ab1665-73b0-40c5-800e-557bc319109d (authorized through login and email link). Then using that same token:

curl -X 'POST' \
  'https://data-proxy.ebrains.eu/api/v1/datasets/07ab1665-73b0-40c5-800e-557bc319109d' \
  -H 'accept: application/json' \
  -H "Authorization: Bearer $TOKEN" \
  -d ''

I get

{
  "error": "Error accessing userinfo"
}

And the same response with a GET request.