mih closed this issue 1 year ago
Hi Michael,
You might find fairgraph helpful. This is a Python library that builds on top of ebrains-kg-core but also knows about the openMINDS schemas. Documentation here: https://fairgraph.readthedocs.io/en/latest
Example for listing repository contents:
In [1]: from fairgraph import KGClient
In [2]: import fairgraph.openminds.core as omcore
In [3]: client = KGClient(host="core.kg.ebrains.eu")
In [4]: dv = omcore.DatasetVersion.from_id("e472a8c7-d9f9-4e75-9d0b-b137cecbc6a2", client)
In [5]: files = omcore.File.list(client, file_repository=dv.repository)
In [6]: for file in files:
   ...:     print(file.iri)
   ...:
https://object.cscs.ch/v1/AUTH_4791e0a3b3de43e2840fe46d9dc2b334/ext-d000017_DiFuMoAtlases_pub/64/DataDescriptor-DiFuMo(64-dimensions).pdf
https://object.cscs.ch/v1/AUTH_4791e0a3b3de43e2840fe46d9dc2b334/ext-d000017_DiFuMoAtlases_pub/64/labels_64_dictionary.csv
https://object.cscs.ch/v1/AUTH_4791e0a3b3de43e2840fe46d9dc2b334/ext-d000017_DiFuMoAtlases_pub/64/Licence-CC-BY.pdf
https://object.cscs.ch/v1/AUTH_4791e0a3b3de43e2840fe46d9dc2b334/ext-d000017_DiFuMoAtlases_pub/64/maps.nii.gz
https://object.cscs.ch/v1/AUTH_4791e0a3b3de43e2840fe46d9dc2b334/ext-d000017_DiFuMoAtlases_pub/64/3mm/resampled_maps.nii.gz
In [7]: files[0].download(".", client, accept_terms_of_use=True)
Out[7]: PosixPath('DataDescriptor-DiFuMo(64-dimensions).pdf')
@apdavison that looks fantastic! I will take a closer look shortly! Thanks much!
Hey @apdavison! I have now had a chance to try this out, and it works just as advertised -- really cool! I particularly like the readily accessible content of properties, e.g. the type of a file hash:
>>> f.hash
Hash(algorithm='MD5', digest='1dc869c088d4ebd615287fb79b5853b2')
I was hoping that you had also come up with a convention to derive local file paths from the IRIs of files, but I could not find anything related to that. To be more specific:
I can know the repo a file is in:
>>> f.file_repository
KGProxy([<class 'fairgraph.openminds.core.data.file_repository.FileRepository'>], 'https://kg.ebrains.eu/api/instances/00932cbe-f90f-4968-a91f-da717c554320')
I can also know the IRI of the file, pointing into that repo:
>>> f.iri
IRI(https://object.cscs.ch/v1/AUTH_4791e0a3b3de43e2840fe46d9dc2b334/ext-d000017_DiFuMoAtlases_pub/64/DataDescriptor-DiFuMo(64-dimensions).pdf)
But there seems to be no standard implementation to derive from this information that a suitable local path could be 64/DataDescriptor-DiFuMo(64-dimensions).pdf.
It seems to require something like "longest-common-prefix". However, this would quickly become messy when a dataset incorporates files from different file repositories (which I understand is not done (yet?), but is possible).
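To illustrate what I mean, here is a minimal sketch using only the Python standard library. The function names are made up for this example and are not part of fairgraph, and the repository root IRI is my assumption about where the repository starts. It also shows why a plain common prefix is not enough: because every file in this dataset sits under 64/, that directory gets stripped too, unless the root is given explicitly.

```python
import posixpath
from urllib.parse import urlsplit, unquote


def relative_to_common_prefix(iris):
    """Relative paths below the deepest directory shared by all IRIs."""
    paths = [unquote(urlsplit(iri).path) for iri in iris]
    # Compare parent directories so that a single file still yields its name.
    base = posixpath.commonpath([posixpath.dirname(p) for p in paths])
    return [posixpath.relpath(p, base) for p in paths]


def relative_to_root(iri, root_iri):
    """Relative path of a file IRI below an explicitly given root IRI."""
    path = unquote(urlsplit(iri).path)
    root = unquote(urlsplit(root_iri).path)
    return posixpath.relpath(path, root)


# ASSUMPTION: the repository root is the container URL itself.
root = ("https://object.cscs.ch/v1/AUTH_4791e0a3b3de43e2840fe46d9dc2b334/"
        "ext-d000017_DiFuMoAtlases_pub")
iris = [
    root + "/64/DataDescriptor-DiFuMo(64-dimensions).pdf",
    root + "/64/3mm/resampled_maps.nii.gz",
]

print(relative_to_common_prefix(iris))
print(relative_to_root(iris[0], root))
```

The first call drops the shared 64/ directory, while the second keeps it. The second variant only gives sensible results when the file IRI really lies below the given root; otherwise the result starts with "..".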
Are you aware of an implementation for that?
Thanks in advance!
Unrelated, but worth noting: fairgraph also exhibits really slow file repository queries (e.g. 20-30 s for a 5-file repository, such as the one demoed above). I had suspected that I was somehow doing it suboptimally with my custom queries, but now it looks like a more general issue.
So far we have used hand-crafted queries. With ebrains-kg-core available, it makes sense to remove this complication and switch to the standard queries provided by this API wrapper. Here are examples:
Establish query setup:
Get info on a dataset (by ID):
Get info on a file repository (by ID from dataset version record)
At present it is unclear to me how to get the actual file repository content listing. The only way I see is to visit the URL of type https://core.kg.ebrains.eu/vocab/lastSyncIRI, which yields an XML-formatted response that has the container content list. There is likely a better way that does not require a different query/parsing paradigm.
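For reference, if that response follows the OpenStack Swift container-listing layout (an assumption on my part, I have not verified it for every repository), the listing could be parsed with the standard library alone. A minimal sketch on a made-up sample; the function name, container name, and byte counts are invented for illustration:

```python
import xml.etree.ElementTree as ET

# Made-up sample in the assumed Swift-style layout, NOT a real response.
SAMPLE = """<container name="example_container">
  <object>
    <name>64/maps.nii.gz</name>
    <bytes>1024</bytes>
  </object>
  <object>
    <name>64/3mm/resampled_maps.nii.gz</name>
    <bytes>512</bytes>
  </object>
</container>
"""


def list_objects(xml_text):
    """Return (name, size in bytes) for every <object> in the listing."""
    root = ET.fromstring(xml_text)
    return [
        (obj.findtext("name"), int(obj.findtext("bytes", default="0")))
        for obj in root.iter("object")
    ]


for name, size in list_objects(SAMPLE):
    print(name, size)
```

The object names come back relative to the container, so they could double as local paths.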