Sketch of a fairgraph-based implementation

mih commented 2 years ago

fairgraph is a Python API for EBRAINS Knowledge Graph queries, developed by the EBRAINS community. https://fairgraph.readthedocs.io

This is intended to replace all prior query implementations.

Right now it is only usable in a Python session like so:

>>> from pathlib import Path
>>> from datalad_ebrains.fairgraph_query import FairGraphQuery
>>> fq = FairGraphQuery()
>>> # and to bootstrap a dataset
>>> list(fq.bootstrap('a8932c7e-063c-4131-ab96-996d843998e9', Path('/tmp/kgdstryout')))

where a8932c7e-063c-4131-ab96-996d843998e9 is the ID of a knowledge graph Dataset or DatasetVersion in openMINDS terminology.

https://search.kg.ebrains.eu/instances/a8932c7e-063c-4131-ab96-996d843998e9

This will identify the underlying Dataset, and subsequently traverse all known versions.

It will then generate a DataLad dataset with one commit per DatasetVersion. Given unchanged information on the side of the EBRAINS knowledge graph, this dataset generation is reproducible: running this code twice will generate the exact same dataset (down to the Git commit SHA values).
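Why identical inputs can yield identical commit SHAs: in a default (SHA-1) Git repository, an object ID is the hash of a short header plus the object content, and a commit's content consists only of its tree, parents, author and committer lines, and message. The sketch below illustrates that property in isolation. The helper names and the example author are hypothetical, and how datalad-ebrains actually pins author and timestamp information is not stated in this thread.

```python
import hashlib


def git_object_id(kind: str, payload: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to an object of this kind."""
    header = f"{kind} {len(payload)}".encode() + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


def commit_payload(tree: str, author: str, timestamp: int, message: str) -> bytes:
    """Build the content of a parentless commit (hypothetical helper)."""
    return (
        f"tree {tree}\n"
        f"author {author} {timestamp} +0000\n"
        f"committer {author} {timestamp} +0000\n"
        f"\n{message}\n"
    ).encode()


# ID of the empty tree, computed rather than hard-coded
empty_tree = git_object_id("tree", b"")
# hypothetical author identity, for illustration only
author = "Jane Doe <jane@example.com>"

first = git_object_id("commit", commit_payload(empty_tree, author, 1600000000, "v1"))
second = git_object_id("commit", commit_payload(empty_tree, author, 1600000000, "v1"))
later = git_object_id("commit", commit_payload(empty_tree, author, 1600000001, "v1"))

print(first == second)  # same content, same commit ID
print(first == later)   # a one-second timestamp change alters the ID
```

The practical consequence is that any value taken from the wall clock or the local user configuration would break reproducibility, so all commit fields would have to be derived from the knowledge graph records or otherwise fixed.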

Each commit will contain file pointers to the respective EBRAINS file repository, referring to all files in their particular version that are part of a particular DatasetVersion.

The version_innovation is used as the commit message, and the version_identifier is assigned as a tag to each release commit.

The screenshot shows the resulting dataset visualized with DataLad Gooey.

Closes #2
Closes #3
Closes #16 (the demo dataset above is the "Julich-Brain maximum probability map in EBRAINS")
Closes #33

adswa commented 1 year ago

I tried the code above as a freshly registered user after exporting the token I got from datalad ebrains-authenticate. It took a while and threw some warnings, and eventually ended in an error, but I suspect this error is on the server side.

In [4]: list(fq.bootstrap('a8932c7e-063c-4131-ab96-996d843998e9', Path('/tmp/kgdstryout')))
/home/adina/env/ebrains/lib/python3.10/site-packages/fairgraph/fields.py:256: UserWarning: 'str' object has no attribute 'get'
  warnings.warn(str(err))
/home/adina/env/ebrains/lib/python3.10/site-packages/fairgraph/fields.py:96: UserWarning: Field 'full_documentation' is required but was not provided.
  warnings.warn(errmsg)
/home/adina/env/ebrains/lib/python3.10/site-packages/fairgraph/fields.py:256: UserWarning: data must be a list or dict
  warnings.warn(str(err))
/home/adina/env/ebrains/lib/python3.10/site-packages/fairgraph/fields.py:96: UserWarning: Field 'value' should be of type (<class 'float'>,), not <class 'int'>
  warnings.warn(errmsg)
/home/adina/env/ebrains/lib/python3.10/site-packages/fairgraph/fields.py:256: UserWarning: data must be a list or dict                                                          
  warnings.warn(str(err))                                                                                                                                                       
/home/adina/env/ebrains/lib/python3.10/site-packages/fairgraph/fields.py:96: UserWarning: Field 'full_documentation' is required but was not provided.
  warnings.warn(errmsg)
/home/adina/env/ebrains/lib/python3.10/site-packages/fairgraph/fields.py:96: UserWarning: Field 'value' should be of type (<class 'float'>,), not <class 'int'>
  warnings.warn(errmsg)
/home/adina/env/ebrains/lib/python3.10/site-packages/fairgraph/fields.py:256: UserWarning: data must be a list or dict                                                          
  warnings.warn(str(err))                                                                                                                                                       
/home/adina/env/ebrains/lib/python3.10/site-packages/fairgraph/fields.py:96: UserWarning: Field 'full_documentation' is required but was not provided.
  warnings.warn(errmsg)
/home/adina/env/ebrains/lib/python3.10/site-packages/fairgraph/fields.py:96: UserWarning: Field 'value' should be of type (<class 'float'>,), not <class 'int'>
  warnings.warn(errmsg)
/home/adina/env/ebrains/lib/python3.10/site-packages/fairgraph/fields.py:256: UserWarning: data must be a list or dict                                                          
  warnings.warn(str(err))                                                                                                                                                       
/home/adina/env/ebrains/lib/python3.10/site-packages/fairgraph/fields.py:96: UserWarning: Field 'full_documentation' is required but was not provided.
  warnings.warn(errmsg)
/home/adina/env/ebrains/lib/python3.10/site-packages/fairgraph/fields.py:96: UserWarning: Field 'value' should be of type (<class 'float'>,), not <class 'int'>
  warnings.warn(errmsg)
/home/adina/env/ebrains/lib/python3.10/site-packages/fairgraph/fields.py:256: UserWarning: data must be a list or dict                                                          
  warnings.warn(str(err))                                                                                                                                                       
/home/adina/env/ebrains/lib/python3.10/site-packages/fairgraph/fields.py:96: UserWarning: Field 'full_documentation' is required but was not provided.
  warnings.warn(errmsg)
/home/adina/env/ebrains/lib/python3.10/site-packages/fairgraph/fields.py:96: UserWarning: Field 'value' should be of type (<class 'float'>,), not <class 'int'>
  warnings.warn(errmsg)
/home/adina/env/ebrains/lib/python3.10/site-packages/fairgraph/fields.py:256: UserWarning: data must be a list or dict                                                          
  warnings.warn(str(err))                                                                                                                                                       
/home/adina/env/ebrains/lib/python3.10/site-packages/fairgraph/fields.py:96: UserWarning: Field 'full_documentation' is required but was not provided.
  warnings.warn(errmsg)
/home/adina/env/ebrains/lib/python3.10/site-packages/fairgraph/fields.py:96: UserWarning: Field 'value' should be of type (<class 'float'>,), not <class 'int'>
  warnings.warn(errmsg)
/home/adina/env/ebrains/lib/python3.10/site-packages/fairgraph/fields.py:256: UserWarning: 'str' object has no attribute 'get'                                                  
  warnings.warn(str(err))                                                                                                                                                       
/home/adina/env/ebrains/lib/python3.10/site-packages/fairgraph/fields.py:96: UserWarning: Field 'full_documentation' is required but was not provided.
  warnings.warn(errmsg)
/home/adina/env/ebrains/lib/python3.10/site-packages/fairgraph/fields.py:96: UserWarning: Field 'value' should be of type (<class 'float'>,), not <class 'int'>
  warnings.warn(errmsg)
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[4], line 1
----> 1 list(fq.bootstrap('a8932c7e-063c-4131-ab96-996d843998e9', Path('/tmp/kgdstryout')))

File ~/repos/datalad-ebrains/datalad_ebrains/fairgraph_query.py:35, in FairGraphQuery.bootstrap(self, from_id, path)
     32 # TODO support a starting version for the import
     33 # TODO maybe derive automatically from a tag?
     34 for kg_dsver in kg_dsversions:
---> 35     yield from self.import_datasetversion(
     36         ds, kg_dsver.resolve(self.client))

File ~/repos/datalad-ebrains/datalad_ebrains/fairgraph_query.py:79, in FairGraphQuery.import_datasetversion(self, ds, kg_dsver)
     77 def import_datasetversion(self, ds, kg_dsver):
     78     self.clean_ds_worktree(ds)
---> 79     yield from self.import_files(ds, kg_dsver)
     80     self.import_metadata(ds, kg_dsver)
     81     yield from self.save_ds_version(ds, kg_dsver)

File ~/repos/datalad-ebrains/datalad_ebrains/fairgraph_query.py:103, in FairGraphQuery.import_files(self, ds, kg_dsver)
    102 def import_files(self, ds, kg_dsver):
--> 103     yield from ds.addurls(
    104         # Turn query into an iterable of dicts for addurls
    105         urlfile=self.get_file_records(ds, kg_dsver),
    106         urlformat='{url}',
    107         filenameformat='{name}',
    108         # construct annex key from EBRAINS supplied info
    109         #key='et:MD5-s{size}--{md5sum}',
    110         # we will have a better idea than "auto"
    111         exclude_autometa='*',
    112         # and here it would be
    113         #meta=(
    114         #    'ebrains_last_modified={last_modified}',
    115         #    'ebrain_last_modification_userid={last_modifier}',
    116         #),
    117         fast=True,
    118         save=False,
    119         result_renderer='disabled',
    120         return_type='generator',
    121     )

File ~/repos/datalad/datalad/interface/base.py:873, in _execute_command_(interface, cmd, cmd_args, cmd_kwargs, exec_kwargs)
    867 pass_summary = do_custom_result_summary \
    868     and getattr(interface,
    869                 'custom_result_summary_renderer_pass_summary',
    870                 None)
    872 # process main results
--> 873 for r in _process_results(
    874         # execution
    875         cmd(*cmd_args, **cmd_kwargs),
    876         interface,
    877         allkwargs['on_failure'],
    878         # bookkeeping
    879         action_summary,
    880         incomplete_results,
    881         # communication
    882         result_renderer,
    883         result_log_level,
    884         # let renderers get to see how a command was called
    885         allkwargs):
    886     for hook, spec in hooks.items():
    887         # run the hooks before we yield the result
    888         # this ensures that they are executed before
    889         # a potentially wrapper command gets to act
    890         # on them
    891         if match_jsonhook2result(hook, r, spec['match']):

File ~/repos/datalad/datalad/interface/utils.py:319, in _process_results(results, cmd_class, on_failure, action_summary, incomplete_results, result_renderer, result_log_level, allkwargs)
    312 # how many repetitions to show, before suppression kicks in
    313 render_n_repetitions = \
    314     dlcfg.obtain('datalad.ui.suppress-similar-results-threshold') \
    315         if sys.stdout.isatty() \
    316            and dlcfg.obtain('datalad.ui.suppress-similar-results') \
    317         else float("inf")
--> 319 for res in results:
    320     if not res or 'action' not in res:
    321         # XXX Yarik has to no clue on how to track the origin of the
    322         # record to figure out WTF, so he just skips it
    323         # but MIH thinks leaving a trace of that would be good
    324         lgr.debug('Drop result record without "action": %s', res)

File ~/repos/datalad/datalad/local/addurls.py:1395, in Addurls.__call__(urlfile, urlformat, filenameformat, dataset, input_type, exclude_autometa, meta, key, message, dry_run, fast, ifexists, missing_value, save, version_urls, cfg_proc, jobs, drop_after, on_collision)
   1393 else:
   1394     displayed_source = "<records>"
-> 1395     records = ensure_list(url_file)
   1396     colidx_to_name = {}
   1398 rows = None

File ~/repos/datalad/datalad/utils.py:736, in ensure_list(s, copy, iterate)
    724 def ensure_list(s, copy=False, iterate=True):
    725     """Given not a list, would place it into a list. If None - empty list is returned
    726 
    727     Parameters
   (...)
    734       iterate over it.
    735     """
--> 736     return ensure_iter(s, list, copy=copy, iterate=iterate)

File ~/repos/datalad/datalad/utils.py:717, in ensure_iter(s, cls, copy, iterate)
    715     return cls((s,))
    716 elif iterate and hasattr(s, '__iter__'):
--> 717     return cls(s)
    718 elif s is None:
    719     return cls()

File ~/repos/datalad-ebrains/datalad_ebrains/fairgraph_query.py:132, in FairGraphQuery.get_file_records(self, ds, kg_dsver)
    127 # get the repos base url by removing the query string
    128 # input is like: https://example.com/<basepath>?prefix=MPM-collections/13/
    129 # output is: https://example.com/<basepath>
    130 # the prefix is part of the file IRIs again
    131 dvr_baseurl = urlparse(dvr.iri.value)._replace(query='').geturl()
--> 132 for f in self.iter_files(dvr):
    133     f_url = f.iri.value
    134     # the IRI is not a valid URL(?!), we must quote the path
    135     # to make it such

File ~/repos/datalad-ebrains/datalad_ebrains/fairgraph_query.py:154, in FairGraphQuery.iter_files(self, dvr, chunk_size)
    152 cur_index = 0
    153 while True:
--> 154     batch = omcore.File.list(
    155         self.client,
    156         file_repository=dvr,
    157         limit=chunk_size,
    158         from_index=cur_index)
    159     for f in batch:
    160         yield f

File ~/env/ebrains/lib/python3.10/site-packages/fairgraph/base_v3.py:520, in KGObject.list(cls, client, size, from_index, api, scope, resolved, space, **filters)
    518 normalized_filters = normalize_filter(cls, filters) or None
    519 query = cls._get_query_definition(client, normalized_filters, space, resolved)
--> 520 instances = client.query(
    521     normalized_filters, query["@id"],
    522     space=space,
    523     from_index=from_index, size=size,
    524     scope=scope
    525 ).data
    526 for instance in instances:
    527     instance["@context"] = cls.context

File ~/env/ebrains/lib/python3.10/site-packages/fairgraph/client_v3.py:150, in KGv3Client.query(self, filter, query_id, space, instance_id, from_index, size, scope, id_key)
    148     return response
    149 else:
--> 150     return _query(scope, from_index, size)

File ~/env/ebrains/lib/python3.10/site-packages/fairgraph/client_v3.py:127, in KGv3Client.query.<locals>._query(scope, from_index, size)
    118 def _query(scope, from_index, size):
    119     response = self._kg_client.queries.execute_query_by_id(
    120         query_id=self.uuid_from_uri(query_id),
    121         additional_request_params=filter or {},
   (...)
    125         #restrict_to_spaces=[space] if space else None,
    126     )
--> 127     return self._check_response(response)

File ~/env/ebrains/lib/python3.10/site-packages/fairgraph/client_v3.py:112, in KGv3Client._check_response(self, response, ignore_not_found, error_context)
    110         return response
    111     else:
--> 112         raise Exception(f"Error: {response.error} {error_context}")
    113 else:
    114     return response

Exception: Error: code=500 message='Internal Server Error' uuid=None
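The repeated UserWarning lines from fairgraph/fields.py in the output above make the actual error harder to spot. If they are considered noise for this use case, they could be silenced for the duration of a call with the standard `warnings` filter. This is a minimal sketch, assuming the warnings are attributed to the `fairgraph.fields` module as the file paths above suggest; the context manager name is hypothetical.

```python
import warnings
from contextlib import contextmanager


@contextmanager
def quiet_fairgraph_field_warnings():
    """Temporarily ignore UserWarnings issued from fairgraph.fields.

    Assumption: the warnings originate in the module named
    "fairgraph.fields", matching the fields.py paths in the output.
    """
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "ignore",
            category=UserWarning,
            module=r"fairgraph\.fields",
        )
        yield
```

Warnings from any other module still pass through, and the previous filter state is restored when the block exits.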

mih commented 1 year ago

OK, this is unexpected, and very valuable information. The error is strange, but it may be related to a particular permission setup that my account has and yours doesn't. I will investigate. Thx!

adswa commented 1 year ago

A second attempt worked. Maybe there should be some form of automatic retry?
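One possible shape for such a retry, as a minimal sketch and not the project's implementation: a small wrapper that re-runs a single query call with exponential backoff. The function names and default values are hypothetical. The check for "code=500" in the message relies on the exception text shown in the traceback above, because fairgraph raises a plain Exception there.

```python
import time


def is_server_error(exc: Exception) -> bool:
    """Heuristic taken from the traceback: the message contains 'code=500'."""
    return "code=500" in str(exc)


def call_with_retry(func, attempts=3, base_delay=1.0,
                    should_retry=is_server_error, sleep=time.sleep):
    """Call func() and retry on transient server errors.

    Waits base_delay, then 2 * base_delay, and so on between attempts.
    Errors that do not look transient, or the last failure, are re-raised.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == attempts or not should_retry(exc):
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

In the traceback, the failing call is `omcore.File.list(...)` inside `iter_files`, so one option would be to wrap only that call, for example `call_with_retry(lambda: omcore.File.list(self.client, file_repository=dvr, limit=chunk_size, from_index=cur_index))`, instead of restarting the whole `bootstrap` generator after a partial import.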

mih commented 1 year ago

OK, thanks for the update. I'll keep this open and look into some form of mitigation.

datalad / datalad-ebrains

Sketch of a fairgraph-based implementation #35