RDFLib / pySHACL

A Python validator for SHACL
Apache License 2.0

"validate" fails loading graphs when both are file paths, but mixed formats. #276

Open henrieglesorotos opened 1 week ago

henrieglesorotos commented 1 week ago

I am validating a graph at a .nt filepath, but using a .ttl set of shapes.

validate(PATH_1_NT, shacl_graph=PATH_2_TTL, allow_warnings=self.allow_warnings, debug=self.debug)

I tried to decipher what was going on in load_from_source, but thought it would be quicker to ask here.

Using python 3.10.

poetry show pyshacl
 name         : pyshacl
 version      : 0.28.1
 description  : Python SHACL Validator

dependencies
 - importlib-metadata >6
 - owlrl >=6.0.2,<7
 - packaging >=21.3
 - prettytable >=3.5.0
 - rdflib >=6.3.2,<8.0

Note I have to use 0.28.1 as I am on rdflib 7.0.

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /Users/henri.egle-sorotos/repos/job-architecture-monorepo/.venv/bin/edge-mlops:8 in <module>     │
│                                                                                                  │
│ /Users/henri.egle-sorotos/repos/job-architecture-monorepo/.venv/lib/python3.10/site-packages/cli │
│ ck/core.py:1157 in __call__                                                                      │
│                                                                                                  │
│ /Users/henri.egle-sorotos/repos/job-architecture-monorepo/.venv/lib/python3.10/site-packages/cli │
│ ck/core.py:1078 in main                                                                          │
│                                                                                                  │
│ /Users/henri.egle-sorotos/repos/job-architecture-monorepo/.venv/lib/python3.10/site-packages/cli │
│ ck/core.py:1688 in invoke                                                                        │
│                                                                                                  │
│ /Users/henri.egle-sorotos/repos/job-architecture-monorepo/.venv/lib/python3.10/site-packages/cli │
│ ck/core.py:1688 in invoke                                                                        │
│                                                                                                  │
│ /Users/henri.egle-sorotos/repos/job-architecture-monorepo/.venv/lib/python3.10/site-packages/cli │
│ ck/core.py:1434 in invoke                                                                        │
│                                                                                                  │
│ /Users/henri.egle-sorotos/repos/job-architecture-monorepo/.venv/lib/python3.10/site-packages/cli │
│ ck/core.py:783 in invoke                                                                         │
│                                                                                                  │
│ /Users/henri.egle-sorotos/repos/job-architecture-monorepo/.venv/lib/python3.10/site-packages/cli │
│ ck/decorators.py:33 in new_func                                                                  │
│                                                                                                  │
│ /Users/henri.egle-sorotos/repos/job-architecture-monorepo/.venv/lib/python3.10/site-packages/edg │
│ e_mlops/cli/kedro.py:130 in run                                                                  │
│                                                                                                  │
│   127 │   │   │   hook_cls = edge_mlops.utils.import_cls_by_fq_name(hook_cls)                    │
│   128 │   │   │   hook_args.get(ctx, hook_cls).update(**kwargs)                                  │
│   129 │                                                                                          │
│ ❱ 130 │   run_kedro_pipeline(                                                                    │
│   131 │   │   ctx,                                                                               │
│   132 │   │   env=env,                                                                           │
│   133 │   │   pipeline=pipeline,                                                                 │
│                                                                                                  │
│ /Users/henri.egle-sorotos/repos/job-architecture-monorepo/.venv/lib/python3.10/site-packages/edg │
│ e_mlops/kedro/cli_tools.py:107 in run_kedro_pipeline                                             │
│                                                                                                  │
│   104 │   if append_cli_cmd_to_run_desc:                                                         │
│   105 │   │   hook_args.add_cmd_to_run_desc(ctx)                                                 │
│   106 │                                                                                          │
│ ❱ 107 │   kedro_runner.run_kedro_pipeline(                                                       │
│   108 │   │   env=env,                                                                           │
│   109 │   │   pipeline=pipeline,                                                                 │
│   110 │   │   project_path=project_path,                                                         │
│                                                                                                  │
│ /Users/henri.egle-sorotos/repos/job-architecture-monorepo/.venv/lib/python3.10/site-packages/edg │
│ e_mlops/kedro/runner.py:29 in run_kedro_pipeline                                                 │
│                                                                                                  │
│   26 │   │   │   │   for hook_type, args in (hook_args or {}).items()                            │
│   27 │   │   │   },                                                                              │
│   28 │   │   ) as session:                                                                       │
│ ❱ 29 │   │   │   session.run(pipeline_name=pipeline, **(session_run_kwargs or {}))               │
│   30                                                                                             │
│                                                                                                  │
│ /Users/henri.egle-sorotos/repos/job-architecture-monorepo/.venv/lib/python3.10/site-packages/edg │
│ e_mlops/kedro/kedro_tools.py:241 in run                                                          │
│                                                                                                  │
│   238 │   def run(self, pipeline_name: str = None, *args, **kwargs) -> Dict[str, Any]:           │
│   239 │   │   self.pipelines.append(pipeline_name)                                               │
│   240 │   │   self.context.pipeline = pipeline_name                                              │
│ ❱ 241 │   │   retval = super().run(pipeline_name, *args, **kwargs)                               │
│   242 │   │   self.context.pipeline = None                                                       │
│   243 │   │   return retval                                                                      │
│   244                                                                                            │
│                                                                                                  │
│ /Users/henri.egle-sorotos/repos/job-architecture-monorepo/.venv/lib/python3.10/site-packages/ked │
│ ro/framework/session/session.py:436 in run                                                       │
│                                                                                                  │
│   433 │   │   )                                                                                  │
│   434 │   │                                                                                      │
│   435 │   │   try:                                                                               │
│ ❱ 436 │   │   │   run_result = runner.run(                                                       │
│   437 │   │   │   │   filtered_pipeline, catalog, hook_manager, session_id                       │
│   438 │   │   │   )                                                                              │
│   439 │   │   │   self._run_called = True                                                        │
│                                                                                                  │
│ /Users/henri.egle-sorotos/repos/job-architecture-monorepo/.venv/lib/python3.10/site-packages/ked │
│ ro/runner/runner.py:103 in run                                                                   │
│                                                                                                  │
│   100 │   │   │   self._logger.info(                                                             │
│   101 │   │   │   │   "Asynchronous mode is enabled for loading and saving data"                 │
│   102 │   │   │   )                                                                              │
│ ❱ 103 │   │   self._run(pipeline, catalog, hook_manager, session_id)                             │
│   104 │   │                                                                                      │
│   105 │   │   self._logger.info("Pipeline execution completed successfully.")                    │
│   106                                                                                            │
│                                                                                                  │
│ /Users/henri.egle-sorotos/repos/job-architecture-monorepo/.venv/lib/python3.10/site-packages/ked │
│ ro/runner/sequential_runner.py:70 in _run                                                        │
│                                                                                                  │
│   67 │   │                                                                                       │
│   68 │   │   for exec_index, node in enumerate(nodes):                                           │
│   69 │   │   │   try:                                                                            │
│ ❱ 70 │   │   │   │   run_node(node, catalog, hook_manager, self._is_async, session_id)           │
│   71 │   │   │   │   done_nodes.add(node)                                                        │
│   72 │   │   │   except Exception:                                                               │
│   73 │   │   │   │   self._suggest_resume_scenario(pipeline, done_nodes, catalog)                │
│                                                                                                  │
│ /Users/henri.egle-sorotos/repos/job-architecture-monorepo/.venv/lib/python3.10/site-packages/ked │
│ ro/runner/runner.py:331 in run_node                                                              │
│                                                                                                  │
│   328 │   if is_async:                                                                           │
│   329 │   │   node = _run_node_async(node, catalog, hook_manager, session_id)                    │
│   330 │   else:                                                                                  │
│ ❱ 331 │   │   node = _run_node_sequential(node, catalog, hook_manager, session_id)               │
│   332 │                                                                                          │
│   333 │   for name in node.confirms:                                                             │
│   334 │   │   catalog.confirm(name)                                                              │
│                                                                                                  │
│ /Users/henri.egle-sorotos/repos/job-architecture-monorepo/.venv/lib/python3.10/site-packages/ked │
│ ro/runner/runner.py:426 in _run_node_sequential                                                  │
│                                                                                                  │
│   423 │   )                                                                                      │
│   424 │   inputs.update(additional_inputs)                                                       │
│   425 │                                                                                          │
│ ❱ 426 │   outputs = _call_node_run(                                                              │
│   427 │   │   node, catalog, inputs, is_async, hook_manager, session_id=session_id               │
│   428 │   )                                                                                      │
│   429                                                                                            │
│                                                                                                  │
│ /Users/henri.egle-sorotos/repos/job-architecture-monorepo/.venv/lib/python3.10/site-packages/ked │
│ ro/runner/runner.py:392 in _call_node_run                                                        │
│                                                                                                  │
│   389 │   │   │   is_async=is_async,                                                             │
│   390 │   │   │   session_id=session_id,                                                         │
│   391 │   │   )                                                                                  │
│ ❱ 392 │   │   raise exc                                                                          │
│   393 │   hook_manager.hook.after_node_run(                                                      │
│   394 │   │   node=node,                                                                         │
│   395 │   │   catalog=catalog,                                                                   │
│                                                                                                  │
│ /Users/henri.egle-sorotos/repos/job-architecture-monorepo/.venv/lib/python3.10/site-packages/ked │
│ ro/runner/runner.py:382 in _call_node_run                                                        │
│                                                                                                  │
│   379 ) -> dict[str, Any]:                                                                       │
│   380 │                                                                                          │
│   381 │   try:                                                                                   │
│ ❱ 382 │   │   outputs = node.run(inputs)                                                         │
│   383 │   except Exception as exc:                                                               │
│   384 │   │   hook_manager.hook.on_node_error(                                                   │
│   385 │   │   │   error=exc,                                                                     │
│                                                                                                  │
│ /Users/henri.egle-sorotos/repos/job-architecture-monorepo/.venv/lib/python3.10/site-packages/ked │
│ ro/pipeline/node.py:357 in run                                                                   │
│                                                                                                  │
│   354 │   │   # purposely catch all exceptions                                                   │
│   355 │   │   except Exception as exc:                                                           │
│   356 │   │   │   self._logger.error("Node '%s' failed with error: \n%s", str(self), str(exc))   │
│ ❱ 357 │   │   │   raise exc                                                                      │
│   358 │                                                                                          │
│   359 │   def _run_with_no_inputs(self, inputs: dict[str, Any]):                                 │
│   360 │   │   if inputs:                                                                         │
│                                                                                                  │
│ /Users/henri.egle-sorotos/repos/job-architecture-monorepo/.venv/lib/python3.10/site-packages/ked │
│ ro/pipeline/node.py:346 in run                                                                   │
│                                                                                                  │
│   343 │   │   │   if not self._inputs:                                                           │
│   344 │   │   │   │   outputs = self._run_with_no_inputs(inputs)                                 │
│   345 │   │   │   elif isinstance(self._inputs, str):                                            │
│ ❱ 346 │   │   │   │   outputs = self._run_with_one_input(inputs, self._inputs)                   │
│   347 │   │   │   elif isinstance(self._inputs, list):                                           │
│   348 │   │   │   │   outputs = self._run_with_list(inputs, self._inputs)                        │
│   349 │   │   │   elif isinstance(self._inputs, dict):                                           │
│                                                                                                  │
│ /Users/henri.egle-sorotos/repos/job-architecture-monorepo/.venv/lib/python3.10/site-packages/ked │
│ ro/pipeline/node.py:377 in _run_with_one_input                                                   │
│                                                                                                  │
│   374 │   │   │   │   f"{sorted(inputs.keys())}."                                                │
│   375 │   │   │   )                                                                              │
│   376 │   │                                                                                      │
│ ❱ 377 │   │   return self._func(inputs[node_input])                                              │
│   378 │                                                                                          │
│   379 │   def _run_with_list(self, inputs: dict[str, Any], node_inputs: list[str]):              │
│   380 │   │   # Node inputs and provided run inputs should completely overlap                    │
│                                                                                                  │
│ /Users/henri.egle-sorotos/repos/job-architecture-monorepo/.venv/lib/python3.10/site-packages/edg │
│ e_mlops/utils.py:1306 in __call__                                                                │
│                                                                                                  │
│   1303 │                                                                                         │
│   1304 │   def __call__(self, *args, **kwargs):                                                  │
│   1305 │   │   try:                                                                              │
│ ❱ 1306 │   │   │   return self.func(*args, **kwargs)                                             │
│   1307 │   │   except KeyboardInterrupt as interrupt:                                            │
│   1308 │   │   │   raise RuntimeError("Interrupted") from interrupt                              │
│   1309                                                                                           │
│                                                                                                  │
│ /Users/henri.egle-sorotos/repos/job-architecture-monorepo/src/ja_customer_data_tools/pipelines/c │
│ onstraint_validation/nodes.py:57 in apply                                                        │
│                                                                                                  │
│   54 │   │   con_g = Graph()                                                                     │
│   55 │   │   con_g.parse(self.shapes_path, format="ttl")                                         │
│   56 │   │   print(len(con_g))                                                                   │
│ ❱ 57 │   │   conforms, results_graph, results_text = validate(g, shacl_graph=self.shapes_path    │
│   58 │   │   print(results_text)                                                                 │
│   59 │   │   return (results_text, conforms, report_graph)                                       │
│   60                                                                                             │
│                                                                                                  │
│ /Users/henri.egle-sorotos/repos/job-architecture-monorepo/.venv/lib/python3.10/site-packages/pys │
│ hacl/entrypoints.py:142 in validate                                                              │
│                                                                                                  │
│   139 │   shacl_graph_format = kwargs.pop('shacl_graph_format', None)                            │
│   140 │   if shacl_graph is not None:                                                            │
│   141 │   │   rdflib_bool_patch()                                                                │
│ ❱ 142 │   │   loaded_sg = load_from_source(                                                      │
│   143 │   │   │   shacl_graph, rdf_format=shacl_graph_format, multigraph=True, do_owl_imports=   │
│   144 │   │   )                                                                                  │
│   145 │   │   rdflib_bool_unpatch()                                                              │
│                                                                                                  │
│ /Users/henri.egle-sorotos/repos/job-architecture-monorepo/.venv/lib/python3.10/site-packages/pys │
│ hacl/rdfutil/load.py:308 in load_from_source                                                     │
│                                                                                                  │
│   305 │   │   if not source_as_file and not source_as_filename and not open_source:              │
│   306 │   │   │   source_as_bytes = source                                                       │
│   307 │   else:                                                                                  │
│ ❱ 308 │   │   raise ValueError("Cannot determine the format of the input graph")                 │
│   309 │   if g is None:                                                                          │
│   310 │   │   if source_is_graph:                                                                │
│   311 │   │   │   target_g: Union[rdflib.Graph, rdflib.ConjunctiveGraph, rdflib.Dataset] = sou   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: Cannot determine the format of the input graph

ashleysommer commented 1 week ago

@henrieglesorotos Thanks for the bug report.

I haven't yet looked into exactly what is causing the issue in load_from_source. In the meantime, a quick workaround is to pass the data graph format and the shapes graph format to validate() explicitly, like this:

validate(PATH_1_NT, shacl_graph=PATH_2_TTL, allow_warnings=self.allow_warnings, debug=self.debug, data_graph_format="ntriples", shacl_graph_format="turtle")

That way PySHACL does not need to guess the format of each graph before parsing it. (It should be able to work that out from the filename, though, so I think there is still a bug.)
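As a rough sketch of that workaround, the format for each path can be derived from its file extension and passed to validate() explicitly. The helper names format_for and validate_paths and the SUFFIX_TO_FORMAT table are hypothetical and are not part of PySHACL. The table is an illustrative subset of rdflib parser names. rdflib also ships its own guess_format helper in rdflib.util, which could be used in place of the table.

```python
from pathlib import Path

# Illustrative suffix -> rdflib parser name table (an assumption, not PySHACL's own).
SUFFIX_TO_FORMAT = {
    ".nt": "ntriples",
    ".ttl": "turtle",
    ".rdf": "xml",
    ".jsonld": "json-ld",
}


def format_for(path):
    """Return the parser name for a file path, or raise if the suffix is unknown."""
    suffix = Path(path).suffix.lower()
    try:
        return SUFFIX_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(f"No known RDF format for suffix {suffix!r}") from None


def validate_paths(data_path, shapes_path, **kwargs):
    """Call pyshacl.validate with both graph formats stated explicitly."""
    # Imported here so format_for can be used without pyshacl installed.
    from pyshacl import validate

    return validate(
        str(data_path),
        shacl_graph=str(shapes_path),
        data_graph_format=format_for(data_path),
        shacl_graph_format=format_for(shapes_path),
        **kwargs,
    )
```

With this, validate_paths(PATH_1_NT, PATH_2_TTL, allow_warnings=True) would pass "ntriples" and "turtle" without either format being hard-coded at the call site.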

Unrelatedly:

> Note I have to use 0.28.1 as I am on rdflib 7.0.

Curious, is there something in the rdflib 7.1 release that's preventing you from upgrading?

henrieglesorotos commented 6 days ago

Thanks @ashleysommer

I was getting the same error when specifying the data_graph_format kwarg. However, I didn't realise that shacl_graph_format was a thing! I can try that.

For now I am just loading the graphs into memory using rdflib.
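A minimal sketch of that in-memory approach, assuming the data file is N-Triples and the shapes file is Turtle. The function name validate_in_memory is hypothetical. Because both graphs are parsed up front with an explicit format, validate() receives rdflib.Graph objects and has no format left to detect.

```python
def validate_in_memory(data_path, shapes_path, **kwargs):
    """Parse both files with rdflib, then validate the in-memory graphs."""
    # Third-party imports are kept local so that importing this module
    # does not itself require rdflib or pyshacl.
    from rdflib import Graph
    from pyshacl import validate

    data_g = Graph()
    data_g.parse(str(data_path), format="nt")  # data graph: N-Triples

    shapes_g = Graph()
    shapes_g.parse(str(shapes_path), format="turtle")  # shapes graph: Turtle

    # Returns the usual (conforms, results_graph, results_text) tuple.
    return validate(data_g, shacl_graph=shapes_g, **kwargs)
```

The trade-off is that both files are fully loaded before validation starts, which is fine for small graphs but may matter for very large ones.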

As for what's preventing me from upgrading: it's just the effort involved in bumping a couple of internal deps we have. Nothing major.

ashleysommer commented 6 days ago

> Thanks @ashleysommer
>
> I was getting the same error by specifying data_graph_format kwarg. However, I didn't realise that shacl_graph_format was a thing! I can try that.

All of the arguments available for validate() are listed in the README file, under "Python Module Use".

Specifically, data_graph_format and shacl_graph_format are under the heading "Some other optional keywords".