acdh-oeaw / rdfproxy

Functionality for mapping SPARQL query result sets to Pydantic models
GNU General Public License v3.0
2 stars 0 forks source link

Query Sanity Checking #116

Open lu-pl opened 1 month ago

lu-pl commented 1 month ago

rdfproxy should run certain checks on queries to fail as fast as possible in case of inapplicable queries to avoid inadvertent side effects or simply crashing later in the program flow.

It might be advisable to parse incoming queries with an external parsing tool which raise ParsingExceptions early.

Note that the rdflib SPARQL parser is not fully reliable for this, see #2960. rdflib.plugins.sparql.parser.parseQuery can however catch SPARQL syntax errors.

rdfproxy is designed to handle SELECT queries, passing anything but a SELECT query to a SPARQLModelAdapter will certainly crash (somewhere) but still might have inadvertent side effects.

So passing anything but a SELECT query should crash the program as fast as possible.

A SELECT query predicate can be defined using the rdflib SPARQL parser plugin:

from rdflib.plugins.sparql.parser import parseQuery

def _is_select_query(query: str) -> bool:
    _, query_type = parseQuery(query)
    return query_type.name == "SelectQuery"

Another option would be to use SPARQLWrapper internals for determining the query type (see SPARQLWrapper.Wrapper._parseQueryType l597) or to utilize the compiled regex patternSPARQLWrapper.pattern (see SPARQLWrapper.SPARQLWrapper l267)

I prefer the rdflib solution.

See #126.

lu-pl commented 6 days ago

Note: Currently, it is not entirely clear to me, where exactly this logic should run.

Options are e.g.

This would ensure that the query gets checked as soon as possible. Also model sanity checking according to #108 could run in the initializer, so query and model checking would both happen in one place and as soon as possible.

class SPARQLModelAdapter(Generic[_TModelInstance]):
    # ... 
    def __init__(
        self, target: str | SPARQLWrapper, query: str, model: type[_TModelInstance]
    ) -> None:
        self._query = get_checked_query(query)
        self._model = get_checked_model(model)
    # ...

This would encapsulate all query logic in a single class and would also communicate clearly, that QueryConstructor is designed to handle sanitized and checked queries only.