Open lu-pl opened 4 weeks ago
A next-level example (with two array fields in the model, one of which can also be empty) for testing as soon as the basic example in https://github.com/acdh-oeaw/rdfproxy/issues/57#issue-2467510363 works.
A note for future query writers: all array fields which can be empty need to go in an `OPTIONAL {}` clause. Whether rdfproxy should return such fields as `[]` or `null` is a question for a future date.
A SPARQL result like this (Wikidata query):
gnd | nameLabel | educated_atLabel | work_name |
---|---|---|---|
119359464 | Schindel | | Gebürtig |
115612815 | Geiger | University of Vienna | Der alte König in seinem Exil |
115612815 | Geiger | University of Vienna | Unter der Drachenwand |
1136992030 | Edelbauer | University of Vienna | Das flüssige Land |
1136992030 | Edelbauer | University of Applied Arts Vienna | Das flüssige Land |
and model:
from typing import Annotated

from pydantic import BaseModel
from rdfproxy import from_sparql

class Work(BaseModel):
    name: Annotated[str, from_sparql(binding="work_name")]

class Person(BaseModel):
    gnd: str
    surname: Annotated[str, from_sparql(binding="nameLabel")]
    education: list[Annotated[str, from_sparql(binding="educated_atLabel")]]
    works: list[Work]
should result in
[
  {
    "gnd": "119359464",
    "surname": "Schindel",
    "education": [],
    "works": [
      {
        "name": "Gebürtig"
      }
    ]
  },
  {
    "gnd": "115612815",
    "surname": "Geiger",
    "education": ["University of Vienna"],
    "works": [
      {
        "name": "Der alte König in seinem Exil"
      },
      {
        "name": "Unter der Drachenwand"
      }
    ]
  },
  {
    "gnd": "1136992030",
    "surname": "Edelbauer",
    "education": ["University of Vienna", "University of Applied Arts Vienna"],
    "works": [
      {
        "name": "Das flüssige Land"
      }
    ]
  }
]
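The merge described above can be sketched in plain Python, without rdfproxy or pydantic. The helper name `group_rows` and the row format (one dict per SPARQL result row, with unbound `OPTIONAL` variables represented as `None`) are assumptions made for illustration; this is not how rdfproxy implements grouping.

```python
# Rows follow the result set above; Schindel has no bound education value.
rows = [
    {"gnd": "119359464", "nameLabel": "Schindel",
     "educated_atLabel": None, "work_name": "Gebürtig"},
    {"gnd": "115612815", "nameLabel": "Geiger",
     "educated_atLabel": "University of Vienna",
     "work_name": "Der alte König in seinem Exil"},
    {"gnd": "115612815", "nameLabel": "Geiger",
     "educated_atLabel": "University of Vienna",
     "work_name": "Unter der Drachenwand"},
    {"gnd": "1136992030", "nameLabel": "Edelbauer",
     "educated_atLabel": "University of Vienna",
     "work_name": "Das flüssige Land"},
    {"gnd": "1136992030", "nameLabel": "Edelbauer",
     "educated_atLabel": "University of Applied Arts Vienna",
     "work_name": "Das flüssige Land"},
]


def group_rows(rows):
    """Hypothetical helper: merge flat rows into one entry per gnd value."""
    grouped = {}
    for row in rows:
        person = grouped.setdefault(
            row["gnd"],
            {"gnd": row["gnd"], "surname": row["nameLabel"],
             "education": [], "works": []},
        )
        # Unbound OPTIONAL values are skipped, so the list stays empty.
        education = row.get("educated_atLabel")
        if education is not None and education not in person["education"]:
            person["education"].append(education)
        # Two parallel arrays produce a cross product, hence the duplicate check.
        work_name = row.get("work_name")
        if work_name is not None and {"name": work_name} not in person["works"]:
            person["works"].append({"name": work_name})
    return list(grouped.values())


result = group_rows(rows)
```

The duplicate check is the important part: because `education` and `works` vary independently, every combination appears as its own row, and a naive append would list "Das flüssige Land" twice for Edelbauer.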
Since grouping/array merging will be a very frequent use-case/route, it might be nice to be able to already mark the default 'grouping field' of a model inside the annotation (I've called it `id` since this is how the frontend might think about it, but any other parameter name will do):
from typing import Annotated

from pydantic import BaseModel
from rdfproxy import from_sparql

class Person(BaseModel):
    wikidataid: Annotated[str, from_sparql(binding="person", id=True)]
    name: Annotated[str, from_sparql(binding="personLabel")]
    jobs: list[Annotated[str, from_sparql(binding="jobLabel")]]
# all the Franz Mayers and their jobs
# if you group this result by anything but the wikidata entity/id, you're doomed!
SELECT ?person ?personLabel ?jobLabel
WHERE {
  ?person wdt:P735 wd:Q4925932; # Franz
          wdt:P734 wd:Q13725587; # Mayer
          wdt:P106 ?job.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
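To make the "you're doomed" point concrete, here is a small sketch comparing grouping by entity against grouping by label. The rows are made up for illustration (the `wd:EXAMPLE_*` identifiers and job labels are placeholders, not actual Wikidata results), and `group_jobs` is a hypothetical helper, not part of rdfproxy.

```python
def group_jobs(rows, key):
    """Hypothetical helper: collect jobLabel values per distinct value of `key`."""
    grouped = {}
    for row in rows:
        entry = grouped.setdefault(
            row[key],
            {"wikidataid": row["person"], "name": row["personLabel"], "jobs": []},
        )
        if row["jobLabel"] not in entry["jobs"]:
            entry["jobs"].append(row["jobLabel"])
    return list(grouped.values())


# Made-up rows: two different people who share the same label.
rows = [
    {"person": "wd:EXAMPLE_1", "personLabel": "Franz Mayer", "jobLabel": "politician"},
    {"person": "wd:EXAMPLE_1", "personLabel": "Franz Mayer", "jobLabel": "farmer"},
    {"person": "wd:EXAMPLE_2", "personLabel": "Franz Mayer", "jobLabel": "painter"},
]

by_id = group_jobs(rows, "person")          # two separate persons
by_label = group_jobs(rows, "personLabel")  # both persons collapsed into one
```

Grouping by `personLabel` silently merges the two people into a single entry with three jobs, which is why the grouping key should be the entity identifier.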
It should also be noted that not just the root model but also nested models might need an explicit `id`!
Note: The field by which to group a `list[BaseModel]` field must be made explicit in order to achieve the intended behavior. This could be done by a query parameter (e.g. by repurposing the `group_by` parameter already implemented in the `SPARQLModelAdapter.query` method), or grouping could be specified in the model itself.
Specifying the group value in the model might be the better solution here because this would allow triggering grouping logic in nested models as well.
Sketch for grouping:
class Work(BaseModel):
    title: str

class Person(BaseModel):
    name: str
    works: Annotated[list[Work], SPARQLGrouping("name")]
There could be a `SPARQLGroup` as well as a `FieldGroup` type for grouping by a SPARQL binding value or model field value respectively.
Grouping types must only be legal for `list[BaseModel]` annotations.
Based on SPARQL bindings like
[
  {"name": "x", "title": "a"},
  {"name": "x", "title": "b"},
  {"name": "y", "title": "c"}
]
the above should result in
[
  {
    "name": "x",
    "works": [
      { "title": "a" },
      { "title": "b" }
    ]
  },
  {
    "name": "y",
    "works": [
      { "title": "c" }
    ]
  }
]
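How such a marker could be discovered at runtime can be sketched with the standard library alone. `SPARQLGroup` here is a stand-in defined in the snippet itself (the proposed type does not exist in rdfproxy at this point), `find_group_markers` is a hypothetical helper, and dataclasses replace pydantic models only to keep the example dependency-free.

```python
from dataclasses import dataclass
from typing import Annotated, get_args, get_origin, get_type_hints


@dataclass(frozen=True)
class SPARQLGroup:
    """Stand-in for the proposed marker: names the binding to group by."""
    binding: str


@dataclass
class Work:
    title: str


@dataclass
class Person:
    name: str
    works: Annotated[list[Work], SPARQLGroup("name")]


def find_group_markers(model):
    """Hypothetical helper: map field names to their SPARQLGroup binding."""
    markers = {}
    # include_extras=True keeps the Annotated metadata instead of stripping it.
    for field_name, hint in get_type_hints(model, include_extras=True).items():
        if get_origin(hint) is Annotated:
            for meta in get_args(hint)[1:]:
                if isinstance(meta, SPARQLGroup):
                    markers[field_name] = meta.binding
    return markers


markers = find_group_markers(Person)
```

Restricting the marker to `list[BaseModel]` annotations, as required above, would be an additional check on the first element of `get_args(hint)`.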
In the case of a model with several list fields I can't think of a use-case where the lists should be grouped by different binding variables; it's basically always grouping by the 'id' of the parent model. So in order to avoid duplicate group-specifications like this:
class Work(BaseModel):
    title: str

class Institution(BaseModel):
    name: str

class Person(BaseModel):
    name: str
    works: Annotated[list[Work], SPARQLGroup("name")]
    education: Annotated[list[Institution], SPARQLGroup("name")]
I think the grouping variable aka id should be marked once on the model itself, not in the annotations of the `list`s.
On second thought, the grouping variable will typically be a URI identifier which is not always relevant for the frontend, so it should be possible to group by a binding variable without necessarily having to include it as a model field. How about introducing a magic word like `__id__` (or `__groupingvariable__` or whatever)?
class Work(BaseModel):
    title: str

class Institution(BaseModel):
    name: str

class Person(BaseModel):
    # just used for grouping any list fields, not actually included in API result
    __id__: from_sparql(binding='person')
    name: Annotated[str, from_sparql(binding='personLabel')]
    works: list[Work]
    education: list[Institution]
Valid. `ClassVar` and "private" attributes are excluded from the model, so this wouldn't be a problem for model serialization, see "Automatically excluded attributes" in the Pydantic documentation.
I would prefer calling the attribute `__grouping__` or something though; "id" is imo misleading because the value of this attribute will simply be a SPARQL binding name (not a binding value) and not something ID-y.
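The general idea of a class-level attribute that stays out of the serialized data can be tried without pydantic. This sketch swaps in standard-library dataclasses, which also skip `ClassVar` annotations when collecting fields; it illustrates the convention only and says nothing about how pydantic or rdfproxy handle it internally. The attribute name `__grouping__` is the hypothetical one suggested above.

```python
from dataclasses import asdict, dataclass, fields
from typing import ClassVar


@dataclass
class Person:
    # Class-level grouping hint: holds a SPARQL binding *name*, not a value.
    __grouping__: ClassVar[str] = "person"
    name: str


franz = Person(name="Franz Mayer")

serialized = asdict(franz)                      # only real fields are included
field_names = [f.name for f in fields(Person)]  # ClassVar is not a field
grouping_binding = Person.__grouping__          # still readable on the class
```

A grouping implementation could read `Person.__grouping__` to decide which binding to group by while the API result contains only `name`.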
Grouping by model class attribute also raises the question about library interface consistency.
The solution with `typing.Annotated` would be neat since `typing.Annotated` is a sane choice for the explicit allocation behavior from #56 and is already implemented there.
Using `typing.Annotated` for triggering grouping would also be more explicit, although potentially redundant.
I am not in favor of anything right now.
New example test case which features both parallel and nested lists:
# Austrian writers and their books
SELECT ?gnd ?nameLabel ?educated_atLabel ?work_name ?work ?viaf
WHERE {
  ?author wdt:P106 wd:Q36180; # is writer
          wdt:P27 wd:Q40; # nationality Austrian
          wdt:P734 ?name;
          wdt:P800 ?work;
          wdt:P227 ?gnd;
          wdt:P569 ?dob.
  ?work wdt:P1476 ?work_name.
  OPTIONAL { ?work wdt:P214 ?viaf. }
  FILTER (?gnd = "119359464" || ?gnd = "1136992030" || ?gnd = "115612815")
  OPTIONAL { ?author wdt:P69 ?educated_at. }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
from typing import Annotated

from pydantic import BaseModel
from rdfproxy import SPARQLBinding

class Work(BaseModel):
    name: Annotated[str, SPARQLBinding("work_name")]
    # another list of a primitive type
    viafs: list[Annotated[str, SPARQLBinding("viaf")]]

class Author(BaseModel):
    gnd: str
    surname: Annotated[str, SPARQLBinding("nameLabel")]
    works: list[Work]
    # list of primitive types! might have to be implemented as a BaseModel class which
    # resolves to a single JSON string (rather than an object with a single string field)
    education: list[Annotated[str, SPARQLBinding("educated_atLabel")]]
Expected output:
[
  {
    "gnd": "119359464",
    "surname": "Schindel",
    "education": [],
    "works": [
      {
        "name": "Gebürtig",
        "viafs": []
      }
    ]
  },
  {
    "gnd": "115612815",
    "surname": "Geiger",
    "education": ["University of Vienna"],
    "works": [
      {
        "name": "Der alte König in seinem Exil",
        "viafs": ["299260555", "6762154387354230970008"]
      },
      {
        "name": "Unter der Drachenwand",
        "viafs": ["2277151717053313900002"]
      }
    ]
  },
  {
    "gnd": "1136992030",
    "surname": "Edelbauer",
    "education": ["University of Vienna", "University of Applied Arts Vienna"],
    "works": [
      {
        "name": "Das flüssige Land",
        "viafs": []
      }
    ]
  }
]
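Nested lists add a second grouping level inside each author. The following plain-Python sketch shows one way to get there; `group_authors` is a hypothetical helper, the rows are reconstructed from the expected output above rather than taken from a real query run, and works are keyed by `work_name` only for brevity (the `?work` IRI would be the more robust key).

```python
GEIGER = {"gnd": "115612815", "nameLabel": "Geiger",
          "educated_atLabel": "University of Vienna"}

# Rows reconstructed from the expected output; None marks unbound OPTIONALs.
rows = [
    {"gnd": "119359464", "nameLabel": "Schindel", "educated_atLabel": None,
     "work_name": "Gebürtig", "viaf": None},
    {**GEIGER, "work_name": "Der alte König in seinem Exil",
     "viaf": "299260555"},
    {**GEIGER, "work_name": "Der alte König in seinem Exil",
     "viaf": "6762154387354230970008"},
    {**GEIGER, "work_name": "Unter der Drachenwand",
     "viaf": "2277151717053313900002"},
]


def group_authors(rows):
    """Hypothetical helper: group by gnd, then by work_name within each author."""
    authors = {}
    for row in rows:
        author = authors.setdefault(
            row["gnd"],
            {"gnd": row["gnd"], "surname": row["nameLabel"],
             "education": [], "works": {}},
        )
        education = row.get("educated_atLabel")
        if education is not None and education not in author["education"]:
            author["education"].append(education)
        # Inner grouping level: one entry per work, collecting its viafs.
        work = author["works"].setdefault(
            row["work_name"], {"name": row["work_name"], "viafs": []}
        )
        viaf = row.get("viaf")
        if viaf is not None and viaf not in work["viafs"]:
            work["viafs"].append(viaf)
    # Replace the intermediate per-work dicts by plain lists.
    return [{**a, "works": list(a["works"].values())} for a in authors.values()]


authors = group_authors(rows)
```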
A model for the same query with nested BaseModels instead of primitive lists:
from typing import Annotated

from pydantic import BaseModel
from rdfproxy import SPARQLBinding

class Institution(BaseModel):
    name: Annotated[str, SPARQLBinding("educated_atLabel")]

class Viaf(BaseModel):
    num: Annotated[str, SPARQLBinding("viaf")]

class Work(BaseModel):
    class Config:
        group_by = "work_name"

    name: Annotated[str, SPARQLBinding("work_name")]
    viafs: list[Viaf]

class Author(BaseModel):
    class Config:
        group_by = "nameLabel"

    gnd: str
    surname: Annotated[str, SPARQLBinding("nameLabel")]
    works: list[Work]
    education: list[Institution]
Note: Union types + default with `typing.Annotated` are expressed e.g. like so:
class Institution(BaseModel):
    name: Annotated[str | None, SPARQLBinding("educated_atLabel")] = None
@kevinstadler This is why we couldn't get the example to work today. 🙈
The current implementation generates
[
  {
    "gnd": "119359464",
    "surname": "Schindel",
    "works": [
      {
        "name": "Geb\u00fcrtig"
      }
    ],
    "education": [
      {
        "name": null
      }
    ]
  },
  {
    "gnd": "115612815",
    "surname": "Geiger",
    "works": [
      {
        "name": "Der alte K\u00f6nig in seinem Exil"
      },
      {
        "name": "Unter der Drachenwand"
      }
    ],
    "education": [
      {
        "name": "University of Vienna"
      }
    ]
  },
  {
    "gnd": "1136992030",
    "surname": "Edelbauer",
    "works": [
      {
        "name": "Das fl\u00fcssige Land"
      }
    ],
    "education": [
      {
        "name": "University of Vienna"
      },
      {
        "name": "University of Applied Arts Vienna"
      }
    ]
  }
]
so an immediate todo is the array collection behavior for `list` field types (other than `list[BaseModel]`).
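One visible gap in the generated output is the `{"name": null}` entry where an empty `education` list was expected. As a stopgap, a clean-up pass over the serialized result could drop such entries. The function name `drop_empty_items` and the rule "a list item is empty if it is `None` or a dict whose values are all `None`" are assumptions for this sketch, not existing rdfproxy behavior.

```python
def drop_empty_items(value):
    """Recursively remove list items that carry no data (assumed rule, see above)."""
    if isinstance(value, dict):
        return {key: drop_empty_items(val) for key, val in value.items()}
    if isinstance(value, list):
        cleaned = [drop_empty_items(item) for item in value]
        return [
            item for item in cleaned
            if item is not None
            and not (isinstance(item, dict)
                     and all(v is None for v in item.values()))
        ]
    return value


# First entry of the generated output shown above.
schindel = {
    "gnd": "119359464",
    "surname": "Schindel",
    "works": [{"name": "Gebürtig"}],
    "education": [{"name": None}],
}

cleaned = drop_empty_items(schindel)
```

This only hides the symptom after the fact; nested objects that legitimately consist of nothing but `null` values would be dropped as well, so proper handling in the grouping step is preferable.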
Intended behavior:
Given a SPARQL result set like the following
defining a model like
should result in
Notes
Currently, similar behavior is available with `SPARQLModelAdapter.query` when supplying the `group_by` parameter:
This feature would require an alternative code path when an Iterable/a list of another model type is encountered as a model field type.
Careful consideration must be given to how exactly this should be implemented and how this could affect the current implementation.
In particular, it must be decided when the field-based grouping logic should run. Running it as a post-processing hook would be a less invasive way to achieve the desired behavior and should scale reasonably well performance-wise.