acdh-oeaw / rdfproxy

GNU General Public License v3.0

Feature: Automatic grouping behavior based on model fields #57

Open lu-pl opened 4 weeks ago

lu-pl commented 4 weeks ago

Intended behavior:

Given a SPARQL result set like the following

name	title
Thomas Bernhard	Der Untergeher
Thomas Bernhard	Auslöschung
Thomas Bernhard	Korrektur
Thomas Bernhard	Holzfällen
Thomas Bernhard	Heldenplatz

defining a model like

from pydantic import BaseModel

class Work(BaseModel):
    title: str

class Person(BaseModel):
    name: str
    works: list[Work]

should result in

[
  {
    "name": "Thomas Bernhard",
    "works": [
      { "title": "Der Untergeher" },
      { "title": "Auslöschung" },
      { "title": "Korrektur" },
      { "title": "Holzfällen" },
      { "title": "Heldenplatz" }
    ]
  }
]

Notes

Currently, similar behavior is available with SPARQLModelAdapter.query when supplying the group_by parameter:

{
    "Thomas Bernhard": [
        {
            "name": "Thomas Bernhard",
            "work": {
                "title": "Der Untergeher"
            }
        },
        {
            "name": "Thomas Bernhard",
            "work": {
                "title": "Ausl\u00f6schung"
            }
        },
        {
            "name": "Thomas Bernhard",
            "work": {
                "title": "Korrektur"
            }
        },
        {
            "name": "Thomas Bernhard",
            "work": {
                "title": "Holzf\u00e4llen"
            }
        },
        {
            "name": "Thomas Bernhard",
            "work": {
                "title": "Heldenplatz"
            }
        }
    ]
}

This feature would require an alternative code path when an Iterable/a list of another model type is encountered as a model field type.

Careful consideration must be given to how exactly this should be implemented and how this could affect the current implementation.

In particular, it must be decided when the field-based grouping logic should run. Running it as a post-processing hook would be a less invasive way to achieve the desired behavior and should scale reasonably well performance-wise.
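A minimal sketch of what such a post-processing step could look like for the basic example. It works on plain dicts rather than pydantic models to stay dependency-free, and the function name `group_rows` and its parameters are made up for illustration; nothing here is part of rdfproxy.

```python
def group_rows(rows, key, list_field, item_fields):
    """Collapse flat result rows into one dict per distinct `key` value."""
    grouped = {}  # dicts keep insertion order, so first-seen order is preserved
    for row in rows:
        # Create the parent entry on first sight of this key value.
        entry = grouped.setdefault(row[key], {key: row[key], list_field: []})
        # Every row contributes one item to the parent's list field.
        entry[list_field].append({field: row[field] for field in item_fields})
    return list(grouped.values())


rows = [
    {"name": "Thomas Bernhard", "title": "Der Untergeher"},
    {"name": "Thomas Bernhard", "title": "Auslöschung"},
    {"name": "Thomas Bernhard", "title": "Korrektur"},
]
people = group_rows(rows, key="name", list_field="works", item_fields=["title"])
```

Since this only runs over the already fetched rows, it would not need to touch the query or model instantiation code paths.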

kevinstadler commented 3 weeks ago

A next-level example (with two array fields in the model, one of which can also be empty) for testing as soon as the basic example in https://github.com/acdh-oeaw/rdfproxy/issues/57#issue-2467510363 works.

A note for future query writers: all array fields which can be empty need to go in an OPTIONAL {} clause. Whether rdfproxy should return such fields as [] or null is a question for a future date.

Example

A SPARQL result like this (Wikidata query):

gnd	nameLabel	educated_atLabel	work_name
119359464	Schindel		Gebürtig
115612815	Geiger	University of Vienna	Der alte König in seinem Exil
115612815	Geiger	University of Vienna	Unter der Drachenwand
1136992030	Edelbauer	University of Vienna	Das flüssige Land
1136992030	Edelbauer	University of Applied Arts Vienna	Das flüssige Land

and model:

from typing import Annotated

from pydantic import BaseModel
from rdfproxy import from_sparql

class Work(BaseModel):
    name: Annotated[str, from_sparql(binding="work_name")]

class Person(BaseModel):
    gnd: str
    surname: Annotated[str, from_sparql(binding="nameLabel")]
    education: list[Annotated[str, from_sparql(binding="educated_atLabel")]]
    works: list[Work]

should result in

[
  {
    "gnd": "119359464",
    "surname": "Schindel",
    "education": [],
    "works": [
      {
        "name": "Gebürtig"
      }
    ]
  },
  {
    "gnd": "115612815",
    "surname": "Geiger",
    "education": ["University of Vienna"],
    "works": [
      {
        "name": "Der alte König in seinem Exil"
      },
      {
        "name": "Unter der Drachenwand"
      }
    ]
  },
  {
    "gnd": "1136992030",
    "surname": "Edelbauer",
    "education": ["University of Vienna", "University of Applied Arts Vienna"],
    "works": [
      {
        "name": "Das flüssige Land"
      }
    ]
  }
]
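For testing, the merge logic for this example could be sketched roughly as follows. This is hand-written for the one model above and uses plain dicts; it assumes that an unbound OPTIONAL binding arrives as None and that repeated values caused by the row cross product should be collapsed. The function name `merge_person_rows` is hypothetical.

```python
ROWS = [
    {"gnd": "119359464", "nameLabel": "Schindel", "educated_atLabel": None, "work_name": "Gebürtig"},
    {"gnd": "115612815", "nameLabel": "Geiger", "educated_atLabel": "University of Vienna", "work_name": "Der alte König in seinem Exil"},
    {"gnd": "115612815", "nameLabel": "Geiger", "educated_atLabel": "University of Vienna", "work_name": "Unter der Drachenwand"},
    {"gnd": "1136992030", "nameLabel": "Edelbauer", "educated_atLabel": "University of Vienna", "work_name": "Das flüssige Land"},
    {"gnd": "1136992030", "nameLabel": "Edelbauer", "educated_atLabel": "University of Applied Arts Vienna", "work_name": "Das flüssige Land"},
]


def merge_person_rows(rows):
    """Group rows by gnd and collect both list fields without duplicates."""
    people = {}
    for row in rows:
        person = people.setdefault(
            row["gnd"],
            {"gnd": row["gnd"], "surname": row["nameLabel"], "education": [], "works": []},
        )
        school = row.get("educated_atLabel")
        # Skip unbound values (assumed to be None) and repeated values.
        if school is not None and school not in person["education"]:
            person["education"].append(school)
        work = {"name": row["work_name"]}
        if work not in person["works"]:
            person["works"].append(work)
    return list(people.values())


people = merge_person_rows(ROWS)
```

Here an empty array field comes out as [], which is only one of the two options mentioned above.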

kevinstadler commented 3 weeks ago

Since grouping/array merging will be a very frequent use-case/route, it might be nice to be able to already mark the default 'grouping field' of a model inside the annotation (I've called it id since this is how the frontend might think about it, but any other parameter name will do):

from typing import Annotated

from pydantic import BaseModel
from rdfproxy import from_sparql

class Person(BaseModel):
    wikidataid: Annotated[str, from_sparql(binding="person", id=True)]
    name: Annotated[str, from_sparql(binding="personLabel")]
    jobs: list[Annotated[str, from_sparql(binding="jobLabel")]]

example query:

# all the Franz Mayers and their jobs
# if you group this result by anything but the wikidata entity/id, you're doomed!
SELECT ?person ?personLabel ?jobLabel

WHERE {
   ?person wdt:P735 wd:Q4925932; # Franz
           wdt:P734 wd:Q13725587; # Mayer
           wdt:P106 ?job.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

It should also be noted that not just the root model but also nested models might need an explicit id!
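To illustrate why the grouping key matters for this query, here is a small sketch with invented bindings: the example.org URIs and the job labels are placeholders, not actual Wikidata results, and `jobs_by` is a hypothetical helper.

```python
from collections import defaultdict

# Made-up bindings: two different people who happen to share the same label.
rows = [
    {"person": "http://example.org/person/1", "personLabel": "Franz Mayer", "jobLabel": "politician"},
    {"person": "http://example.org/person/1", "personLabel": "Franz Mayer", "jobLabel": "lawyer"},
    {"person": "http://example.org/person/2", "personLabel": "Franz Mayer", "jobLabel": "painter"},
]


def jobs_by(rows, binding):
    """Collect the job labels per distinct value of the given binding."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[binding]].append(row["jobLabel"])
    return dict(groups)


by_label = jobs_by(rows, "personLabel")  # merges both people into one group
by_id = jobs_by(rows, "person")          # keeps the two people apart
```

Grouping by the label yields a single "Franz Mayer" holding all three jobs, while grouping by the entity URI yields two separate groups.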

lu-pl commented 3 weeks ago

Note: The field by which to group a list[BaseModel] field must be made explicit in order to achieve the intended behavior. This could be done by a query parameter (e.g. by repurposing the group_by parameter already implemented in the SPARQLModelAdapter.query method) or grouping could be specified in the model itself.

Specifying the group value in the model might be the better solution here because this would allow triggering grouping logic in nested models as well.

lu-pl commented 3 weeks ago

Sketch for grouping:

class Work(BaseModel):
    title: str

class Person(BaseModel):
    name: str
    works: Annotated[list[Work], SPARQLGrouping("name")]

There could be a SPARQLGroup as well as a FieldGroup type for grouping by a SPARQL binding value or model field value respectively.

Grouping types must only be legal for list[BaseModel] annotations.

Based on SPARQL bindings like

[
    {"name": "x", "title": "a"},
    {"name": "x", "title": "b"},
    {"name": "y", "title": "c"}
]

the above should result in

[
  {
    "name": "x",
    "works": [
      { "title": "a" },
      { "title": "b" }
    ]
  },
  {
    "name": "y",
    "works": [
      { "title": "c" }
    ]
  }
]
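A rough sketch of how a grouping marker in typing.Annotated could be discovered and validated, using only the standard library. `SPARQLGroup` here is a stand-in dataclass for the proposed marker, the models are plain classes instead of pydantic BaseModels, and `find_groupings` is a hypothetical helper; the check only enforces that the annotated type is a list.

```python
from dataclasses import dataclass
from typing import Annotated, get_args, get_origin, get_type_hints


@dataclass(frozen=True)
class SPARQLGroup:
    """Hypothetical marker naming the SPARQL binding to group by."""
    binding: str


class Work:
    title: str


class Person:
    name: str
    works: Annotated[list[Work], SPARQLGroup("name")]


def find_groupings(model):
    """Map each field carrying a SPARQLGroup marker to its binding name."""
    found = {}
    # include_extras=True keeps the Annotated metadata in the returned hints.
    for field, hint in get_type_hints(model, include_extras=True).items():
        if get_origin(hint) is not Annotated:
            continue
        base, *metadata = get_args(hint)
        for meta in metadata:
            if isinstance(meta, SPARQLGroup):
                if get_origin(base) is not list:
                    raise TypeError(f"{field}: grouping is only legal on list fields")
                found[field] = meta.binding
    return found


groupings = find_groupings(Person)
```

For the sketch above this yields a mapping from "works" to "name", which a grouping step could then use as its key.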

kevinstadler commented 3 weeks ago

In the case of a model with several list fields I can't think of a use-case where the lists should be grouped by different binding variables; it's basically always grouping by the 'id' of the parent model. So in order to avoid duplicate group-specifications like this:

class Work(BaseModel):
    title: str

class Institution(BaseModel):
    name: str

class Person(BaseModel):
    name: str
    works: Annotated[list[Work], SPARQLGroup("name")]
    education: Annotated[list[Institution], SPARQLGroup("name")]

I think the grouping variable aka id should be marked once on the model itself, not in the annotations of the lists.

On second thought, the grouping variable will typically be a URI identifier which is not always relevant for the frontend, so it should be possible to group by a binding variable without necessarily having to include it as a model field. How about introducing a magic word like __id__ (or __groupingvariable__ or whatever)?

class Work(BaseModel):
    title: str

class Institution(BaseModel):
    name: str

class Person(BaseModel):
    # just used for grouping any list fields, not actually included in API result
    __id__: from_sparql(binding='person')

    name: Annotated[str, from_sparql(binding='personLabel')]
    works: list[Work]
    education: list[Institution]

lu-pl commented 3 weeks ago

Valid.

ClassVar and "private" attributes are excluded from the model, so this wouldn't be a problem for model serialization, see Automatically excluded attributes.

I would prefer calling the attribute __grouping__ or something though, "id" is imo misleading because the value of this attribute will simply be a SPARQL binding name (not a binding value) and not something ID-y.

lu-pl commented 3 weeks ago

Grouping by model class attribute also raises the question about library interface consistency.

The solution with typing.Annotated would be neat since typing.Annotated is a sane choice for the explicit allocation behavior from #56 and is already implemented there.

Using typing.Annotated for triggering grouping would also be more explicit, although potentially redundant.

I am not in favor of anything right now.

kevinstadler commented 1 week ago

New example test case which features both parallel and nested lists:

Wikidata query:

# Austrian writers and their books
SELECT ?gnd ?nameLabel ?educated_atLabel ?work_name ?work ?viaf
WHERE {
   ?author wdt:P106 wd:Q36180; # is writer
           wdt:P27 wd:Q40; # nationality Austrian
           wdt:P734 ?name;
           wdt:P800 ?work;
           wdt:P227 ?gnd;
           wdt:P569 ?dob.
   ?work wdt:P1476 ?work_name.
   OPTIONAL { ?work wdt:P214 ?viaf. }
   FILTER (?gnd = "119359464" || ?gnd = "1136992030" || ?gnd = '115612815')
   OPTIONAL { ?author wdt:P69 ?educated_at. }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

from typing import Annotated

from pydantic import BaseModel
from rdfproxy import SPARQLBinding

class Work(BaseModel):
    name: Annotated[str, SPARQLBinding("work_name")]
    # another list of a primitive type
    viafs: list[Annotated[str, SPARQLBinding("viaf")]]

class Author(BaseModel):
    gnd: str
    surname: Annotated[str, SPARQLBinding("nameLabel")]
    works: list[Work]

    # list of primitive types! might have to be implemented as a BaseModel class which
    # resolves to a single JSON string (rather than an object with a single string field)
    education: list[Annotated[str, SPARQLBinding("educated_atLabel")]]

expected output

[
  {
    "gnd": "119359464",
    "surname": "Schindel",
    "education": [],
    "works": [
      {
        "name": "Gebürtig",
        "viafs": []
      }
    ]
  },
  {
    "gnd": "115612815",
    "surname": "Geiger",
    "education": ["University of Vienna"],
    "works": [
      {
        "name": "Der alte König in seinem Exil",
        "viafs": ["299260555", "6762154387354230970008"]
      },
      {
        "name": "Unter der Drachenwand",
        "viafs": ["2277151717053313900002"]
      }
    ]
  },
  {
    "gnd": "1136992030",
    "surname": "Edelbauer",
    "education": ["University of Vienna", "University of Applied Arts Vienna"],
    "works": [
      {
        "name": "Das flüssige Land",
        "viafs": []
      }
    ]
  }
]

kevinstadler commented 1 week ago

A model for the same query with nested BaseModels instead of primitive lists:

from typing import Annotated

from pydantic import BaseModel
from rdfproxy import SPARQLBinding

class Institution(BaseModel):
    name: Annotated[str, SPARQLBinding("educated_atLabel")]

class Viaf(BaseModel):
    num: Annotated[str, SPARQLBinding("viaf")]

class Work(BaseModel):
    class Config:
        group_by = "work_name"

    name: Annotated[str, SPARQLBinding("work_name")]
    viafs: list[Viaf]

class Author(BaseModel):
    class Config:
        group_by = "nameLabel"

    gnd: str
    surname: Annotated[str, SPARQLBinding("nameLabel")]
    works: list[Work]
    education: list[Institution]
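A sketch of how per-model group_by values could drive nested grouping recursively. The spec dicts below stand in for the models (the keys "group_by", "scalars" and "lists" are invented for this illustration), the rows are placeholder data rather than the Wikidata result, and the leaf models are grouped by their own single binding, which is an assumption since the models above do not specify it.

```python
VIAF = {"group_by": "viaf", "scalars": {"num": "viaf"}, "lists": {}}
INSTITUTION = {"group_by": "educated_atLabel", "scalars": {"name": "educated_atLabel"}, "lists": {}}
WORK = {"group_by": "work_name", "scalars": {"name": "work_name"}, "lists": {"viafs": VIAF}}
AUTHOR = {
    "group_by": "nameLabel",
    "scalars": {"gnd": "gnd", "surname": "nameLabel"},
    "lists": {"works": WORK, "education": INSTITUTION},
}


def group_nested(rows, spec):
    """Group rows by the spec's binding, then recurse into each list field."""
    buckets = {}
    for row in rows:
        key = row.get(spec["group_by"])
        if key is None:
            continue  # unbound OPTIONAL binding: contributes no item
        buckets.setdefault(key, []).append(row)
    result = []
    for bucket in buckets.values():
        first = bucket[0]  # scalar fields are taken from the first row of a group
        item = {field: first.get(binding) for field, binding in spec["scalars"].items()}
        for field, sub_spec in spec["lists"].items():
            item[field] = group_nested(bucket, sub_spec)
        result.append(item)
    return result


# Placeholder rows: author A has one work with two viafs and one institution,
# author B has one work and neither viaf nor institution.
rows = [
    {"gnd": "g1", "nameLabel": "A", "work_name": "w1", "viaf": "v1", "educated_atLabel": "u1"},
    {"gnd": "g1", "nameLabel": "A", "work_name": "w1", "viaf": "v2", "educated_atLabel": "u1"},
    {"gnd": "g2", "nameLabel": "B", "work_name": "w2", "viaf": None, "educated_atLabel": None},
]
authors = group_nested(rows, AUTHOR)
```

Because each level only sees the rows of its parent group, duplicates from the row cross product collapse at every level and unbound bindings produce empty lists.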

lu-pl commented 1 week ago

Note: Union types with a default value are expressed with typing.Annotated e.g. like so

class Institution(BaseModel):
    name: Annotated[str | None, SPARQLBinding("educated_atLabel")] = None

@kevinstadler This is why we couldn't get the example to work today. 🙈

The current implementation generates

[
    {
        "gnd": "119359464",
        "surname": "Schindel",
        "works": [
            {
                "name": "Geb\u00fcrtig"
            }
        ],
        "education": [
            {
                "name": null
            }
        ]
    },
    {
        "gnd": "115612815",
        "surname": "Geiger",
        "works": [
            {
                "name": "Der alte K\u00f6nig in seinem Exil"
            },
            {
                "name": "Unter der Drachenwand"
            }
        ],
        "education": [
            {
                "name": "University of Vienna"
            }
        ]
    },
    {
        "gnd": "1136992030",
        "surname": "Edelbauer",
        "works": [
            {
                "name": "Das fl\u00fcssige Land"
            }
        ],
        "education": [
            {
                "name": "University of Vienna"
            },
            {
                "name": "University of Applied Arts Vienna"
            }
        ]
    }
]

so an immediate todo is the array collection behavior for list field types (other than list[BaseModel]).
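Separately, the {"name": null} entry in the output above presumably stems from the unbound OPTIONAL binding. If such placeholder items should not appear in the result, one possible post-filter is sketched below; `drop_unbound` is a hypothetical helper operating on plain dicts, not existing rdfproxy behavior.

```python
def drop_unbound(items):
    """Remove list items whose fields are all None."""
    return [
        item
        for item in items
        if any(value is not None for value in item.values())
    ]


education = drop_unbound([{"name": None}])
```

Items with at least one bound field are kept unchanged, so partially filled nested models would survive this filter.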