Creating custom collection

tino097 commented 5 months ago

I'm trying to implement the ICollection interface with the following code:


    # ICollection
    def get_collection_factories(self) -> dict[str, CollectionFactory]:

        return {
            'my-users': lambda n, p, **kwargs: cu.ApiListCollection(
                    n,
                    p,
                    data_factory=cu.ApiListData(action='user_list'))
        }

class UserSerializer(cu.Serializer):

    def serialize(self) -> list[dict[str, str]]:
        result = []
        for user in self.attached.data:
            result.append({
                "id": user.get("id"),
                "name": user.get("name"),
                "email": user.get("email"),
                "fullname": user.get("fullname")
            })
        return result

class UserCollection(cu.ApiCollection):

    def __init__(self, name: str, params: dict, **data_settings):
        super().__init__(name, params, **data_settings)
        self.serializer_factory = UserSerializer
        self.data_settings = data_settings

# def my_factory(name: str, params: dict, **kwargs: Any) -> UserCollection:
#     return UserCollection(name, params, **kwargs)

The collection gets registered and I'm able to select it in the explorer component, but when I select it I get the following error:

    data_factory=cu.ApiListData(action='user_list'))
TypeError: Data.__init__() missing 1 required positional argument: 'obj'

What am I missing to initialize the ApiListCollection?

Thanks

smotornyuk commented 5 months ago

Hi @tino097, sorry for the delay.

Your error is caused by data_factory. It must be the class itself, not an instance of it, and its parameters need to be passed separately as data_settings. So, the full version of my-users is:

{
  'my-users': lambda n, p, **kwargs: cu.ApiListCollection(
    n,
    p,
    data_factory=cu.ApiListData,
    data_settings={"action": 'user_list'})
}
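
To see why the original call failed, here is a minimal sketch with stand-in classes. `FakeData` and `FakeCollection` are hypothetical and only mimic the behaviour described above; they are not the real ckanext-collection classes. The assumption is that the collection calls the factory itself, passing the collection as the first argument and unpacking data_settings:

```python
from __future__ import annotations

from typing import Any


class FakeData:
    """Hypothetical stand-in for a data factory such as ApiListData."""

    def __init__(self, obj: Any, **settings: Any):
        # `obj` is the collection that owns this data object
        self.attached = obj
        self.action = settings.get("action")


class FakeCollection:
    """Hypothetical stand-in for a collection."""

    def __init__(self, name: str, params: dict[str, Any], **kwargs: Any):
        self.name = name
        self.params = params
        data_factory = kwargs.get("data_factory", FakeData)
        data_settings = kwargs.get("data_settings", {})
        # the collection instantiates the factory itself
        self.data = data_factory(self, **data_settings)


# Passing an instance fails before the collection is even built,
# because nobody supplied `obj`:
try:
    FakeCollection("my-users", {}, data_factory=FakeData(action="user_list"))
except TypeError as err:
    error = str(err)

# Passing the class plus its settings works:
users = FakeCollection(
    "my-users", {}, data_factory=FakeData, data_settings={"action": "user_list"}
)
```

The same reasoning applies to any factory argument: hand over the class and let the collection decide when to call it.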

I'll rewrite the README and add examples of collection creation at the beginning, before diving into the internals. If you can give me a couple of use cases, they would be good material for the documentation.

smotornyuk commented 5 months ago

And regarding UserCollection at the end of your code snippet: most likely, you want to register a collection that uses a custom serializer, and assigning data_settings is accidental.

For such a situation, where you only want to replace a factory, you can omit the constructor and assign the factory to the corresponding attribute:

class UserCollection(cu.ApiCollection):
    SerializerFactory = UserSerializer

The signature of the collection constructor is def __init__(self, name: str, params: dict, **kwargs):. The important part here is kwargs: note that it's not data_settings.

For example, when you build a collection with Collection(n, p, data_settings={}), I imagine, you want to access this data_settings, right? In this case, data_settings is kept inside kwargs:

class MyCollection(cu.ApiCollection):

    def __init__(self, name: str, params: dict, **kwargs):
        super().__init__(name, params, **kwargs)

        print("THIS IS DATA SETTINGS ->", kwargs.get("data_settings"))

And, with these comments, we can try building your collection. If you just want a user list that filters users using the q parameter and displays only id, name, email, and fullname, you need the following code:

# ICollection
    def get_collection_factories(self) -> dict[str, CollectionFactory]:

        return {
            'my-users': MyUserCollection,
        }

##### your implementation of UserSerializer is left unchanged ####

# ApiList and Api collections just override the data factory. We are going to do it
# ourselves, so there will be no difference if we just use a simple collection as a base class
class MyUserCollection(cu.Collection):

    # Data.with_attributes defines an anonymous class with specific attributes
    # overridden. If you are not going to use your custom data factory
    # elsewhere, this is the shortest possible syntax
    DataFactory = cu.ApiListData.with_attributes(action="user_list")

    SerializerFactory = UserSerializer

BTW, in your initial implementation, instead of cu.ApiListData(action='user_list'), which created an object and caused the error, you could use cu.ApiListData.with_attributes(action='user_list'), which would create a new class with a fixed value of action.
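
If it helps to picture what "create a new class with a fixed value" means, the sketch below shows one way such a helper could be written. `BaseData` is a hypothetical stand-in, and this is an assumption for illustration, not the library's actual implementation of with_attributes:

```python
from __future__ import annotations

from typing import Any


class BaseData:
    """Hypothetical stand-in for a data factory base class."""

    action = ""

    @classmethod
    def with_attributes(cls, **attributes: Any) -> type[BaseData]:
        # build a new subclass on the fly, with the given names
        # set as class attributes
        return type(cls.__name__, (cls,), attributes)


# the result is a class, not an instance, so it can be used as a factory
UserListData = BaseData.with_attributes(action="user_list")
```

The base class is left untouched, so several collections can derive differently configured factories from the same parent.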

tino097 commented 4 months ago

Thanks @smotornyuk

tino097 commented 4 months ago

Hey @smotornyuk

from ckanext.collection import internal, types
ImportError: cannot import name 'internal' from 'ckanext.collection'

I've pulled the latest from master.

smotornyuk commented 4 months ago

Thanks. I forgot to commit internal.py. It's now added to the repo, so the issue should be fixed in the latest commit.

BTW, I'm rewriting the documentation. At the moment, I've finished the pages above the red line. Everything below the red line is still in a draft state.

Mainly, I'm trying to explain things gradually with more examples. And there is one change: instead of importing everything like import ckanext.collection.utils as cu, it's recommended to import modules from the shared module and access items from them.

from ckanext.collection.shared import collection, data, serialize

#and use it like below
collection.Collection
data.ApiSearchData
serialize.CsvSerializer

tino097 commented 4 months ago

To confirm: if I want custom data, would I need to create my own action that returns the desired information?

Or could I use the ModelCollection for that purpose?

smotornyuk commented 4 months ago

Using ModelCollection is more efficient, but there are certain disadvantages.

If you use ModelCollection with a specific model from CKAN, you'll get all the records from the DB. Imagine that you create a ModelCollection for model.Package: you'll get public, private, deleted, and draft datasets at once. If you are showing this collection to admins only, it's OK. If you are filtering results from the collection before showing it to an anonymous user, it is also OK. But it's your responsibility to protect private data and show the collection only to people with the required access level.

If you are using API action instead of the model, all restrictions are handled inside the action. If you use ApiSearchCollection that takes data from package_search, package_search is called with the current user and gives you back only datasets that are accessible by the current user.

So, the answer is:

if only trusted users see the collection: use ModelCollection
if anyone sees the collection and you already have an action that hides private data: use the action
if anyone sees the collection and you don't have an action: you can either create an action and use it with the API collection, or extend ModelCollection and filter data inside it. As you have to implement this filtering logic anyway, it doesn't really matter where it is done.
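
For the last case, the filtering logic might look like the sketch below. It is plain Python over dictionaries, not the real ModelCollection API; `visible_to` and the checks on `state`, `private`, and `owner_org` are hypothetical and only show the kind of rules you become responsible for when you bypass API actions:

```python
from __future__ import annotations

from typing import Any, Iterable


def visible_to(
    records: Iterable[dict[str, Any]],
    user_orgs: set[str],
    is_sysadmin: bool = False,
) -> list[dict[str, Any]]:
    """Keep only the records that a user is allowed to see."""
    if is_sysadmin:
        return list(records)

    return [
        record
        for record in records
        # deleted and draft records are hidden from everyone else
        if record["state"] == "active"
        # private records are visible only to members of the owner org
        and (not record["private"] or record["owner_org"] in user_orgs)
    ]


datasets = [
    {"name": "public", "state": "active", "private": False, "owner_org": "a"},
    {"name": "secret", "state": "active", "private": True, "owner_org": "a"},
    {"name": "removed", "state": "deleted", "private": False, "owner_org": "b"},
]

anonymous = [d["name"] for d in visible_to(datasets, set())]
member = [d["name"] for d in visible_to(datasets, {"a"})]
```

Whether rules like these live in an action or inside the collection, they have to be applied before the data reaches an untrusted user.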

tino097 commented 4 months ago

My use cases are reports within CKAN, for example:

Get all users with a column for organization / group membership, or any showcases or apps
Report on datasets by user in orgs or groups

So there would be filtering and restrictions over some of the data, but I'm trying to figure out what the right path would be.

Thanks again

smotornyuk commented 4 months ago

Cool, another example for when I continue updating the documentation.

Here you can use models directly. It doesn't sound like you'll be able to reuse API actions that collect this data elsewhere, so creating them doesn't add much value. Here's the code that creates a collection of every user. The collection contains the user's ID, name, full name, and all groups and organizations of the user.

Example

```python
from __future__ import annotations

import sqlalchemy as sa

from ckan import model
from ckanext.collection.shared import collection, data, serialize

# aliases that required to select data from the same model twice, for `groups`
# column and for `organizations` column.
stmt_groups = sa.alias(model.Group, "groups")
stmt_orgs = sa.alias(model.Group, "organizations")

# Data factory that executes SQLAlchemy statement to compute data
# records. StatementSaData accepts `statement` attribute(sqlalchemy.sql.Select
# instance) and uses this statement to fetch data from DB. This is a low-level
# data factory that can be used when you need a Collection over arbitrary SQL
# query. I do not recommend using ModelData here, because ModelData optimized
# for work with a single model, while here we have to combine data from User,
# Member and Group models.
#
# I'm using CLS.with_attributes(...) here, but if you read documentation, you
# already know that it's the same as if I defined class:
#
# >>> class UserData(data.StatementSaData):
# >>>     statement = sa.select(...) # and here goes the whole value of select attribute.
#
UserData = data.StatementSaData.with_attributes(
    statement=sa.select(
        model.User.id,
        model.User.name,
        model.User.fullname,
        sa.func.string_agg(stmt_groups.c.name, ",").label("groups"),
        sa.func.string_agg(stmt_orgs.c.name, ",").label("organizations"),
    )
    .outerjoin(
        model.Member,
        sa.and_(
            model.User.id == model.Member.table_id,
            model.Member.table_name == "user",
        ),
    )
    .outerjoin(
        stmt_groups,
        sa.and_(
            stmt_groups.c.id == model.Member.group_id,
            stmt_groups.c.type == "group"
        ),
    )
    .outerjoin(
        stmt_orgs,
        sa.and_(
            stmt_orgs.c.id == model.Member.group_id,
            stmt_orgs.c.type == "organization"
        ),
    )
    .group_by(model.User)
)


# the collection itself. As you can see, the heavy work is done by data factory.
class UserCollection(collection.Collection):
    DataFactory = UserData
    # I don't know what format of report you are going to use, so let's choose CSV
    SerializerFactory = serialize.CsvSerializer


# initialize a collection
users = UserCollection()

# transform it into CSV
print(users.serializer.serialize())
```

To add filters to the collection, we need to modify the data factory. It will be converted into a standard class (instead of using .with_attributes). The value of statement is not changed. statement defines the baseline of the source data - it must include as much data as possible. Filters will be applied by defining the statement_with_filters method.

Example

```python
class UserData(data.StatementSaData):
    # statement is not changed
    statement = ...

    # this method is responsible for filtration. It's called automatically,
    # accepts `statement` of data factory and must return statement with
    # filters applied
    def statement_with_filters(self, stmt: sa.sql.Select) -> sa.sql.Select:
        # `self.attached` is a reference to collection that holds data
        # factory. `params` attribute contains data from the second argument
        # passed to the collection constructor
        params = self.attached.params

        # let's filter by exact match when using name
        if "name" in params:
            stmt = stmt.where(stmt.selected_columns["name"] == params["name"])

        # fullname will use case-insensitive substring match
        if "fullname" in params:
            fullname = params["fullname"]
            stmt = stmt.where(stmt.selected_columns["fullname"].ilike(f"%{fullname}%"))

        # groups/organizations are filtered like fullname. But you'll
        # probably use something more sophisticated
        for group_type in ["groups", "organizations"]:
            if group_type not in params:
                continue
            value = params[group_type]
            stmt = stmt.having(stmt.selected_columns[group_type].contains(value))

        return stmt


# this class remains unchanged
class UserCollection(collection.Collection):
    ...


# `params` used by `statement_with_filters` is a dictionary
# passed as a second argument to collection constructor. You can build html-form,
# submit it and extract data from `ckan.plugins.toolkit.request.args`. This value
# is a good candidate for `params`
users = UserCollection("", {"name": "default"})

# transform it into CSV
print(users.serializer.serialize())
```

And here's the distribution of datasets created by users in different organizations/groups, defined in the same manner:

Example

```python
from __future__ import annotations

import sqlalchemy as sa

from ckan import model
from ckanext.collection.shared import collection, data, serialize

# aliases that are required to join the Member model twice: once for
# membership of packages and once for membership of users.
package_membership = sa.alias(model.Member)
user_membership = sa.alias(model.Member)


class GroupStatsData(data.StatementSaData):
    # number of active datasets per user and group/organization
    statement = (
        sa.select(
            model.Group.name.label("group_name"),
            model.Group.title,
            model.Group.type,
            sa.func.count(model.Package.id).label("number of datasets"),
            model.User.name.label("user_name"),
        )
        .join(user_membership, model.Group.id == user_membership.c.group_id)
        .join(model.User, model.User.id == user_membership.c.table_id)
        .join(package_membership, model.Group.id == package_membership.c.group_id)
        .join(model.Package, model.Package.id == package_membership.c.table_id)
        .where(
            model.User.state == "active",
            model.Package.state == "active",
            model.Group.state == "active",
        )
        .group_by(model.User, model.Group)
    )


class GroupStatsCollection(collection.Collection):
    DataFactory = GroupStatsData
    SerializerFactory = serialize.CsvSerializer


stats = GroupStatsCollection()
print(stats.serializer.serialize())
```

smotornyuk commented 4 months ago

Here's an implementation of the first collection using an API action, just for reference. In this case, all the logic goes into the action and the collection becomes slim. You may find this style more readable, as you are more used to API actions.

Example

```python
from __future__ import annotations

from typing import Any

import sqlalchemy as sa

import ckan.plugins.toolkit as tk
from ckan import model
from ckan.types import Context  # available in CKAN 2.10+
from ckanext.collection.shared import collection, data, serialize


# action definition
@tk.side_effect_free
def my_user_listing(context: Context, data_dict: dict[str, Any]) -> dict[str, Any]:
    tk.check_access("my_user_listing", context, data_dict)

    # ApiSearchData use package_search-style for parameter names. rows ->
    # limit, start -> offset.
    rows = tk.asint(data_dict.get("rows", 10))
    start = tk.asint(data_dict.get("start", 0))

    stmt = sa.select(model.User)
    total = model.Session.scalar(sa.select(sa.func.count()).select_from(stmt))
    stmt = stmt.limit(rows).offset(start)

    # ApiSearchData expects package_search-like result, with `results` and
    # `count` keys
    return {
        "results": [
            {
                "id": user.id,
                "name": user.name,
                "fullname": user.fullname,
                "groups": user.get_group_ids("group"),
                "organizations": user.get_group_ids("organization"),
            }
            for user in model.Session.scalars(stmt)
        ],
        "count": total,
    }


UserData = data.ApiSearchData.with_attributes(action="my_user_listing")


class UserCollection(collection.Collection):
    DataFactory = UserData
    SerializerFactory = serialize.CsvSerializer


users = UserCollection()
print(users.serializer.serialize())
```
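
The rows/start and results/count contract mentioned in the comments can be tried out without CKAN. The sketch below uses a hypothetical in-memory `fake_user_listing` in place of the real action and a hypothetical `iterate_all` helper. It only illustrates how a consumer could page through such an action; it is not how ApiSearchData is actually implemented:

```python
from __future__ import annotations

from typing import Any, Callable, Iterator

# made-up records that stand in for the users table
USERS = [{"name": f"user-{i}"} for i in range(25)]


def fake_user_listing(data_dict: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical action with a package_search-like result."""
    rows = int(data_dict.get("rows", 10))
    start = int(data_dict.get("start", 0))
    return {"results": USERS[start : start + rows], "count": len(USERS)}


def iterate_all(
    action: Callable[[dict[str, Any]], dict[str, Any]], rows: int = 10
) -> Iterator[dict[str, Any]]:
    """Request page after page until `count` records were seen."""
    start = 0
    while True:
        page = action({"rows": rows, "start": start})
        yield from page["results"]
        start += rows
        # stop when everything is fetched or the page is empty
        if start >= page["count"] or not page["results"]:
            break


names = [user["name"] for user in iterate_all(fake_user_listing)]
```

As long as an action returns the total in count and honours rows and start, the consumer never needs to know how the records are produced.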

DataShades / ckanext-collection
