frictionlessdata / datapackage-py

A Python library for working with Data Packages.
https://frictionlessdata.io
MIT License
189 stars 44 forks source link

Update package descriptor after resource descriptor has been updated #213

Closed simnh closed 3 years ago

simnh commented 6 years ago

Hey,

I want to automatically update resource descriptors, however, the descriptor of the whole package is not updated after I update the resource descriptor.

For example:

p = Package('datapackage.json')
r = p.get_resource('r1')
r.descriptor = new_descriptor
r.commit()
p.commit()

r.descriptor # -> updated description
p.descriptor # -> still the old one 

Is there a way to make code like above work?

zaneselvans commented 5 years ago

I think I am also seeing this issue but in the context of an updated field descriptor that is failing to propagate upward into the table schema that the field is part of.

e.g.

mines_table.schema.get_field('STATE').descriptor.update({'description': 'State abbreviation.'})
mines_table.schema.commit()
print(mines_table.schema.descriptor['fields'][10])
print(mines_table.schema.get_field('STATE').descriptor)

yields:

{'name': 'STATE', 'type': 'string', 'format': 'default'}
{'name': 'STATE', 'type': 'string', 'format': 'default', 'description': 'State abbreviation.'}
zaneselvans commented 5 years ago

Upon some investigation, the field/table issue is a bit different than the resource/package issue, so I've created a new issue over in the tableschema-py repository:

https://github.com/frictionlessdata/tableschema-py/issues/217

zaneselvans commented 5 years ago

Looking at the code, there's not currently any infrastructure for altering an existing resource, and notifying the parent package about it. Caling add_resource() and remove_resource() both result in package.__current_descriptor being updated (the resource descriptor is appended to or deleted from the list of resources) and then package.__build() is called. So if the ordering of resources within the data package isn't important, then one solution might be to use p.get_resource('r1') make your edits, and then remove and re-add it to the package -- but this would result in the resource then being at the end of the list of resources.

Alternatively, since it seems like there will be a lot of updating existing resources that needs to happen in the course of fixing and adding on to any attributes which are inferred, and since folks do seem to be referring to resources based on their location in the list of resources, maybe there needs to be a package.update_resource.() method that works like the add/remove methods, except that it updates the existing resource in place and re-builds the package, preserving the resource ordering.

As it is right now it seems like you have to start at the lowest level in the overall hierarchy of descriptors, and work your way up -- define a bunch of fields, and add them to a schema. Use the schema to make a table. Use the table to make a resource, and add that resource to a package. Then start over with the fields from the next table. I had thought the usage pattern was more meant to be infer a bunch of stuff from the files that will make up the data, hopefully getting a correct collection of metadata, then go in and clean it up or add to it a bit.

roll commented 5 years ago

@zaneselvans Not sure I fully understand. Our target in this issue is to make a package's descriptor updated after a resource's descriptor change.

We have this method:

resource.py

    def commit(self, strict=None):
        """https://github.com/frictionlessdata/datapackage-py#resource
        """
        if strict is not None:
            self.__strict = strict
        elif self.__current_descriptor == self.__next_descriptor:
            return False
        self.__current_descriptor = deepcopy(self.__next_descriptor)
        self.__table = None
        self.__build()
        return True

To achieve our goal as I see it we only need to make changes like these:

    def commit(self, strict=None):
        """https://github.com/frictionlessdata/datapackage-py#resource
        """
        if strict is not None:
            self.__strict = strict
        elif self.__current_descriptor == self.__next_descriptor:
            return False
        self.__current_descriptor = deepcopy(self.__next_descriptor)
        self.__table = None
        self.__build()

        # find the index of the resource in self.__package.descriptor (or get the index in the constructor)
        # self._package.descriptor['resources'][index] = deepcopy(self.__current_descriptor)
        # self._package.commit()

        return True

And this should be it. Shouldn't it?

zaneselvans commented 5 years ago

Yes you're right -- we could go up from the resource into the package it knows itself to be part of and find the correct resource descriptor to update, but that only works if the package is set. If you create a Resource and then later add it to a package, that information doesn't exist. What do you mean "get the index in the constructor?" Since the index of the resource within the package isn't necessarily persistent (given resource removal and addition) it seems like the only way to check which descriptor in the package belongs to the resource that's been changed is to compare their descriptors and find a matching name (though that would fail if the name is the attribute which has changed).

I guess before I really looked inside, I imagined that hierarchy of structures making up the data package was preserved and accessible as software objects rather than simply json descriptors, so one could tell the top level (package) to build / commit and in doing so it would recursively ask all of the resources, schemas, fields, etc.for their new descriptors and assemble those into the package descriptor.

Without having those persistent relationships between the different objects, then it seems like just the expected usage pattern is just dealing with the individual objects one at a time and then composing their individual descriptors into a full package. Is that the intention?

roll commented 5 years ago

Resource and then later add it to a package, that information doesn't exist.

Good catch. Then package.add_resource should set resource.package

What do you mean "get the index in the constructor?"

I mean to make it more robust package could initiate a resource passing a resource index (not only parent package link). But I haven't dived to deep into this thinking TBH

zaneselvans commented 5 years ago

But package.add_resource() (and the package itself) don't have access to the Resource object so it can't set resource.package -- add_resource() just takes a descriptor.

The mixing of the two ways of thinking about the package/resource/schema/field information (object vs. JSON descriptor) is what tripped me up initially. In writing [my first packaging script](https://github.com/catalyst-cooperative/pudl/tree/master/scripts/data_pkgs/pudl-msha] I ended up making very little use of the [datapackage] modules, and instead primarily manipulated the metadata as descriptors. I only created a Resource object when I thought the descriptor was done, and used it to infer some additional metadata and check to see if the whole mess was valid (and the same with the Package object).

This usage pattern can work, but it's fairly different than many of the examples that are posted -- which start by loading some data and inferring metadata from it, rather than reading in a JSON descriptor that has been written by hand, modifying it, and only at the end using it to create a software object, which seems to be what's required if you want to manipulate the internal parts of the nested specified structures.

Does that seem right?

roll commented 5 years ago

But package.add_resource() (and the package itself) don't have access to the Resource object so it can't set resource.package -- add_resource() just takes a descriptor.

Yes, of course. I mean it requires some complex approach. It should be easy to implement but still: a few updates here, a few updates there etc.

In high-level, yes, the intended approach for this iteration (focused on data publishers) was like:

So for now, more complex editing API (object-based) is not really established. But at the same time, there is a not closed path to improve/implement it.

roll commented 3 years ago

It's been fixed in Frictionless Framework as the whole working with metadata design was re-worked