Closed simnh closed 3 years ago
I think I am also seeing this issue but in the context of an updated field descriptor that is failing to propagate upward into the table schema that the field is part of.
e.g.
mines_table.schema.get_field('STATE').descriptor.update({'description': 'State abbreviation.'})
mines_table.schema.commit()
print(mines_table.schema.descriptor['fields'][10])
print(mines_table.schema.get_field('STATE').descriptor)
yields:
{'name': 'STATE', 'type': 'string', 'format': 'default'}
{'name': 'STATE', 'type': 'string', 'format': 'default', 'description': 'State abbreviation.'}
Upon some investigation, the field/table issue is a bit different than the resource/package issue, so I've created a new issue over in the tableschema-py
repository:
https://github.com/frictionlessdata/tableschema-py/issues/217
Looking at the code, there's not currently any infrastructure for altering an existing resource, and notifying the parent package about it. Caling add_resource()
and remove_resource()
both result in package.__current_descriptor
being updated (the resource descriptor is appended to or deleted from the list of resources) and then package.__build()
is called. So if the ordering of resources within the data package isn't important, then one solution might be to use p.get_resource('r1')
make your edits, and then remove and re-add it to the package -- but this would result in the resource then being at the end of the list of resources.
Alternatively, since it seems like there will be a lot of updating existing resources that needs to happen in the course of fixing and adding on to any attributes which are inferred, and since folks do seem to be referring to resources based on their location in the list of resources, maybe there needs to be a package.update_resource.()
method that works like the add/remove methods, except that it updates the existing resource in place and re-builds the package, preserving the resource ordering.
As it is right now it seems like you have to start at the lowest level in the overall hierarchy of descriptors, and work your way up -- define a bunch of fields, and add them to a schema. Use the schema to make a table. Use the table to make a resource, and add that resource to a package. Then start over with the fields from the next table. I had thought the usage pattern was more meant to be infer a bunch of stuff from the files that will make up the data, hopefully getting a correct collection of metadata, then go in and clean it up or add to it a bit.
@zaneselvans Not sure I fully understand. Our target in this issue is to make a package's descriptor updated after a resource's descriptor change.
We have this method:
resource.py
def commit(self, strict=None):
"""https://github.com/frictionlessdata/datapackage-py#resource
"""
if strict is not None:
self.__strict = strict
elif self.__current_descriptor == self.__next_descriptor:
return False
self.__current_descriptor = deepcopy(self.__next_descriptor)
self.__table = None
self.__build()
return True
To achieve our goal as I see it we only need to make changes like these:
def commit(self, strict=None):
"""https://github.com/frictionlessdata/datapackage-py#resource
"""
if strict is not None:
self.__strict = strict
elif self.__current_descriptor == self.__next_descriptor:
return False
self.__current_descriptor = deepcopy(self.__next_descriptor)
self.__table = None
self.__build()
# find the index of the resource in self.__package.descriptor (or get the index in the constructor)
# self._package.descriptor['resources'][index] = deepcopy(self.__current_descriptor)
# self._package.commit()
return True
And this should be it. Shouldn't it?
Yes you're right -- we could go up from the resource into the package it knows itself to be part of and find the correct resource descriptor to update, but that only works if the package
is set. If you create a Resource and then later add it to a package, that information doesn't exist. What do you mean "get the index in the constructor?" Since the index of the resource within the package isn't necessarily persistent (given resource removal and addition) it seems like the only way to check which descriptor in the package belongs to the resource that's been changed is to compare their descriptors and find a matching name (though that would fail if the name is the attribute which has changed).
I guess before I really looked inside, I imagined that hierarchy of structures making up the data package was preserved and accessible as software objects rather than simply json descriptors, so one could tell the top level (package) to build / commit and in doing so it would recursively ask all of the resources, schemas, fields, etc.for their new descriptors and assemble those into the package descriptor.
Without having those persistent relationships between the different objects, then it seems like just the expected usage pattern is just dealing with the individual objects one at a time and then composing their individual descriptors into a full package. Is that the intention?
Resource and then later add it to a package, that information doesn't exist.
Good catch. Then package.add_resource
should set resource.package
What do you mean "get the index in the constructor?"
I mean to make it more robust package
could initiate a resource
passing a resource index (not only parent package link). But I haven't dived to deep into this thinking TBH
But package.add_resource()
(and the package
itself) don't have access to the Resource
object so it can't set resource.package
-- add_resource()
just takes a descriptor.
The mixing of the two ways of thinking about the package/resource/schema/field information (object vs. JSON descriptor) is what tripped me up initially. In writing [my first packaging script](https://github.com/catalyst-cooperative/pudl/tree/master/scripts/data_pkgs/pudl-msha] I ended up making very little use of the [datapackage] modules, and instead primarily manipulated the metadata as descriptors. I only created a Resource
object when I thought the descriptor was done, and used it to infer
some additional metadata and check to see if the whole mess was valid (and the same with the Package
object).
This usage pattern can work, but it's fairly different than many of the examples that are posted -- which start by loading some data and inferring metadata from it, rather than reading in a JSON descriptor that has been written by hand, modifying it, and only at the end using it to create a software object, which seems to be what's required if you want to manipulate the internal parts of the nested specified structures.
Does that seem right?
But package.add_resource() (and the package itself) don't have access to the Resource object so it can't set resource.package -- add_resource() just takes a descriptor.
Yes, of course. I mean it requires some complex approach. It should be easy to implement but still: a few updates here, a few updates there etc.
In high-level, yes, the intended approach for this iteration (focused on data publishers) was like:
package.descriptor
So for now, more complex editing API (object-based) is not really established. But at the same time, there is a not closed path to improve/implement it.
It's been fixed in Frictionless Framework as the whole working with metadata design was re-worked
Hey,
I want to automatically update resource descriptors, however, the descriptor of the whole package is not updated after I update the resource descriptor.
For example:
Is there a way to make code like above work?