frictionlessdata / ckanext-datapackager

CKAN extension for importing/exporting Data Packages.
36 stars 14 forks source link

README.md included in a data package not added as a resource in CKAN #60

Open Stephen-Gates opened 6 years ago

Stephen-Gates commented 6 years ago

A data package can include a README.md. When I upload a datapackage.zip file I expect the README.md inside to be added as a resource in CKAN.

Currently the README.md is not uploaded so provenance information is lost.

amercader commented 6 years ago

That sounds like a sensible convention. The original implementation only looked at what was defined in the spec so this was skipped.

Note that READMEs will get currently imported if they are listed as resources in the descriptors just like any other file.

Implementation

This will be quite a lot trickier than it would appear. We are basically passing the uploaded zipfile to the datapackage library, that handles all the extracting etc. We get a descriptor object back that does not include the README file (or any other file included in the zip). So anything we implement will need to be on top of that.

I don't want to re-extract the files just to check if there is a README (and we probably can't), so the only thing I can think of is guessing the temp folder where datapackage extracted the zip file here (using a resource path and the name of zip file), and check the contents to see if there is a README file. If it is, create an additional resource with _create_and_upload_local_resource().

We probably want to check that there isn't an actual resource with the README to avoid duplication.

Estimate

1 day

GeraldGrootRoessink commented 6 years ago

To anticipate the importing of the README as a CKAN-resource I am looking for a markdown_view. Has any work been done of that?

Stephen-Gates commented 6 years ago

We use https://github.com/markdown-it/markdown-it in Data Curator to allow the input and preview of markdown.

Links to the CommonMark reference and community implementations at at https://github.com/commonmark/CommonMark

GeraldGrootRoessink commented 6 years ago

Thanks, I will look into it.
First I would like the imported readme.md to be readable in CKAN. Maybe it is doable to create a fourth text_view for resources with format=md using the helper-function: h.render_markdown? Is that a good idea?

ghost commented 6 years ago

thanks @amercader Yep not a pretty solution, but I can't find another way of doing this neatly either - I think perhaps adding README as resource is not the way to go as my understanding is (and happy to be corrected here) that a README isn't considered part of the resources for a datapackage as it doesn't hold data - and can't be validated and inferred like other resources - , and working out the temp folder (from existing resources) seems the only way for now. Although datapackage has 'base_path' property, which might make this simpler, but it is deprecated (but maybe it is just default_base_path that is meant to be deprecated?). Asked @roll about this.

ghost commented 6 years ago

Hi @amercader In the end I've disassociated the sense of resources (just in my head) between frictionless datapackages and ckan. Have a PR for this with uploading zip, where README.md is a local resource (as you have advised, and then download ensures it doesn't exist in associated package.json). Just waiting for approval on our side. Also had a question, just in regards to issue of handling downloads:

For the upcoming PR I have considered that as the datapackage.json would not reference README.md, containing that README.md in the download zip accounts for the README.md file. But as datapackage.json was being created as file (as would never/rarely be large) and then downloaded, so too for README.md (the case would usually/always be README.md is only a small file by comparison), so it would be OK to add this to, a simple zip, which would have a set list of filenames - that's the only way I can consider restricting this atm, unless it should also have some explicit size check (apart from python's zip library constraint for size) (atm just README.md) that could be downloaded in this way (ie: without using queue or considering the delay in creating the zip and downloading. The intention is that the 'simple' zip would never contain potentially very large data resources such as CSV files in the zip. That way we can complete this functionality for including a README.md, without tackling the much larger requirement (which I don't think I can find room for), in issue 52 What are your thoughts?