frictionlessdata / forum

🗣 Frictionless Data Forum esp for "How do I" type questions
https://frictionlessdata.io/

Does Data Package spec support PDFs as data source? #2

Closed Irio closed 4 years ago

Irio commented 4 years ago

Background

Recently, I decided to spend some time playing with real data of my own to learn more about the Data Package specifications. The personal app I'm building, which is not open at the moment, relies on a primary data source that is a collection of PDFs: unstructured and non-tabular, though they could be represented more semantically as HTML if I end up doing the conversion myself.

Question

Considering the minimal example package from http://frictionlessdata.io/docs/data-package/#getting-started, I understand that having a set of PDFs in the /data folder, with no schema, is acceptable and makes a valid data package. Is my understanding correct? Since the documentation is not explicit about this use case and I can't easily find data packages containing PDFs, I feel the need to ask for clarification.

rufuspollock commented 4 years ago

@Irio yes, you can absolutely use data packages for this use case, i.e. for a collection of PDFs.

If you want to see an example, I did this a lot in the Official Inquiries / Reports That Matter project I started a few years ago:

http://reportsthatmatter.org/

See, for example, this datapackage.json:

https://github.com/official-inquiries/uk-iraq-inquiry/blob/gh-pages/datapackage.json

(Please feel free to keep asking questions, or re-open the issue if this does not fully answer your question.)
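To make the answer above concrete, here is a minimal sketch of a descriptor for a folder of PDFs, built with Python's standard library only. The package name, the file paths, and the helper names `pdf_resource` and `pdf_package` are hypothetical and chosen purely for illustration. Each resource carries a `name` and a `path` and simply omits `schema`, since the files are not tabular.

```python
import json
from pathlib import PurePosixPath


def pdf_resource(path: str) -> dict:
    """Build a resource entry for one PDF (hypothetical helper, not a library API)."""
    stem = PurePosixPath(path).stem.lower()
    # Keep ASCII letters/digits and ".", "_", "-"; replace anything else with "-",
    # so the name stays lowercase and URL-friendly.
    name = "".join(
        c if (c.isascii() and c.isalnum()) or c in "._-" else "-" for c in stem
    )
    # No "schema" key: the resource is a non-tabular file.
    return {
        "name": name,
        "path": path,
        "format": "pdf",
        "mediatype": "application/pdf",
    }


def pdf_package(name: str, paths: list) -> dict:
    """Assemble a descriptor that lists every PDF as a resource."""
    return {"name": name, "resources": [pdf_resource(p) for p in paths]}


# Illustrative file names only; substitute the PDFs in your own /data folder.
descriptor = pdf_package("my-reports", ["data/Report 2016.pdf", "data/annex-a.pdf"])
print(json.dumps(descriptor, indent=2))
```

Saving the printed JSON as datapackage.json next to the data folder would give a package along the lines discussed in this thread.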

augusto-herrmann commented 4 years ago

Interesting, I hadn't really thought of that use case.

I was documenting my data sources primarily in markdown text, but I shall experiment with using data packages for that as well.

I wonder if using PDFs like that would break Good Tables continuous data validation.

augusto-herrmann commented 4 years ago

To answer my own question: you can exclude these data packages from Good Tables validation by specifying a goodtables.yml configuration file and keeping these data packages out of it. :)

Nevertheless, what is the advantage of defining such a data package with PDFs compared to, say, just documenting your PDFs in a Markdown file? Perhaps to export it to a data catalogue such as CKAN and have the metadata mapped automatically (as long as such a tool exists)?

Irio commented 4 years ago

@augusto-herrmann That's similar to how I thought about using it.

Each of the thousands of PDFs has some metadata (e.g., publicated_at, persisted_at, name, municipality_id), which I would like to save at the same time I download the files. Later, I would process the information in the data packages and the PDF pages into a database. The folder in the filesystem, holding the data in its initial, rawest form, would be compressed and backed up in cheap storage such as AWS Glacier.
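As a rough sketch of that workflow, the per-file metadata could be stored as extra properties on each resource at download time. The helper `resource_with_metadata`, the file path, and all values below are made up for illustration. The field names are the ones listed above; they are custom properties, not fields defined by the Data Package spec.

```python
import json

# Keys this sketch sets itself; custom metadata must not overwrite them.
RESERVED_KEYS = {"name", "path", "format", "mediatype"}


def resource_with_metadata(name: str, path: str, **custom) -> dict:
    """Return a PDF resource entry extended with custom metadata (hypothetical helper)."""
    clash = RESERVED_KEYS & custom.keys()
    if clash:
        raise ValueError(f"custom metadata would overwrite: {sorted(clash)}")
    resource = {
        "name": name,
        "path": path,
        "format": "pdf",
        "mediatype": "application/pdf",
    }
    resource.update(custom)  # custom, non-standard properties
    return resource


# All values are placeholders, not real data.
resource = resource_with_metadata(
    "gazette-0001",
    "data/gazette-0001.pdf",
    publicated_at="2020-01-15",
    persisted_at="2020-01-16T08:30:00Z",
    municipality_id=1234,
)
package = {"name": "municipal-gazettes", "resources": [resource]}
print(json.dumps(package, indent=2))
```

Keeping the metadata inside datapackage.json means the compressed folder stays self-describing once it is archived.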

augusto-herrmann commented 4 years ago

@Irio that would make sense, yes.

The metadata you mentioned aren't standard Data Package fields. I assume you are using a custom profile for that. Are you planning to publish this profile somewhere? Perhaps it could be shared by other projects and eventually become a spec such as Fiscal Data Package.