frictionlessdata / forum

🗣 Frictionless Data Forum esp for "How do I" type questions
https://frictionlessdata.io/

Table Schema Validation for Files with Multi-line Metadata or No Headers at All #41

Closed aidanmontare-edu closed 4 years ago

aidanmontare-edu commented 4 years ago

I'm very impressed by the tools and workflows at https://frictionlessdata.io, and I was wondering if there was something similar to Table Schema, but that supported CSV files with multiple-line headers of various formats, or no headers at all.

To clarify, by multiple-line headers, I'm referring to non-CSV data that precedes the CSV data. This is typically used to store metadata about the information in a dataset. The case of CSV files without headers (i.e. no column names, the first line in the file is the first data row) is also important to my use case.

While these kinds of files aren't the standard (i.e. RFC4180) way of writing CSV files, they're very common, and it would be very useful to be able to write schema files that they can be validated against. The specific task I have in mind is a program that can detect which schema a CSV file is an example of, and coerce that data into a new, more standard format.

I think this may be helpful to those in the scientific community who are dealing with many files of slightly different formats (at least, it would help my project). I'd appreciate hearing if you have any suggestions, or know of any work like yours that might be applicable to my situation.

rufuspollock commented 4 years ago

@aidanmontare-edu - great to hear from you.

Having "no headers" is already supported by setting header to false.

https://specs.frictionlessdata.io/csv-dialect/
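
For illustration, here is a minimal sketch of what that could look like: a resource descriptor whose `dialect` sets `header` to `false`, with the column names coming from a Table Schema instead of the file. The path `readings.csv` and the field names are hypothetical, and the reader uses only Python's standard `csv` module, not any Frictionless library.

```python
import csv
import io

# Hypothetical descriptor: the path and field names are made up for illustration.
descriptor = {
    "path": "readings.csv",
    "dialect": {"header": False},  # the first line of the file is data, not column names
    "schema": {
        "fields": [
            {"name": "station", "type": "string"},
            {"name": "temperature", "type": "number"},
        ]
    },
}


def read_rows(text, descriptor):
    """Read CSV text and label each row with the field names from the schema."""
    names = [field["name"] for field in descriptor["schema"]["fields"]]
    has_header = descriptor.get("dialect", {}).get("header", True)
    reader = csv.reader(io.StringIO(text))
    if has_header:
        next(reader, None)  # discard the header row; the schema names are used instead
    return [dict(zip(names, row)) for row in reader]


sample = "A1,12.5\nB2,13.0\n"
rows = read_rows(sample, descriptor)
```

Values stay as strings here; casting them to the schema types is left to whichever validation tool consumes the descriptor.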

Multiline headers is a more open issue - it's been brought up a few times and we don't have a good solution. I've just opened an issue in specs about this: https://github.com/frictionlessdata/specs/issues/681

aidanmontare-edu commented 4 years ago

Thanks for getting back to me!

If it's not too much of a hassle, could you point me to an example of using the header setting with something like goodtables? I've worked out a rough solution that passes the list of column names to the headers argument of goodtables.validate for files that do not have a header. What you're describing seems simpler, but I'm not sure how to pass that option to goodtables.

As for multiline headers, my current solution is to have some additional code to find the lines of metadata, and then pass those lines in the skip_rows argument of goodtables.validate. This does seem to work fairly well.
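
A rough sketch of that kind of pre-scan, using only the standard library. It assumes the metadata lines come first and that none of them happens to split into the expected number of columns; the function name `find_preamble_rows` and the sample lines are hypothetical.

```python
import csv


def find_preamble_rows(lines, expected_columns):
    """Return the 1-based numbers of the leading lines that are not data rows.

    A line counts as tabular once it parses into `expected_columns` fields;
    every line before the first such line is treated as metadata.
    """
    preamble = []
    for number, line in enumerate(lines, start=1):
        fields = next(csv.reader([line]), [])
        if len(fields) == expected_columns:
            break
        preamble.append(number)
    return preamble


sample = [
    "# Instrument: hypothetical logger",
    "# Collected 2019",
    "",
    "station,temperature",
    "A1,12.5",
]
skip = find_preamble_rows(sample, expected_columns=2)
```

The resulting list could then be handed to the `skip_rows` argument mentioned above, assuming it takes 1-based row numbers. A metadata line containing exactly the right number of commas would be mistaken for data, so real files may need a stricter test.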

rufuspollock commented 4 years ago

@roll think this is one for you now 😉

roll commented 4 years ago

Hi @aidanmontare-edu,

Sorry the documentation is not yet properly consolidated for goodtables (I'm working on it now) but you can use all the tabulator options for goodtables.validate including headers in various forms - https://github.com/frictionlessdata/tabulator-py#headers

Here is an example of a multiline headers row for tabulator but you can pass the same options to goodtables - https://colab.research.google.com/drive/1gfB1pc7hO-lj2947nuAjStRFFpRJ0jru
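
To make the idea of a header spread over several rows concrete, here is a small pure-Python sketch that joins stacked header rows into one list of column names. It is an illustration only, not tabulator's implementation; the function name `join_header_rows`, the joiner, and the sample rows are assumptions.

```python
def join_header_rows(rows, joiner=" "):
    """Combine several header rows into a single list of column names."""
    width = max(len(row) for row in rows)
    names = []
    for index in range(width):
        parts = []
        for row in rows:
            value = row[index].strip() if index < len(row) else ""
            # Skip blanks and immediate repeats so "a" over "a" does not become "a a".
            if value and (not parts or parts[-1] != value):
                parts.append(value)
        names.append(joiner.join(parts))
    return names


header_rows = [
    ["station", "temperature", "temperature"],
    ["", "min", "max"],
]
names = join_header_rows(header_rows)
```

The joined names could then be supplied as an explicit list of headers, with the original header rows skipped.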

rufuspollock commented 4 years ago

FIXED.

@aidanmontare-edu did @roll's response help you? I'm going to close this for now as I think a good part is answered (affirmatively) and the other item (multiline support in base spec) is in frictionlessdata/specs#681