Check: Text file format valid

NCEAS / metadig-rake

MetaDIG rake, a cross-domain QA/QC library

Apache License 2.0

Check: Text file format valid #2

Closed: jeanetteclark closed 1 year ago

jeanetteclark commented 2 years ago

Purpose

This check will look to see if a tabular data file in a text format can be parsed.

Components

is a text format (boolean)
file name
distribution URL
number of header lines
delimiter

Result

SUCCESS: if one or more files are parsed correctly or no text files exist
FAILURE: if no files can be parsed
ERROR: if files cannot be accessed
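
A minimal sketch of how the components and results above could fit together, for illustration only and not the check's actual implementation. It assumes "parsed correctly" means every non-blank row has the same number of fields, with at least `min_columns` columns; the dictionary keys, the function names, and `min_columns` are hypothetical.

```python
import csv
import io


def is_well_formed(text, delimiter=",", header_lines=1, min_columns=2):
    """Return True if the delimited text parses into rows of equal width.

    Assumption: the last header line carries the column names, so it is
    kept as the reference row; earlier header lines are skipped.
    """
    try:
        rows = list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))
    except csv.Error:
        return False
    body = [row for row in rows[max(header_lines - 1, 0):] if row]  # drop blank lines
    if not body:
        return False
    width = len(body[0])
    return width >= min_columns and all(len(row) == width for row in body)


def check_result(files):
    """Map per-file outcomes to SUCCESS or FAILURE as described above.

    `files` is a list of dicts with hypothetical keys: is_text, text,
    delimiter, header_lines.
    """
    text_files = [f for f in files if f["is_text"]]
    if not text_files:
        return "SUCCESS"  # no text files exist
    parsed = [
        is_well_formed(f["text"], f["delimiter"], f["header_lines"])
        for f in text_files
    ]
    return "SUCCESS" if any(parsed) else "FAILURE"
```

The ERROR case is deliberately left out of this sketch, since it concerns whether the files could be retrieved at all rather than what they contain.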

mbjones commented 2 years ago

@jeanetteclark ERROR is reserved for when the test fails to run (e.g. the network is down). An ERROR indicates a bug in the system, not a data-driven failure. When a test runs to completion, it should always return SUCCESS or FAILURE based on the content evaluation. Happy to discuss.
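
The distinction can be sketched as a small wrapper, assuming a check is a callable that returns True or False once it has evaluated the content and raises only when it cannot run. The name `run_check` and the result dictionary are hypothetical, not part of metadig.

```python
def run_check(evaluate):
    """Run a check and map its outcome to SUCCESS, FAILURE, or ERROR.

    `evaluate` is a hypothetical zero-argument callable. Any exception is
    treated as the check failing to run (ERROR), never as a verdict on
    the data itself.
    """
    try:
        passed = evaluate()
    except Exception as exc:  # e.g. the network is down
        return {"status": "ERROR", "message": f"check could not run: {exc}"}
    return {"status": "SUCCESS" if passed else "FAILURE", "message": ""}
```

Under this arrangement, a file that downloads but does not parse must be reported by `evaluate` returning False, not by letting the parser's exception escape.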

jeanetteclark commented 2 years ago

Okay, that makes sense. I'll move the "no text files exist" case to success.

jeanetteclark commented 1 year ago

This check is nearly done. I still need to make the mechanism for retrieving data pids (and thus URLs/paths) for data access consistent with what I did for the data format check.

mbjones commented 1 year ago

Great! Can you define 'text'? Do you mean ASCII? UTF-8? UTF-16? Other Unicode encodings? Windows cp-1252?

jeanetteclark commented 1 year ago

So I've been thinking the name should probably be changed, since this check is really about delimited text files (csv, tsv) and doesn't deal with encodings at all. Files are identified by looking in the metadata for entities with a physical/dataFormat/textFormat element. I think, though, that we should probably be checking formatId instead. Happy to hear your thoughts.

mbjones commented 1 year ago

Aha, that makes sense. Yes, I think using formatId of text/csv for example makes sense to apply this test. How about naming it something more like data.table-text-delimited.well-formed? See naming discussion in #15 .

Some related tests might be metadata.formatId.congruent (to test if the formatId and the values inside the metadata format fields like physical/dataFormat match) and data.format.congruent (to test if the data format found in the file matches what is claimed in the metadata formatId).
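
As a rough sketch of the formatId-based selection and the first of those related tests: entities are modeled here as plain dictionaries with hypothetical keys `formatId` and `has_text_format` (whether the metadata carries a physical/dataFormat/textFormat element). The set of delimited formatIds is an assumption, and real congruence rules would need to be more nuanced than this.

```python
# Assumed set of formatIds that mark delimited text; only text/csv is
# named in this thread, the other entry is the IANA media type for TSV.
DELIMITED_FORMAT_IDS = {"text/csv", "text/tab-separated-values"}


def select_delimited(entities):
    """Return the entities whose formatId marks them as delimited text."""
    return [e for e in entities if e.get("formatId") in DELIMITED_FORMAT_IDS]


def format_id_congruent(entity):
    """Simplified congruence test between formatId and the metadata.

    Assumption: a delimited formatId should coincide with the presence
    of a textFormat element, and a non-delimited one with its absence.
    """
    is_delimited = entity.get("formatId") in DELIMITED_FORMAT_IDS
    return is_delimited == bool(entity.get("has_text_format"))
```

For example, an entity with formatId `text/csv` but no textFormat element would be selected for the parsing check and would also be flagged as incongruent.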

jeanetteclark commented 1 year ago

check has been renamed and restructured @c47b03c8c

going to close this one for now