csarven / doingbusiness-linked-data

Doing Business Linked Data

Reconsider generated filenames and extensions #4

Open csarven opened 9 years ago

csarven commented 9 years ago

All files generated on the local system (e.g., *.csv) should follow a pattern that's consistent, extensible, and reusable, i.e., don't mix camelCase and dash-separated terms in filenames; stick to one style.

Make sure to use the "intended" filename extensions. For example, the current use of .desc for Turtle prefixes is not common practice, and arguably wrong (see http://www.fileextension.org/DESC). Which software would one need to process .desc?
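
For reference, a file holding only Turtle prefix declarations is itself valid Turtle, so the registered .ttl extension (media type text/turtle) would be the natural fit. A minimal sketch with illustrative prefixes:

```turtle
# A plain Turtle file of prefix declarations; any Turtle parser handles a .ttl file like this.
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
```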

Files like topicId.2014.1.html should reflect the content in the filename and use a well-known extension matching the content-type from the HTTP response headers. But I think we could make an exception here for .html, so leave it as .html. Perhaps 2014.1 or 2014.topic.1 for the filename, with the extension taken from the response header. Consider possible conflicts or annoyances down the line.

What does refined mean? On that note, if the scripts stick to one-time delimiter assignments, some of the preprocessing (e.g., the generated refined files) can simply be skipped!

It may be a good idea to reserve a filename and extension like starting-a-business.2014.csv (cleaner looking than .sorted.merged) for the preprocessed file that's ready for the next workflow step (e.g., transformation).

Make sure that the replacement filenames for topicId.2014.1.html and starting-a-business.2014.csv, and for example starting-a-business.2014.rdf, all follow the same pattern.

reni99 commented 9 years ago

Okay, so I'll reserve something like starting-a-business.2014.csv for the "final file" of the workflow step. But what about files that get created on the way to this file in the preprocessing step? Should I delete them after starting-a-business.2014.csv is created?

csarven commented 9 years ago

Good question. In a perfect world, we would keep them all, but I think retaining only the raw file that was retrieved, the preprocessed file (before transformation), and the file transformed to RDF is good enough IMO. If we keep the workflow steps explicit (in RDF) as to what was done, that'd be okay. For example, try to make use of prov:used (http://www.w3.org/TR/prov-o/#used or something similar) to point at the URL of the tool used, e.g., https://launchpad.net/ubuntu/+source/wget
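
A minimal Turtle sketch of that idea, assuming a hypothetical base URI, activity name, and file URL; only prov:used and prov:wasGeneratedBy are taken from PROV-O:

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix :     <http://example.org/doingbusiness/> .  # hypothetical base URI

# Hypothetical activity recording the retrieval step and the tool it used
:retrieve-2014 a prov:Activity ;
    prov:used <https://launchpad.net/ubuntu/+source/wget> .

# A generated file points back to the activity that produced it
<http://example.org/data/starting-a-business.2014.csv>
    prov:wasGeneratedBy :retrieve-2014 .
```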

reni99 commented 9 years ago

Nice, so I'll implement it like that :)

csarven commented 9 years ago

Preferably, also see which PROV-O or other property can be used to indicate the full command (including parameters) that was used from the command line to execute it.
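
PROV-O doesn't define a dedicated "command line" property, so this is only one possible approach: model the invocation as a prov:Plan attached through a qualified association, keeping the full command string as a literal. The names and the command below are placeholders:

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix :     <http://example.org/doingbusiness/> .  # hypothetical base URI

# Placeholder plan carrying the exact command line that was executed
:retrieve-2014-plan a prov:Plan ;
    rdfs:label "Command used for the retrieval step" ;
    prov:value "wget --some-options http://example.org/source-page" .  # placeholder command

:retrieve-2014 a prov:Activity ;
    prov:qualifiedAssociation [
        a prov:Association ;
        prov:hadPlan :retrieve-2014-plan
    ] .
```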