Open csarven opened 9 years ago
Okay, so I reserve something like starting-a-business.2014.csv for the "final file" of the workflow step. But what about the files that get created on the way to this file in the preprocessing step? Should I delete them after starting-a-business.2014.csv has been created?
Good question. In a perfect world, we should keep them, but I think retaining only the raw file that was retrieved, the preprocessed file (before transformation), and the file transformed to RDF is good enough. If we keep the workflow steps explicit (in RDF) as to what was done, that'd be okay. For example, try to make use of prov:used (http://www.w3.org/TR/prov-o/#used or something alike) to point at the URL of the tool used, e.g., https://launchpad.net/ubuntu/+source/wget
Nice, so I'll implement it like that :)
Preferably, also see which PROV-O or other property can be used to indicate the full command (including parameters) that was run from the command line.
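A rough sketch of what recording a step could look like, in Python, emitting Turtle as plain text. prov:used and prov:wasGeneratedBy are real PROV-O properties; holding the full command line in rdfs:comment is only a placeholder assumption, since the thread has not settled on a property for that. The function name and the example.org IRIs are hypothetical.

```python
import shlex

PREFIXES = (
    "@prefix prov: <http://www.w3.org/ns/prov#> .\n"
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
)


def describe_step(activity_iri, tool_iri, argv, output_iri):
    """Return a Turtle snippet describing one workflow step (hypothetical helper)."""
    # Rebuild the command line exactly as it could be typed in a shell (Python 3.8+).
    command = shlex.join(argv)
    # Escape backslashes and double quotes so the command is a valid Turtle string literal.
    literal = command.replace("\\", "\\\\").replace('"', '\\"')
    return (
        f"{PREFIXES}\n"
        f"<{activity_iri}> a prov:Activity ;\n"
        f"    prov:used <{tool_iri}> ;\n"
        # Assumption: rdfs:comment stands in for a dedicated "command" property.
        f'    rdfs:comment "{literal}" .\n\n'
        f"<{output_iri}> a prov:Entity ;\n"
        f"    prov:wasGeneratedBy <{activity_iri}> .\n"
    )


turtle = describe_step(
    "http://example.org/activity/retrieve-2014",  # hypothetical IRI
    "https://launchpad.net/ubuntu/+source/wget",
    ["wget", "-O", "starting-a-business.2014.html", "http://example.org/data"],
    "http://example.org/file/starting-a-business.2014.html",  # hypothetical IRI
)
print(turtle)
```

If a more specific property for the command turns up, only the rdfs:comment line would need to change.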
All files generated on the local system (e.g., *.csv) should have a pattern that's consistent, extensible, and reusable, i.e., don't mix camelCase and dash-separated terms in filenames; stick to one style.
Make sure to use "intended" filename extensions, e.g., currently, the use of .desc for Turtle prefixes is not a common practice, and arguably wrong (see http://www.fileextension.org/DESC). Which software would one need to process .desc?
Files like topicId.2014.1.html should have a filename that reflects the content and an extension that is the well-known one for the content type in the HTTP response headers. But I think we could make an exception here for .html, so leave it as .html. Perhaps 2014.1 or 2014.topic.1 for the file name, and the extension from the response header. Consider possible conflicts or annoyances down the line.
What does refined mean? On that note, if the scripts stick to one-time delimiter assignments, some of the preprocessing (e.g., the generated refined files) can simply be skipped!
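On taking the extension from the response header: a minimal Python sketch, assuming a small hand-maintained lookup table rather than any particular library. The table contents, the function name, and the "bin" fallback are all hypothetical choices for illustration.

```python
# Hypothetical lookup table: media type -> well-known filename extension.
EXTENSION_BY_MEDIA_TYPE = {
    "text/html": "html",
    "text/csv": "csv",
    "text/turtle": "ttl",
    "application/rdf+xml": "rdf",
}


def extension_for(content_type, default="bin"):
    """Map a Content-Type header value to a filename extension."""
    # Drop parameters such as "; charset=UTF-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return EXTENSION_BY_MEDIA_TYPE.get(media_type, default)


print(extension_for("text/html; charset=UTF-8"))
print(extension_for("text/turtle"))
```

An explicit table keeps the result predictable across systems, at the cost of having to add new media types by hand.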
It may be a good idea to reserve a filename and extension like starting-a-business.2014.csv (clean looking, as opposed to .sorted.merged) for the preprocessed file that's ready for the next workflow step (e.g., transformation).
Make sure that the replacement filenames for topicId.2014.1.html and starting-a-business.2014.csv, and for example starting-a-business.2014.rdf, follow the same pattern.
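One way to keep the names consistent is to generate all of them from a single helper. The sketch below is Python and assumes a topic.year[.part].extension layout with lowercase, dash-separated topics; that layout and both function names are hypothetical, not something agreed on in this thread.

```python
import re


def slugify(name):
    """Lowercase, dash-separated form of a name; camelCase is split on case changes."""
    # "topicId" -> "topic-Id": insert a dash where a lowercase letter or digit meets an uppercase letter.
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "-", name)
    # Collapse every run of non-alphanumeric characters into a single dash.
    name = re.sub(r"[^A-Za-z0-9]+", "-", name)
    return name.strip("-").lower()


def build_filename(topic, year, extension, part=None):
    """Build 'topic.year[.part].extension' (hypothetical pattern)."""
    pieces = [slugify(topic), str(year)]
    if part is not None:
        pieces.append(str(part))
    pieces.append(extension.lstrip("."))
    return ".".join(pieces)


print(build_filename("Starting a Business", 2014, "csv"))
print(build_filename("Starting a Business", 2014, "rdf"))
print(build_filename("topicId", 2014, "html", part=1))
```

With every script calling the same helper, the raw, preprocessed, and RDF files for a topic differ only in their extension.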