Dealing with Files - Githubissues

Issue ported from old casework github repo (issue 19). Original author: casework

For the "File" property bundle, what is considered to have a file system type? Besides the obvious ones like NTFS and EXT4, there are also things like TAR and 7z that are included because they have file-like properties (filepath, MAC times, etc.). Do we consider other file systems that don't have the traditional file-like properties but have other properties used to extract a file (e.g. SQLiteBlob, Encryption, Compression, etc.)? These file types obviously don't have MAC times and filePath, but they have other properties used to grab files:
SQLiteBlob uses the parameters: tableName, columnName, and rowCondition or rowIndex
Encryption uses the parameters: key, IV, method, and cipher mode
Encoding uses the parameters: method

Replies by mike-parkhill:

Is SQLiteBlob intended to be used for non-blob data too? For example, a chat message pulled from a SQLite table would likely be stored in the db as text rather than as a blob. Is that okay? Would it make more sense to have a separate non-blob type, or to give the existing type a more generic name?

In fact, looking more closely at it, does it need to even reference SQLite? Does it need to be that specific, or would a more generic DatabaseObject be appropriate?

What are the expectations for embedding/attaching actual file bytes? It looks like the ContentData.dataPayload string could take an encoded string representation, but that could get very large for some data types (e.g. video, hi-res photo, etc.). Am I understanding the use-case for that field properly?

If it's all to be embedded, is the expected encoding scheme defined (e.g. Base64, uuencode)?
Or is there a way of "attaching" data externally? Maybe we could consider delivering a "package" or "container" file instead of just the JSON. For example, create a zip/tar that includes the JSON file, which then simply references the raw data stored elsewhere in the zip/tar.
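The "package" idea could look roughly like the following sketch. It is only an illustration of the suggestion, not an agreed CASE format: the member names case.json and data/, the "traces" list, and the use of a relative path in dataPayloadReferenceURL are all assumptions made for this example.

```python
import io
import json
import zipfile


def build_package(files: dict) -> bytes:
    """Return the bytes of a zip holding case.json plus the raw data under data/."""
    traces = []
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, raw in files.items():
            member = "data/" + name  # hypothetical layout inside the package
            zf.writestr(member, raw)
            # Assumption: the JSON points at the raw bytes by relative path.
            traces.append({"@type": "ContentData",
                           "dataPayloadReferenceURL": member})
        zf.writestr("case.json", json.dumps({"traces": traces}, indent=2))
    return buf.getvalue()


def read_payloads(package: bytes) -> dict:
    """Resolve every reference in case.json against the same zip."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        doc = json.loads(zf.read("case.json"))
        return {t["dataPayloadReferenceURL"]: zf.read(t["dataPayloadReferenceURL"])
                for t in doc["traces"]}


package = build_package({"video.bin": b"\x00" * 1024, "note.txt": b"hello"})
payloads = read_payloads(package)
```

The JSON stays small because it carries only references, while the raw bytes travel next to it in the same container.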

Replies by sbarnum:

On the original question expressed for this issue "For the "File" property bundle, what are considered to have a file system type?", I do not have an exhaustive list of types but I would suggest that it should apply only to things that are actual recognized file system types or serve a function VERY much like one. I would place archives like TAR, ZIP, etc in this latter category.

I would certainly not consider things that "don't have the traditional file-like properties but have other properties to extract a file" as file systems. A SQLiteBlob is simply a blob of data in a SQLite database that is stored exactly as it was entered. Its "location" within that database is managed by the database, not the blob, and that management is performed by a database, not a file system. I would not consider Encryption or Encoding anything similar to a file system. They are merely transformations on a body of data. The parameters used for that transformation bear no affinity to the function of a file system. These are all different things from file systems and should be captured/expressed as property bundles with semantics independent from file systems.

I would agree with comments above that it looks like the current properties of the SQLiteBlob property bundle are truly properties of the location of a database object within a database rather than properties unique to a SQLiteBlob.

I would agree with the proposed suggestion of changing its name to DatabaseObject or even DatabaseObjectLocation. This would seem to provide support for expressing location details of non-blob database objects within SQLite DBs as well as within other types of databases. Initially, we could just change the name, but if we do generalize it, the community should consider and discuss whether there are other relevant properties for expressing object locations within databases other than SQLite (e.g. non-relational DBs).

For now, users of CASE/UCO v0.1.0 should likely just use the SQLiteBlob property bundle for this purpose even without an optimal current name for it.

On the question of "expectations for embedding/attaching actual file bytes", ContentData currently provides capability for either embedding actual bytes or referencing external storage of those bytes.

The ContentData.dataPayload property is intended for supporting the direct embedding of content data within the CASE/UCO data. This is obviously less practical for large data than for smaller data. The field is currently defaulted to presume base64 encoding. It may be desirable in future versions to support different encodings as well. This would likely require an additional property to assert the encoding type.
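A minimal sketch of the embedding case, assuming the property bundle is serialized as a JSON object with an "@type" key (that layout is an assumption for illustration; only dataPayload and the base64 default come from the discussion above):

```python
import base64
import json


def embed_payload(raw: bytes) -> dict:
    """Build a ContentData-like dict with the bytes embedded as base64 text."""
    return {
        "@type": "ContentData",  # assumed JSON layout, for illustration only
        "dataPayload": base64.b64encode(raw).decode("ascii"),
    }


def extract_payload(bundle: dict) -> bytes:
    """Recover the original bytes, rejecting non-base64 characters."""
    return base64.b64decode(bundle["dataPayload"], validate=True)


raw = b"\x89PNG\r\n\x1a\n example bytes"
bundle = embed_payload(raw)
serialized = json.dumps(bundle)
restored = extract_payload(json.loads(serialized))
```

Base64 emits four characters for every three input bytes, so an embedded payload is about a third larger than the raw data. That overhead is one reason embedding becomes less practical for video or high-resolution images.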

The ContentData.dataPayloadReferenceURL is intended for supporting the referencing of content data stored separate from the CASE/UCO data stream.

Continued implementations and feedback will determine whether the approach of these fields is adequate to support the necessary use cases.

Replies by mike-parkhill:

I would think it should be acceptable (at least for a v1.0) to simply state that the standard expects base64 encoding. If that's clear in the implementation documentation then I don't think it should be a problem. We just need to make sure there's no ambiguity.

Does the dataPayloadReferenceURL expect an internet address, or would a relative file reference be acceptable? I would expect that given the nature of a lot of the materials that our users work with that passing a file would be preferable to hosting it on an accessible server.

For the file system type part that originated this question (sorry for polluting the thread), are the types expected to be defined in a formal enum or limited vocabulary? If so, have you thought about how readily that can be extended as new types become popular? (This question applies to all uses of such mechanisms.)

Reply by casework:

Sounds good. To summarize, the "File" property bundle would only be used IF the data is a file in the sense that it has a file path and/or MAC times. In these cases the "File" property bundle will also have a "FileSystemType" property defining the file system the metadata is from. Data from a non-traditional source like a database or the result of a transformation (encoding, encryption, etc.) will not have a "File" property bundle at all.

In this case, it is possible to have a blank Trace object containing no property bundles if the tool did not collect any information to warrant creating a "ContentData" property bundle.
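The rule summarized above can be sketched as a small function. This is a hedged illustration, not CASE tooling: the dict-based representation, the MAC time key names (modifiedTime, accessedTime, createdTime), and the function name are hypothetical, while filePath, FileSystemType, dataPayload, and dataPayloadReferenceURL are the names used in this thread.

```python
# Hypothetical key names for the MAC times collected by a tool.
MAC_TIME_KEYS = ("modifiedTime", "accessedTime", "createdTime")


def select_property_bundles(collected: dict) -> list:
    """Decide which property bundles a Trace gets from the collected metadata."""
    bundles = []

    has_path = bool(collected.get("filePath"))
    has_mac_times = any(collected.get(key) for key in MAC_TIME_KEYS)
    if has_path or has_mac_times:
        # Only file-like data gets a "File" bundle, with its file system type.
        bundles.append({"@type": "File",
                        "FileSystemType": collected.get("FileSystemType")})

    if collected.get("dataPayload") or collected.get("dataPayloadReferenceURL"):
        bundles.append({"@type": "ContentData"})

    # Data pulled from a database with nothing else collected yields no bundles,
    # which corresponds to the blank Trace object.
    return bundles


ntfs_trace = select_property_bundles({
    "filePath": "C:/Users/a/report.docx",
    "modifiedTime": "2017-01-01T00:00:00Z",
    "FileSystemType": "NTFS",
})
db_trace = select_property_bundles({"tableName": "messages",
                                    "columnName": "attachment"})
```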

Quoting mike-parkhill: "In fact, looking more closely at it, does it need to even reference SQLite? Does it need to be that specific, or would a more generic DatabaseObject be appropriate?"

I have no problem with that. Currently SQLiteBlob represents the relationship of a piece of data inside a database through the use of a SQL query. It technically doesn't have to be a blob. It was originally called this because its main use case is to reference a file (like an image attachment) stored inside a database as a blob. However, I suggest naming it something along the lines of "SQLQuery". We can then create a different property bundle for non-SQL databases if needed.
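To make the "relationship through a SQL query" concrete, here is a sketch that resolves tableName, columnName, and rowCondition into a query using Python's built-in sqlite3 module. The function name and the messages table are hypothetical, and treating rowCondition as a WHERE clause with bound parameters is an assumption about how a tool might apply it.

```python
import sqlite3


def fetch_database_object(conn, table_name, column_name, row_condition, params=()):
    """Return the single value located by table, column, and row condition."""
    # Identifiers cannot be bound as SQL parameters, so check them before use.
    for identifier in (table_name, column_name):
        if not identifier.isidentifier():
            raise ValueError("unsafe identifier: %r" % identifier)
    # row_condition is assumed to come from a trusted tool, not from user input.
    sql = 'SELECT "{}" FROM "{}" WHERE {}'.format(column_name, table_name, row_condition)
    row = conn.execute(sql, params).fetchone()
    return None if row is None else row[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT, attachment BLOB)")
conn.execute("INSERT INTO messages (id, body, attachment) VALUES (?, ?, ?)",
             (1, "hello there", b"\xff\xd8\xff\xe0"))

# The same three parameters locate text and blob data alike.
text_value = fetch_database_object(conn, "messages", "body", "id = ?", (1,))
blob_value = fetch_database_object(conn, "messages", "attachment", "id = ?", (1,))
```

Nothing in the lookup depends on the column holding a blob, which is the point made above about the current name being narrower than the behavior.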

Quoting mike-parkhill: "are the types expected to be defined in a formal enum or limited vocabulary? If so, have you thought about how readily that can be extended as new types become popular? (this question applies to all uses of such mechanisms)"

Currently, enums/vocabularies are defined using owl:NamedIndividual. Extending the vocabulary is just as easy as extending property bundles. The new type simply needs to be defined somewhere in the graph or imported from the user's own external ontology. Users can then later decide to request that it be included in the CASE ontology or keep it as a one-off:

:MyNewFS rdf:type owl:NamedIndividual, case:FileSystemType ;
    rdfs:comment "A new file system type!" .

casework / CASE

Dealing with Files #10