CLARIAH / COW

Integrated CSV to RDF converter, using CSVW and nanopublications
MIT License

Option to disable adding the file path to the output #93

Open Sozialarchiv opened 4 years ago

Sozialarchiv commented 4 years ago

Is there an option to prevent the full file path from being added automatically to the output (e.g. to emit only the basename)?

In some cases this can be a privacy issue: the path can contain a username, for example.

ns5:db490c7-50c3-4ad6-b0df-d48fe3dfa984 {
    <https://iisg.amsterdam/48422b27cba4a0e68c9c66d0f7ca614ec688dfcb> ns7:path "/tmp/V2RY7QULW9/web_interface/91a7c0a271826cf3e7e5b470dfd5e345/imf_gdppc.csv"^^xsd:string ;
        ns7:sha1_hash "48422b27cba4a0e68c9c66d0f7ca614ec688dfcb"^^xsd:string .

Sozialarchiv commented 4 years ago

This issue may be related to #36.

melvinroest commented 4 years ago

The user-friendly comment

In a default conversion, one gets something like this:

<https://iisg.amsterdam/mybase/405cbee5590602b3d786d315219350543d25148f> <https://iisg.amsterdam/mybase/vocab/path> "/home/path/to/at-list.csv"^^<http://www.w3.org/2001/XMLSchema#string> <urn:uuid:7ece5d83-d53a-49bc-ae54-bfef5ed0b09a> .

But the question is: what should it become instead? For example, this would look odd:

<https://iisg.amsterdam/mybase/405cbee5590602b3d786d315219350543d25148f> <https://iisg.amsterdam/mybase/vocab/path> "at-list.csv"^^<http://www.w3.org/2001/XMLSchema#string> <urn:uuid:7ece5d83-d53a-49bc-ae54-bfef5ed0b09a> .

That's simply the name of the file, not a path. Relative paths are also not really an option as they have the same dangers as absolute paths.
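To illustrate the point about relative paths, here is a minimal sketch using only the Python standard library. The paths and usernames (alice, bob) are made up for the example, and posixpath is used instead of os.path only so that the result is the same on every platform:

```python
import posixpath

# Hypothetical absolute path of the source file (contains a username).
absolute = "/home/alice/data/at-list.csv"

# Hypothetical working directory from which the conversion is started.
start_dir = "/home/bob/work"

# A relative path can still carry the username of the file's owner.
relative = posixpath.relpath(absolute, start=start_dir)

# Only the basename is free of directory information.
name_only = posixpath.basename(absolute)

print(relative)
print(name_only)
```

In this made-up case the relative form still contains "alice", while the basename does not, which is why a relative path does not solve the privacy concern.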

The dev comment

I took a dive into the code; there needs to be some discussion on the provenance graph specifically.

# A URI that represents the version of the file being converted
self.dataset_version_uri = SDR[source_hash]
self.add((self.dataset_version_uri, SDV['path'], Literal(file_name, datatype=XSD.string)))
self.add((self.dataset_version_uri, SDV['sha1_hash'], Literal(source_hash, datatype=XSD.string)))

# ----
# The nanopublication graph
# ----
name = (os.path.basename(file_name)).split('.')[0]
self.uri = SDR[f"{name}/nanopublication/{hash_part}"]

Note: file_name is the full path (for legacy reasons).

A possible change would be to move name above the line with SDV['path'], so that it can be used there. The issue, though, is that privacy-sensitive information may still be disclosed if a relative path is shown instead of an absolute one. The only safe thing to do here is to include only the name of the file, without any path information. But that changes the semantics of SDV['path']. Therefore, a discussion within CLARIAH is needed.
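As a starting point for that discussion, an opt-in switch could look roughly like the sketch below. This is not existing COW code: the helper provenance_path_value and its basename_only parameter are hypothetical names, and the sketch leaves out the rdflib part, showing only which string would end up in the SDV['path'] literal. It uses posixpath so that it behaves the same on every platform.

```python
import posixpath


def provenance_path_value(file_name: str, basename_only: bool = False) -> str:
    """Return the string to store as the path literal (hypothetical helper).

    By default the full path is kept, matching the current behaviour.
    With basename_only=True, all directory information is dropped.
    """
    if basename_only:
        return posixpath.basename(file_name)
    return file_name


# Path taken from the example output at the top of this issue.
full = "/tmp/V2RY7QULW9/web_interface/91a7c0a271826cf3e7e5b470dfd5e345/imf_gdppc.csv"

default_value = provenance_path_value(full)
private_value = provenance_path_value(full, basename_only=True)

print(default_value)
print(private_value)
```

Keeping the full path as the default would leave existing output unchanged, while users who care about privacy could switch it off explicitly. Note that, unlike name in the snippet above, the basename here keeps the file extension. Whether a bare file name still fits the meaning of SDV['path'] remains the open question.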