MI-FraunhoferIWM / data2rdf

About A generic pipeline that can be used to map raw data to RDF.
BSD 3-Clause "New" or "Revised" License

UUID of instances are NOT unique #18

Closed THuschle closed 5 months ago

THuschle commented 1 year ago

What is the issue: If you upload multiple files, you can let the pipeline generate UUIDs that are used in the namespace to differentiate the instances from multiple datasets (source files).

The issue is that these "UUIDs" are not unique: the function that generates them is a hash function. Files with the same name therefore produce the same UUID, which causes problems in the triple store because more than one instance ends up with exactly the same name.

The function is located in csv_parser.py and excel_parser.py:

    def generate_file_uuid(self):
        # add file_uuid using unique hashsum of the file
        # with open(f_path, 'r', encoding=encoding) as file:

        self.id_hash = sha256sum(self.f_path)
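
The collision can be reproduced with a minimal sketch. It uses a stand-in helper, `digest_of` (hypothetical; it is not data2rdf's `sha256sum`, whose implementation is not shown in this issue), and only demonstrates that a hash is deterministic: identical input always yields an identical ID.

```python
import hashlib


def digest_of(data: bytes) -> str:
    """Hypothetical stand-in for a hash-based ID: SHA-256 hex digest of the input."""
    return hashlib.sha256(data).hexdigest()


# Two uploads whose hashed input (here: the same file name) is identical ...
first_upload = digest_of(b"measurement.csv")
second_upload = digest_of(b"measurement.csv")

# ... receive the same ID, so their instances end up in the same namespace.
ids_collide = first_upload == second_upload
```

Whatever `sha256sum` actually hashes, any two uploads that present the same input to it will collide in this way.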

How to solve: Replace the hash function with UUID generation.
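
One possible shape of that change, as a sketch rather than the fix that was actually merged: the class `FileIdExample` is a hypothetical stand-in for the parser classes, and the choice of `uuid.uuid4()` from the Python standard library is an assumption. The attribute name `id_hash` is kept from the snippet above so that downstream code would not need to change.

```python
import uuid


class FileIdExample:
    """Hypothetical stand-in for the parsers in csv_parser.py / excel_parser.py."""

    def __init__(self, f_path: str) -> None:
        self.f_path = f_path
        self.id_hash = None

    def generate_file_uuid(self) -> None:
        # uuid4() is random, so the ID no longer depends on the file name or path.
        # .hex yields 32 hexadecimal characters without dashes.
        self.id_hash = uuid.uuid4().hex


# Two parsers created for the same path now get different IDs.
a = FileIdExample("data/measurement.csv")
b = FileIdExample("data/measurement.csv")
a.generate_file_uuid()
b.generate_file_uuid()
```

The trade-off is that the ID is no longer reproducible: running the pipeline twice on the same file yields two different namespaces, which is the behaviour requested here but may matter if repeated uploads are expected to overwrite earlier results.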

How the result should look:

A unique ID for every file, independent of the input file name or path. Example of where the UUID is used in the final ttl file: @prefix fileid: <https://w3id.org/cold-forging/a5862e20e3aaaa07c9ab364863fadd5c1901f723034c5ccecb33acc9a1016dcc#> .

yoavnash commented 5 months ago

Solved