MI-FraunhoferIWM / data2rdf

About A generic pipeline that can be used to map raw data to RDF.
BSD 3-Clause "New" or "Revised" License

Mapping to Method not successfully #16

Closed THuschle closed 1 week ago

THuschle commented 1 year ago

What is the issue

When using pandas 1.5.3, instances are not mapped as described in the method graph / mapping.xlsx; instead, an additional instance is generated. As a result, there are two unconnected instances: one with the data and one with the value. [image]

How to solve

The issue does not occur with pandas 1.1.5, but that version is not supported on Python 3.10. Investigate the differences between the two pandas versions and how and why they affect the mapping of instances.

What should the solution look like

The result should look the same as it already does with pandas 1.1.5. [image]

THuschle commented 1 year ago

Code used for starting pipeline:

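import os
import uuid

# Assumed import path (data2rdf 1.x); adjust if your installed version differs.
from data2rdf.annotation_pipeline import AnnotationPipeline

working_folder = os.getcwd()
output_folder = os.path.join(working_folder,"output")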

template = os.path.join(working_folder,'input','method-graph','ColdForging-Method.ttl')
mapping_file = os.path.join(working_folder,'input','mappings','mapping.xlsx')
raw_data = os.path.join(working_folder,'input','data','GCFG-Studie_Datenraum_v1_2023_02_02.xlsx')
location_mapping = os.path.join(working_folder,'input','mappings','location_mapping.xlsx')

parser = "excel"
parser_args = {
    "location_mapping_f_path": location_mapping,
}

raw_data_path = os.path.join(working_folder,'input','data')

pipeline = AnnotationPipeline(
    None,  # presumably the input file; it is set per file in the loop below
    parser,
    parser_args,
    template,
    mapping_file,
    output_folder,
    # mapping_db_path,
    base_iri="https://w3id.org/cold-forging",
    only_use_base_iri=False,
)

# Iterate through the folder and run the pipeline for each file
for entry in os.scandir(raw_data_path):
    title, file_format = os.path.splitext(entry.name)
    if file_format != '.xlsx':
        raise Warning('source folder not pure.')
    pipeline.input_file = entry.path
    pipeline.run_pipeline()
    pipeline.export_ttl(os.path.join(output_folder, uuid.uuid4().hex + '.ttl'))
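
Note that `raise Warning(...)` in the loop above aborts the whole run at the first entry that is not an .xlsx file. A minimal sketch of an alternative, assuming that skipping such entries with a warning is acceptable; `collect_xlsx_files` is a hypothetical helper and not part of data2rdf:

```python
import os
import warnings


def collect_xlsx_files(folder):
    """Return the sorted paths of all .xlsx files directly inside `folder`."""
    paths = []
    for entry in os.scandir(folder):
        _, file_format = os.path.splitext(entry.name)
        if entry.is_file() and file_format.lower() == ".xlsx":
            paths.append(entry.path)
        else:
            # Warn and continue instead of raising, so one stray file
            # does not stop the remaining inputs from being processed.
            warnings.warn(f"Skipping non-.xlsx entry: {entry.name}")
    return sorted(paths)
```

The loop could then iterate over `collect_xlsx_files(raw_data_path)` and assign each path to `pipeline.input_file` before calling `pipeline.run_pipeline()`.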

Response of the pipeline:

GitLab latest version (no issue):

GitHub latest version:

THuschle commented 1 year ago

Example Input file: 16MnCr5 (2).xlsx

Mapping files : location_mapping.xlsx mapping.xlsx

Method Graph:

ColdForging-Method.zip

THuschle commented 1 year ago

@pablo-de-andres @deepukr007 Downgrading pandas to 1.1.5 worked for running the pipeline for now. So we know we can use the version here on GitHub for the current projects and fix other issues on that basis. I would still keep this issue open and retitle it "Enable use of pandas 1.5.3 without mapping issues", because pandas 1.1.5 is not supported on Python 3.10.

pablo-de-andres commented 1 year ago

I think I would create a separate issue for that specific problem. Also, if the code does not work with Python 3.10, we should define that restriction in the configuration.
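
Besides declaring it in the packaging configuration, such a restriction could also be checked at runtime. A minimal sketch, assuming the restriction is "Python below 3.10" as discussed in this thread; the bound and the names `python_supported` and `require_supported_python` are illustrative, not taken from the project:

```python
import sys

# Assumption: the code base is restricted to Python versions below 3.10.
# Adjust or remove this bound once newer pandas versions work.
MAX_PYTHON_EXCLUSIVE = (3, 10)


def python_supported(version_info=None, max_exclusive=MAX_PYTHON_EXCLUSIVE):
    """Return True if the (major, minor) version is below the exclusive bound."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) < tuple(max_exclusive)


def require_supported_python():
    """Raise a RuntimeError with a clear message on unsupported interpreters."""
    if not python_supported():
        bound = ".".join(map(str, MAX_PYTHON_EXCLUSIVE))
        raise RuntimeError(f"This pipeline currently requires Python < {bound}")
```

With setuptools, the same bound would typically also be declared through the `python_requires` option, so that installation fails early on unsupported interpreters.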

MBueschelberger commented 1 week ago

Starting with v2.0.0, we use pandas >=2,<3, and the method graph is optional. Hence I will close this issue.