Extract most important information from XML into simplistic format

okworx commented 1 year ago

Extract the information items from the XML into the data structure from #12. See the "ILCD Format in a nutshell" guide for details on what to find where.

First, we need to parse all the flows in order to have the information ready for looking it up later when we read the process(es).

Process

default namespace http://lca.jrc.it/ILCD/Process xmlns:common="http://lca.jrc.it/ILCD/Common"

Metadata

[ ] UUID (string) /processDataSet/processInformation/dataSetInformation/common:UUID
[ ] name (string) This consists of 4 parts: /processDataSet/processInformation/dataSetInformation/name/baseName /processDataSet/processInformation/dataSetInformation/name/treatmentStandardsRoutes /processDataSet/processInformation/dataSetInformation/name/mixAndLocationTypes /processDataSet/processInformation/dataSetInformation/name/functionalUnitFlowProperties which we want to concatenate using a semicolon followed by a space ("; ") as the separator.
[ ] reference year (number) /processDataSet/processInformation/time/common:referenceYear
[ ] valid until year (number) /processDataSet/processInformation/time/common:dataSetValidUntil
[ ] geographical representativity (location code, string) /processDataSet/processInformation/geography/locationOfOperationSupplyOrProduction/@location
[ ] reference product(s) The exchange with the reference flow is this one /processDataSet/exchanges/exchange[@dataSetInternalID=/processDataSet/processInformation/quantitativeReference/referenceToReferenceFlow]
- [ ] reference product internal id (integer) /processDataSet/processInformation/quantitativeReference/referenceToReferenceFlow We will need this internally for parsing and processing.
- [ ] reference product name (string) /processDataSet/exchanges/exchange[@dataSetInternalID=/processDataSet/processInformation/quantitativeReference/referenceToReferenceFlow]/referenceToFlowDataSet/common:shortDescription
- [ ] reference product amount (number) /processDataSet/exchanges/exchange[@dataSetInternalID=/processDataSet/processInformation/quantitativeReference/referenceToReferenceFlow]/resultingAmount This will need to be multiplied by the amount from the flow.
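As a rough illustration of the metadata paths above, here is a minimal sketch using only the standard library's `xml.etree.ElementTree`. The sample XML is a hand-written toy, not a real dataset; the function name, the output keys, and the choice to skip missing name parts when concatenating are all assumptions. ElementTree's limited XPath cannot compare an attribute against another path, so the reference exchange is resolved in two steps.

```python
import xml.etree.ElementTree as ET

# "p" is an arbitrary alias for the default Process namespace quoted in this issue.
NS = {
    "p": "http://lca.jrc.it/ILCD/Process",
    "common": "http://lca.jrc.it/ILCD/Common",
}

# Hand-written toy dataset for illustration only, not a real ILCD file.
SAMPLE = """<processDataSet xmlns="http://lca.jrc.it/ILCD/Process"
    xmlns:common="http://lca.jrc.it/ILCD/Common">
  <processInformation>
    <dataSetInformation>
      <common:UUID>00000000-0000-0000-0000-000000000001</common:UUID>
      <name>
        <baseName>Electricity</baseName>
        <mixAndLocationTypes>production mix</mixAndLocationTypes>
      </name>
    </dataSetInformation>
    <quantitativeReference>
      <referenceToReferenceFlow>0</referenceToReferenceFlow>
    </quantitativeReference>
    <time><common:referenceYear>2020</common:referenceYear></time>
    <geography><locationOfOperationSupplyOrProduction location="DE"/></geography>
  </processInformation>
  <exchanges>
    <exchange dataSetInternalID="0">
      <referenceToFlowDataSet refObjectId="00000000-0000-0000-0000-0000000000f1">
        <common:shortDescription>electricity</common:shortDescription>
      </referenceToFlowDataSet>
      <resultingAmount>3.6</resultingAmount>
    </exchange>
  </exchanges>
</processDataSet>"""


def text(node, path):
    """Return the stripped text at `path`, or None if anything is missing."""
    found = node.find(path, NS) if node is not None else None
    return found.text.strip() if found is not None and found.text else None


def extract_process_metadata(root):
    """Collect the metadata items listed in the issue (hypothetical output keys)."""
    info = "p:processInformation"
    parts = ("baseName", "treatmentStandardsRoutes",
             "mixAndLocationTypes", "functionalUnitFlowProperties")
    names = [text(root, f"{info}/p:dataSetInformation/p:name/p:{p}") for p in parts]
    # Step 1: read the internal ID; step 2: find the exchange carrying that ID.
    ref_id = text(root, f"{info}/p:quantitativeReference/p:referenceToReferenceFlow")
    ref = root.find(f"p:exchanges/p:exchange[@dataSetInternalID='{ref_id}']", NS)
    geo = root.find(f"{info}/p:geography/p:locationOfOperationSupplyOrProduction", NS)
    year = text(root, f"{info}/p:time/common:referenceYear")
    valid = text(root, f"{info}/p:time/common:dataSetValidUntil")
    amount = text(ref, "p:resultingAmount")
    return {
        "uuid": text(root, f"{info}/p:dataSetInformation/common:UUID"),
        "name": "; ".join(n for n in names if n),  # assumption: skip missing parts
        "reference_year": int(year) if year else None,
        "valid_until": int(valid) if valid else None,
        "location": geo.get("location") if geo is not None else None,
        "reference_product_id": int(ref_id) if ref_id else None,
        "reference_product_name": text(ref, "p:referenceToFlowDataSet/common:shortDescription"),
        "reference_product_amount": float(amount) if amount else None,
    }


meta = extract_process_metadata(ET.fromstring(SAMPLE))
```

Real datasets usually carry several language variants of the name elements, which this sketch ignores: it simply takes the first match for each path.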

Inventory

[ ] for each exchange: here is the list of exchanges: /processDataSet/exchanges Each of them is uniquely identified by its dataSetInternalID attribute. One (or multiple, we only need to support one for now) of them is the reference product - the one whose @dataSetInternalID attribute matches the "reference product internal id" from above
- [ ] internal ID (integer) exchange/@dataSetInternalID we need that for internal processing
- [ ] flow name (string) exchange/referenceToFlowDataSet/common:shortDescription
- [ ] flow UUID (string) exchange/referenceToFlowDataSet/@refObjectId
- [ ] exchange direction (string) exchange/exchangeDirection
- [ ] exchange amount (double) exchange/resultingAmount
For each exchange, we'll need to look up the actual flow that is referenced (from the list of flows that we have parsed before) by its UUID and then read the flow's name, compartment, flow amount and unit. The amount from the exchange and the amount from the flow need to be multiplied and yield the actual resulting amount for this exchange.
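The lookup-and-multiply step could look roughly like the sketch below. It assumes the flows have already been parsed into a dictionary keyed by UUID; all dictionary keys and the function name are hypothetical, and keeping unresolved exchanges (flagged as unlinked) rather than dropping them is a design assumption, not something decided in this issue.

```python
def resolve_exchanges(exchanges, flows_by_uuid):
    """Join each exchange with its referenced flow and compute the resulting amount."""
    resolved = []
    for exc in exchanges:
        flow = flows_by_uuid.get(exc["flow_uuid"])
        if flow is None:
            # Assumption: keep unresolved exchanges visible instead of dropping them.
            resolved.append({**exc, "linked": False})
            continue
        resolved.append({
            **exc,
            "linked": True,
            "flow_name": flow["name"],
            "compartment": flow["compartment"],
            "unit": flow["unit"],
            # Exchange amount times flow amount gives the actual amount.
            "amount": exc["exchange_amount"] * flow["flow_amount"],
        })
    return resolved


# Toy data for illustration only.
flows_by_uuid = {
    "f1": {"name": "carbon dioxide", "compartment": "air",
           "flow_amount": 4.0, "unit": "kg"},
}
exchanges = [
    {"internal_id": 1, "flow_uuid": "f1", "direction": "Output", "exchange_amount": 0.5},
    {"internal_id": 2, "flow_uuid": "unknown", "direction": "Input", "exchange_amount": 1.0},
]
resolved = resolve_exchanges(exchanges, flows_by_uuid)
```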

Flow

default namespace http://lca.jrc.it/ILCD/Flow xmlns:common="http://lca.jrc.it/ILCD/Common"

[ ] name (string) /flowDataSet/flowInformation/dataSetInformation/name/baseName
[ ] UUID (string) /flowDataSet/flowInformation/dataSetInformation/common:UUID
[ ] compartment (string) /flowDataSet/flowInformation/dataSetInformation/classificationInformation/common:elementaryFlowCategorization/common:category[@level=2]
[ ] type of flow (string) /flowDataSet/modellingAndValidation/LCIMethod/typeOfDataSet
[ ] reference flow property amount (double) /flowDataSet/flowProperties/flowProperty[@dataSetInternalID=/flowDataSet/flowInformation/quantitativeReference/referenceToReferenceFlowProperty]/meanValue
[ ] reference flow property UUID (string) /flowDataSet/flowProperties/flowProperty[@dataSetInternalID=/flowDataSet/flowInformation/quantitativeReference/referenceToReferenceFlowProperty]/referenceToFlowPropertyDataSet/@refObjectId
With the UUID of the reference flow property, we can use the lookup function @grain11 wrote to look up the unit.
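A minimal sketch of reading these flow fields with `xml.etree.ElementTree` follows. The sample XML, the UUIDs, the function name, and the output keys are made up for illustration. The unit lookup itself is left out because its interface is not described here, so the sketch only returns the flow property UUID.

```python
import xml.etree.ElementTree as ET

# "f" is an arbitrary alias for the default Flow namespace quoted in this issue.
FLOW_NS = {
    "f": "http://lca.jrc.it/ILCD/Flow",
    "common": "http://lca.jrc.it/ILCD/Common",
}

# Hand-written toy flow for illustration only; the UUIDs are placeholders.
FLOW_SAMPLE = """<flowDataSet xmlns="http://lca.jrc.it/ILCD/Flow"
    xmlns:common="http://lca.jrc.it/ILCD/Common">
  <flowInformation>
    <dataSetInformation>
      <common:UUID>00000000-0000-0000-0000-0000000000f1</common:UUID>
      <name><baseName>carbon dioxide</baseName></name>
      <classificationInformation>
        <common:elementaryFlowCategorization>
          <common:category level="0">Emissions</common:category>
          <common:category level="1">Emissions to air</common:category>
          <common:category level="2">Emissions to air, unspecified</common:category>
        </common:elementaryFlowCategorization>
      </classificationInformation>
    </dataSetInformation>
    <quantitativeReference>
      <referenceToReferenceFlowProperty>0</referenceToReferenceFlowProperty>
    </quantitativeReference>
  </flowInformation>
  <modellingAndValidation>
    <LCIMethod><typeOfDataSet>Elementary flow</typeOfDataSet></LCIMethod>
  </modellingAndValidation>
  <flowProperties>
    <flowProperty dataSetInternalID="0">
      <referenceToFlowPropertyDataSet refObjectId="00000000-0000-0000-0000-0000000000a1"/>
      <meanValue>1.0</meanValue>
    </flowProperty>
  </flowProperties>
</flowDataSet>"""


def flow_text(node, path):
    """Return the stripped text at `path`, or None if anything is missing."""
    found = node.find(path, FLOW_NS) if node is not None else None
    return found.text.strip() if found is not None and found.text else None


def parse_flow(root):
    """Extract the flow fields listed in the issue (hypothetical output keys)."""
    info = "f:flowInformation"
    ds = f"{info}/f:dataSetInformation"
    # Two-step lookup: read the internal ID, then select the matching flow property.
    prop_id = flow_text(root, f"{info}/f:quantitativeReference/f:referenceToReferenceFlowProperty")
    prop = root.find(f"f:flowProperties/f:flowProperty[@dataSetInternalID='{prop_id}']", FLOW_NS)
    ref = prop.find("f:referenceToFlowPropertyDataSet", FLOW_NS) if prop is not None else None
    mean = flow_text(prop, "f:meanValue")
    return {
        "name": flow_text(root, f"{ds}/f:name/f:baseName"),
        "uuid": flow_text(root, f"{ds}/common:UUID"),
        "compartment": flow_text(
            root,
            f"{ds}/f:classificationInformation/common:elementaryFlowCategorization"
            "/common:category[@level='2']",
        ),
        "type": flow_text(root, "f:modellingAndValidation/f:LCIMethod/f:typeOfDataSet"),
        "flow_amount": float(mean) if mean else None,
        "flow_property_uuid": ref.get("refObjectId") if ref is not None else None,
    }


flow = parse_flow(ET.fromstring(FLOW_SAMPLE))
# Index the parsed flows by UUID so the exchanges can look them up later.
parsed_flows = {flow["uuid"]: flow}
```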

shirubana commented 1 year ago

@okworx is the ILCD importer mostly finished or worked out? Where can I test it, and where is the working repo or fork for it?

mfastudillo commented 1 year ago

@shirubana in this fork: https://github.com/mfastudillo/brightway2-io.

Try this:

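from pathlib import Path
import bw2io
import bw2calc
import bw2data
from bw2io.importers.ilcd import ILCDImporter
import pandas as pd

# Create or switch to a project and set up the default biosphere database
bw2data.projects.set_current('ilcd_import')
bw2io.bw2setup()

# Load the example ILCD archive and apply the importer's strategies
path_to_example = Path('bw2io/data/examples/ilcd-example.zip')
so = ILCDImporter(dirpath=path_to_example, dbname='example_ilcd')
so.apply_strategies()

# Link exchanges to biosphere3 first, then within the imported database itself
so.match_database('biosphere3', fields=['database', 'code'])
so.match_database(fields=['database', 'code'])
so.statistics()

# Drop whatever is still unlinked, then write the database
so.drop_unlinked(True)
so.write_database()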

You can pick an ILCD example from the GLAD website; I tried a few and it works. Quite a number of elementary flows are not matched and need to be dropped.

JosePauloSavioli commented 1 year ago

I had to do a similar process in another project, the Lavoisier, an LCI data format converter (https://github.com/JosePauloSavioli/Lavoisier).

The process worked with a reading function and a mapping class. The mapping class would have an output dictionary to populate, and a mapping dictionary with keys as XML fields and values as function calls to modify the data from these fields and populate the output dictionary. It was something like this (the reading function would take out namespaces automatically):


mapping = {
    # key: XML path with namespaces stripped; value: callback that converts and stores the value
    "/processDataSet/processInformation/dataSetInformation/UUID": lambda x: setattr(self.output_dict, "UUID", self.modify_UUID(x))
}

The mapping dictionary is passed to the reading function. The reading function parses the XML and verifies if the element is in the mapping. If it is, it organizes the XML data in a dictionary (like the xmltodict library does) and calls the function bound to the element within the mapping dict with the data. The reading function then modifies the data for the new format and returns it to the lambda function, which sets it in the output dictionary. This is the basic flow, but for Lavoisier, the output dict would be an abstraction of the output format of the conversion.
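To make the flow described above concrete, here is a simplified sketch of a streaming reader driven by a mapping dictionary. It is not Lavoisier's actual code: the class, the function names, and the lowercase transformation are hypothetical, the handlers receive only the element text rather than an xmltodict-style dictionary, and `xml.etree.ElementTree.iterparse` stands in for whatever parser the original uses.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def read_with_mapping(source, mapping):
    """Stream through the XML and call the handler registered for each matching path."""
    path = []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(local_name(elem.tag))
            continue
        # On "end" the element is complete, so its text can be handed over.
        handler = mapping.get("/" + "/".join(path))
        if handler is not None:
            handler((elem.text or "").strip())
        path.pop()
        elem.clear()  # release the element's content once it has been handled


class ProcessMapping:
    """Holds the output dictionary and the path-to-handler mapping."""

    def __init__(self):
        self.output = {}
        self.mapping = {
            "/processDataSet/processInformation/dataSetInformation/UUID":
                lambda x: self.output.update(UUID=self.modify_uuid(x)),
        }

    def modify_uuid(self, value):
        # Hypothetical transformation; stands in for whatever the target format needs.
        return value.lower()


# Hand-written toy input for illustration only.
SAMPLE_BYTES = b"""<processDataSet xmlns="http://lca.jrc.it/ILCD/Process"
    xmlns:common="http://lca.jrc.it/ILCD/Common">
  <processInformation>
    <dataSetInformation>
      <common:UUID>ABC-123</common:UUID>
    </dataSetInformation>
  </processInformation>
</processDataSet>"""

converter = ProcessMapping()
read_with_mapping(io.BytesIO(SAMPLE_BYTES), converter.mapping)
```

Because the handlers fire while the file is being read, a field that depends on data appearing later in the document has to be buffered by the mapping class, which is one way the parsing drawbacks of a streaming design show up.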

The process worked well with LCA inventory data, as it could perform the conversion on individual fields or on sets of fields (like passing all data from one exchange to a class that could modify it for the new format). Still, it had some minor drawbacks related to parsing, because the XML is treated as a continuous stream of information and the dataset is never fully loaded in memory.

I saw @mfastudillo is ahead on developing an ILCD importer for Brightway. This interests me a lot, since one of the issues I have with converting datasets is that there is no software where I can import both a .spold and an ILCD .zip file to compare the information in them (https://github.com/JosePauloSavioli/Lavoisier/issues/3). I also had to study the ILCD (and EcoSpold 2) format a lot to make the conversion possible, so I have extensive knowledge of the format and of reading and working with data in it, and I have been through the struggle of mapping elementary flows between formats D:

@mfastudillo If you are interested, I would like to help you develop the importer. I'm open to a meeting or an exchange of emails if you want (my GitHub page has the email). I could fork it, but I still have limited knowledge of Brightway, so I can help better in other ways.

mfastudillo commented 1 year ago

Hi @JosePauloSavioli, sure, contributions are more than welcome! I'll try to update the issues. The importer follows an extract-transform-load logic, and one of the trickiest parts is the "extract" step, where we parse different fields of the ILCD zip file into a list of dictionaries. This does not require much Brightway knowledge, but knowledge of the ILCD format is very useful.

JosePauloSavioli commented 1 year ago

Hmmm, this is really a tricky part. I saw difficulties in 4 ways:

ILCD zip files can be single or multi process, so basically one can have the entire database in one file or separated between several of them. The directories also can be nested or in the zip file root folder
eILCD has an additional layer of information in the life cycle model dataset
References use the URI attribute, but sometimes it doesn't match or doesn't exist. There were real-world datasets where I had to search by refObjectId alone because the referenced file inside the zip file had a different version
Different software adds different 'flavours' to ILCD. Examples: (i) OpenLCA, which adds an entire namespace of information on top of ILCD that can duplicate information and sometimes comes with EcoSpold 2 UUIDs (the user can pick EcoSpold 2 flows to work with inside the software and the UUID is not modified upon export, really fun) and (ii) GaBi, which adds new combinations of FlowProperties and UnitGroups
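One possible way to cope with the first and third difficulties (nested directories, unreliable URIs and versions) is to ignore file paths entirely and index every XML file in the archive by the `common:UUID` it declares. The sketch below is an assumption about how that could be done, not what any existing importer does: the function name and the toy archive are made up, and it assumes the first `common:UUID` element in a file is the dataset's own identifier.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from collections import defaultdict

COMMON_UUID = "{http://lca.jrc.it/ILCD/Common}UUID"


def index_datasets_by_uuid(zip_source):
    """Map each dataset UUID to the archive members that declare it."""
    index = defaultdict(list)
    with zipfile.ZipFile(zip_source) as archive:
        for member in archive.namelist():
            if not member.lower().endswith(".xml"):
                continue
            try:
                root = ET.fromstring(archive.read(member))
            except ET.ParseError:
                continue  # skip files that are not well-formed XML
            node = root.find(f".//{COMMON_UUID}")
            if node is not None and node.text:
                # A list, because several versions of one dataset may coexist.
                index[node.text.strip()].append(member)
    return dict(index)


def toy_dataset(uuid):
    """Build a minimal, hand-written process dataset for the demo archive."""
    return (
        '<processDataSet xmlns="http://lca.jrc.it/ILCD/Process" '
        'xmlns:common="http://lca.jrc.it/ILCD/Common">'
        "<processInformation><dataSetInformation>"
        f"<common:UUID>{uuid}</common:UUID>"
        "</dataSetInformation></processInformation></processDataSet>"
    )


# In-memory toy archive: one dataset near the root, one nested deeper.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("ILCD/processes/aaa_01.00.000.xml", toy_dataset("aaa"))
    archive.writestr("nested/export/ILCD/processes/bbb.xml", toy_dataset("bbb"))
    archive.writestr("META-INF/MANIFEST.MF", "not xml")
buffer.seek(0)
uuid_index = index_datasets_by_uuid(buffer)
```

A reference can then be resolved with `uuid_index.get(ref_object_id)`, falling back to version matching only when more than one member is returned. The cost is that every XML file is parsed once up front.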

I think this can become pretty specific. @mfastudillo, would you mind sharing more about the difficulties that you are having in extracting? Do you prefer me to discuss this in the Issues of your fork?

mfastudillo commented 1 year ago

Hi @JosePauloSavioli , yes I think the issues in the forked repository are a better place to discuss the main issues

cmutel commented 10 months ago

Issue closed during cleanup for Brightcon 2023
