brightway-lca / hackathons

Organization of Brightway hackathons

Extract most important information from XML into simplistic format #13

Closed (okworx closed this issue 10 months ago)

okworx commented 1 year ago

Extract the information items from the XML into the data structure from #12. See the "ILCD Format in a nutshell" guide for details on what to find where.

First, we need to parse all the flows in order to have the information ready for looking it up later when we read the process(es).

Process

default namespace http://lca.jrc.it/ILCD/Process xmlns:common="http://lca.jrc.it/ILCD/Common"

Metadata

Inventory

Flow

default namespace http://lca.jrc.it/ILCD/Flow xmlns:common="http://lca.jrc.it/ILCD/Common"
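A minimal sketch of the "flows first, then processes" order described above, using only the standard library. The namespace URIs are the ones listed in this issue; the element paths (`flowInformation/dataSetInformation/common:UUID`, `name/baseName`, `exchanges/exchange/referenceToFlowDataSet`, `meanAmount`) and the inline sample documents are assumptions made for illustration, and `parse_flow` and `parse_exchanges` are hypothetical helper names, not part of any existing importer.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as listed in this issue; the prefixes are local choices.
NS = {
    "flow": "http://lca.jrc.it/ILCD/Flow",
    "process": "http://lca.jrc.it/ILCD/Process",
    "common": "http://lca.jrc.it/ILCD/Common",
}

# Tiny stand-in documents (assumed layout, not real datasets).
FLOW_XML = """<flowDataSet xmlns="http://lca.jrc.it/ILCD/Flow"
    xmlns:common="http://lca.jrc.it/ILCD/Common">
  <flowInformation>
    <dataSetInformation>
      <common:UUID>00000000-0000-0000-0000-000000000001</common:UUID>
      <name><baseName xml:lang="en">carbon dioxide</baseName></name>
    </dataSetInformation>
  </flowInformation>
</flowDataSet>"""

PROCESS_XML = """<processDataSet xmlns="http://lca.jrc.it/ILCD/Process"
    xmlns:common="http://lca.jrc.it/ILCD/Common">
  <exchanges>
    <exchange dataSetInternalID="0">
      <referenceToFlowDataSet refObjectId="00000000-0000-0000-0000-000000000001"/>
      <meanAmount>1.5</meanAmount>
    </exchange>
  </exchanges>
</processDataSet>"""


def parse_flow(xml_text):
    """Return the UUID and base name of one flow dataset."""
    root = ET.fromstring(xml_text)
    info = root.find("flow:flowInformation/flow:dataSetInformation", NS)
    return {
        "uuid": info.findtext("common:UUID", namespaces=NS),
        "name": info.findtext("flow:name/flow:baseName", namespaces=NS),
    }


def parse_exchanges(xml_text, flows_by_uuid):
    """Read the exchanges of a process and attach the flow name looked up by UUID."""
    root = ET.fromstring(xml_text)
    exchanges = []
    for exc in root.iterfind("process:exchanges/process:exchange", NS):
        ref = exc.find("process:referenceToFlowDataSet", NS)
        flow_uuid = ref.get("refObjectId")
        exchanges.append({
            "flow_uuid": flow_uuid,
            "flow_name": flows_by_uuid.get(flow_uuid, {}).get("name"),
            "amount": float(exc.findtext("process:meanAmount", namespaces=NS)),
        })
    return exchanges


# Step 1: index all flows by UUID. Step 2: read the process and look flows up.
flows_by_uuid = {f["uuid"]: f for f in [parse_flow(FLOW_XML)]}
exchanges = parse_exchanges(PROCESS_XML, flows_by_uuid)
print(exchanges)
```

Building the UUID index first means each exchange can be resolved with a dictionary lookup instead of re-opening the flow files while the process is being read.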

shirubana commented 1 year ago

@okworx is the ILCD importer mostly finished or worked out? Where can I test it, and where is the working repo or fork for it?

mfastudillo commented 1 year ago

@shirubana in this fork: https://github.com/mfastudillo/brightway2-io.

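Try this:

from pathlib import Path

import bw2data
import bw2io
from bw2io.importers.ilcd import ILCDImporter

# Create (or switch to) a project and install the default biosphere and LCIA methods
bw2data.projects.set_current('ilcd_import')
bw2io.bw2setup()

# The path is relative, so this assumes you run from the root of the fork
path_to_example = Path('bw2io/data/examples/ilcd-example.zip')
so = ILCDImporter(dirpath=path_to_example, dbname='example_ilcd')
so.apply_strategies()

# Link exchanges to biosphere3 first, then within the imported database itself
so.match_database('biosphere3', fields=['database', 'code'])
so.match_database(fields=['database', 'code'])
so.statistics()

# Drop the exchanges that are still unlinked, then write the database
so.drop_unlinked(True)
so.write_database()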

You can pick an ILCD example from the GLAD website; I tried a few and it works. However, quite a number of elementary flows are not matched and need to be dropped.

JosePauloSavioli commented 1 year ago

I had to do something similar in another project, Lavoisier, an LCI data format converter (https://github.com/JosePauloSavioli/Lavoisier).

The process worked with a reading function and a mapping class. The mapping class would have an output dictionary to populate, and a mapping dictionary with keys as XML fields and values as function calls to modify the data from these fields and populate the output dictionary. It was something like this (the reading function would take out namespaces automatically):


mapping = {
      "/processDataSet/processInformation/dataSetInformation/UUID": lambda x: setattr(self.output_dict, "UUID", self.modify_UUID(x))
}

The mapping dictionary is passed to the reading function. The reading function parses the XML and checks whether each element is in the mapping. If it is, it organizes the XML data in a dictionary (like the xmltodict library does) and calls the function bound to that element in the mapping dict with the data. The bound function (modify_UUID in the snippet) converts the data to the new format and returns it to the lambda, which sets it in the output dictionary. This is the basic flow, but for Lavoisier, the output dict would be an abstraction of the output format of the conversion.
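The flow described above can be sketched roughly as follows. This is not Lavoisier's actual code: `read`, `strip_ns`, `ProcessMapping`, `_store`, and `modify_uuid` are hypothetical names, the sample XML is made up, and the handler receives plain element text rather than an xmltodict-style dictionary to keep the example short.

```python
import io
import xml.etree.ElementTree as ET

SAMPLE = b"""<processDataSet xmlns="http://lca.jrc.it/ILCD/Process"
    xmlns:common="http://lca.jrc.it/ILCD/Common">
  <processInformation>
    <dataSetInformation>
      <common:UUID>ABCDEF00-0000-0000-0000-000000000001</common:UUID>
    </dataSetInformation>
  </processInformation>
</processDataSet>"""


def strip_ns(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def read(xml_source, mapping):
    """Stream the XML; when a mapped element closes, call its handler with the text."""
    path = []
    for event, elem in ET.iterparse(xml_source, events=("start", "end")):
        if event == "start":
            path.append(strip_ns(elem.tag))
        else:
            handler = mapping.get("/" + "/".join(path))
            if handler is not None:
                handler((elem.text or "").strip())
            path.pop()
            elem.clear()  # discard the element's content once it has been handled


class ProcessMapping:
    """Holds the output dictionary and the path -> handler mapping."""

    def __init__(self):
        self.output = {}
        self.mapping = {
            "/processDataSet/processInformation/dataSetInformation/UUID":
                lambda x: self._store("UUID", self.modify_uuid(x)),
        }

    def _store(self, key, value):
        self.output[key] = value

    @staticmethod
    def modify_uuid(value):
        # Placeholder transformation: normalize the UUID to lower case.
        return value.lower()


converter = ProcessMapping()
read(io.BytesIO(SAMPLE), converter.mapping)
print(converter.output)
```

Because the reader works on a stream of start and end events, supporting another field only means adding one more entry to the mapping dictionary.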

The process worked well with LCA inventory data, as it could perform the conversion on single fields or on sets of fields (like passing all data from one exchange to a class that could modify data for the new format). Still, it had some minor drawbacks related to parsing, because the XML is treated as a continuous stream of information and the dataset is never fully loaded in memory.

I saw that @mfastudillo is already ahead on developing an ILCD importer for Brightway. This interests me a lot, since one of the issues I have with converting datasets is that there is no software where I can import both a .spold and an ILCD .zip file to compare the information in them (https://github.com/JosePauloSavioli/Lavoisier/issues/3). I also had to study the ILCD (and EcoSpold 2) formats a lot to make the conversion possible, so I have extensive knowledge of the formats and of reading and working with data in them, and I have been through the struggle of mapping elementary flows between formats D:

@mfastudillo If you are interested, I would like to help you develop the importer. I'm open to a meeting or an exchange of emails if you want (my GitHub page has my email). I could fork it, but I still have limited knowledge of Brightway, so I can help better in other ways.

mfastudillo commented 1 year ago

Hi @JosePauloSavioli, sure, contributions are more than welcome! I'll try to update the issues. The importer follows an extract - transform - load logic, and one of the trickiest parts is the "extract" step, where we parse different fields of the ILCD zip file into a list of dictionaries. This does not require much Brightway knowledge, but knowledge of the ILCD format is very useful.
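To make the "extract" step concrete, here is a rough, standard-library-only sketch of reading process files from an ILCD zip into a list of dictionaries. It is not the importer's code: `extract_processes` is a hypothetical name, and the archive layout (a `processes/` folder of XML files) and the element paths are assumptions. The archive is built in memory so the example is self-contained.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "process": "http://lca.jrc.it/ILCD/Process",
    "common": "http://lca.jrc.it/ILCD/Common",
}

PROCESS_XML = """<processDataSet xmlns="http://lca.jrc.it/ILCD/Process"
    xmlns:common="http://lca.jrc.it/ILCD/Common">
  <processInformation>
    <dataSetInformation>
      <common:UUID>00000000-0000-0000-0000-0000000000aa</common:UUID>
      <name><baseName xml:lang="en">example process</baseName></name>
    </dataSetInformation>
  </processInformation>
</processDataSet>"""


def extract_processes(zip_file):
    """Return one dictionary per process XML file found in the archive."""
    records = []
    with zipfile.ZipFile(zip_file) as archive:
        for name in archive.namelist():
            # Assumed layout: process datasets live in a 'processes/' folder.
            if "processes/" not in name or not name.endswith(".xml"):
                continue
            root = ET.fromstring(archive.read(name))
            info = root.find(
                "process:processInformation/process:dataSetInformation", NS
            )
            records.append({
                "file": name,
                "uuid": info.findtext("common:UUID", namespaces=NS),
                "name": info.findtext("process:name/process:baseName", namespaces=NS),
            })
    return records


# Build a small archive in memory instead of relying on a file on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("ILCD/processes/example.xml", PROCESS_XML)
    archive.writestr("ILCD/flows/ignored.xml", "<flowDataSet/>")
buffer.seek(0)

records = extract_processes(buffer)
print(records)
```

The resulting list of plain dictionaries is the kind of intermediate structure that the later "transform" and "load" steps could work on without touching XML again.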

JosePauloSavioli commented 1 year ago

Hmmm, this is really a tricky part. I saw difficulties in 4 ways:

I think this can become pretty specific. @mfastudillo, would you mind sharing more about the difficulties you are having with the extraction? Would you prefer that I discuss this in the issues of your fork?

mfastudillo commented 1 year ago

Hi @JosePauloSavioli, yes, I think the issues in the forked repository are a better place to discuss the main problems.

cmutel commented 10 months ago

Issue closed during cleanup for Brightcon 2023