bw-processing

Library for storing numeric data for use in matrix-based calculations. Designed for use with the Brightway life cycle assessment framework.


Read the documentation at https://bw-processing.readthedocs.io/



Background

The Brightway LCA framework has stored the data used to construct matrices in binary form as numpy arrays for years. This package is an evolution of that approach, and adds several new features.

Concepts

Data packages

Data objects can be vectors or arrays. Vectors will always produce the same matrix, while arrays have multiple possible values for each element of the matrix. Arrays are a generalization of the presamples library.
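In numpy terms, a vector resource holds one value per matrix element, while an array resource holds one row per matrix element and one column per possible system state; a single column is selected each time the matrix is built. A small illustrative sketch (the values and names are invented for this example):

import numpy as np

# Vector resource: one fixed value for each of three matrix elements
vector_data = np.array([1.0, 2.5, -0.3])

# Array resource: the same three matrix elements, but five candidate values each;
# one column is chosen per matrix construction / iteration
array_data = np.array([
    [0.9, 1.0, 1.1, 1.0, 0.95],
    [2.4, 2.5, 2.6, 2.5, 2.45],
    [-0.3, -0.2, -0.4, -0.3, -0.35],
])

print(vector_data.shape)  # (3,)
print(array_data.shape)   # (3, 5)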

Data needed for matrix construction

Vectors versus arrays

Persistent versus dynamic

Persistent data is fixed, and can be completely loaded into memory and used directly or written to disk. Dynamic data is only resolved as it is used, during matrix construction and iteration. Dynamic data is provided by interfaces - Python code that either generates the data or wraps data coming from other software. There are many possible use cases for data interfaces.

Only the actual numerical values entered into the matrix are dynamic - the matrix index values (and the optional flip vector) are still static, and need to be provided as Numpy arrays when adding dynamic resources.

Interfaces must implement a simple API. Dynamic vectors must support the Python generator API, i.e. implement __next__().

Dynamic arrays must pretend to be Numpy arrays, in that they need to implement .shape and .__getitem__(args).

Here are some example interfaces (also given in bw_processing/examples/interfaces.py):

import numpy as np

class ExampleVectorInterface:
    def __init__(self):
        self.rng = np.random.default_rng()
        self.size = self.rng.integers(2, 10)

    def __next__(self):
        # Return a fresh random vector of fixed length on each iteration
        return self.rng.random(self.size)

class ExampleArrayInterface:
    def __init__(self):
        rng = np.random.default_rng()
        # Each column is one possible set of values for the matrix elements
        self.data = rng.random((rng.integers(2, 10), rng.integers(2, 10)))

    @property
    def shape(self):
        return self.data.shape

    def __getitem__(self, args):
        # Called with a tuple of indices; returns the requested column
        if args[1] >= self.shape[1]:
            raise IndexError
        return self.data[:, args[1]]
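For completeness, adding a dynamic resource to a datapackage then looks roughly like the sketch below: the values come from the interface, while the matrix coordinates are supplied as a static Numpy array. This assumes the create_datapackage / add_dynamic_vector API and the INDICES_DTYPE structured dtype from the documentation; the interface class and file name are invented for this example.

import numpy as np
from fsspec.implementations.zip import ZipFileSystem
from bw_processing import create_datapackage, INDICES_DTYPE

class ThreeValueInterface:
    # Hypothetical dynamic vector: yields one value per static index below
    def __next__(self):
        return np.random.default_rng().random(3)

dp = create_datapackage(
    fs=ZipFileSystem("dynamic-example.zip", mode="w"), name="dynamic-example"
)

# The matrix coordinates stay static even though the values are dynamic
indices = np.array([(0, 0), (1, 0), (1, 1)], dtype=INDICES_DTYPE)

dp.add_dynamic_vector(
    matrix="technosphere_matrix",
    interface=ThreeValueInterface(),
    indices_array=indices,
    name="example-dynamic-vector",
)
dp.finalize_serialization()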

Interface dehydrating and rehydrating

Serialized datapackages cannot contain executable code, both because of our chosen data formats, and for security reasons. Therefore, when loading a datapackage with an interface, that interface object needs to be reconstituted as Python code - we call this cycle dehydration and rehydration. Dehydration happens automatically when a datapackage is finalized with finalize_serialization(), but rehydration needs to be done manually using rehydrate_interface(). For example:

from fsspec.implementations.zip import ZipFileSystem
from bw_processing import load_datapackage

my_dp = load_datapackage(ZipFileSystem("some-path.zip"))
my_dp.rehydrate_interface("some-resource-name", ExampleVectorInterface())

You can list the dehydrated interfaces present with .dehydrated_interfaces().

You can store useful information for the interface object initialization under the resource key config. This can be used in instantiating an interface if you pass initialize_with_config:

from fsspec.implementations.zip import ZipFileSystem
from bw_processing import load_datapackage
import requests
import numpy as np

class MyInterface:
    def __init__(self, url):
        self.url = url

    def __next__(self):
        return np.array(requests.get(self.url).json())

my_dp = load_datapackage(ZipFileSystem("some-path.zip"))
data_obj, resource_metadata = my_dp.get_resource("some-interface")
print(resource_metadata['config'])
>>> {"url": "example.com"}

my_dp.rehydrate_interface("some-interface", MyInterface, initialize_with_config=True)
# interface is substituted, need to retrieve it again
data_obj, resource_metadata = my_dp.get_resource("some-interface")
print(data_obj.url)
>>> "example.com"

Policies

Data package policies define how the data should be used. Policies apply to the entire data package; you may wish to adjust what is stored in which data packages to get the effect you desire.

There are two policies that apply to all data resources:

sum_intra_duplicates (default True): What to do if more than one data point is given for the same matrix element within a single vector or array resource. If true, sum these values; otherwise, the last value provided is used.

sum_inter_duplicates (default False): What to do if data from a given resource overlaps data already present in the matrix. If true, add the given value to the existing value; otherwise, the existing value is overwritten.
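To make these two policies concrete, here is a minimal, illustrative sketch of duplicate handling when inserting (row, col, value) triples into a dense matrix. This is not bw_processing's actual matrix-building code; the function name and the dense-matrix shortcut are invented for the example.

import numpy as np

def insert_resource(matrix, rows, cols, values, sum_intra=True, sum_inter=False):
    # Resolve duplicates *within* this resource first
    contribution = np.zeros_like(matrix)
    if sum_intra:
        # Sum repeated (row, col) pairs inside the resource
        np.add.at(contribution, (rows, cols), values)
    else:
        # Keep only one value per (row, col) pair (the last one provided)
        contribution[rows, cols] = values

    mask = np.zeros(matrix.shape, dtype=bool)
    mask[rows, cols] = True
    if sum_inter:
        # Add to whatever is already in the matrix
        matrix[mask] += contribution[mask]
    else:
        # Overwrite existing values at these coordinates
        matrix[mask] = contribution[mask]
    return matrix

matrix = np.zeros((2, 2))
# Two data points for element (0, 0) in one resource: summed (sum_intra_duplicates)
insert_resource(matrix, rows=[0, 0], cols=[0, 0], values=[1.0, 2.0])
print(matrix[0, 0])  # 3.0
# A second resource overlapping (0, 0): overwrites (sum_inter_duplicates is False)
insert_resource(matrix, rows=[0], cols=[0], values=[10.0])
print(matrix[0, 0])  # 10.0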

The following policies apply only to array data resources, where a different column from the array is used in matrix construction each time the array is iterated over:

combinatorial (default False): If more than one array resource is available, this policy controls whether all possible combinations of columns are guaranteed to occur. If combinatorial is True, we use itertools.combinations to generate column indices for the respective arrays; if False, column indices are either completely random (with replacement) or sequential.

Note that you will get StopIteration if you exhaust all combinations when combinatorial is True.

Note that combinatorial cannot be True if infinite array interfaces are present.

sequential (default False): Array resources have multiple columns, each of which represents a valid system state. Default behaviour is to choose from these columns at random (including replacement), using a RNG and the data package seed value. If sequential is True, columns in each array will be chosen in order starting from column zero, and will rewind to zero if the end of the array is reached.

Note that if combinatorial is True, sequential is ignored; instead, the column indices are generated by itertools.combinations.

Please make sure you understand how combinatorial and sequential interact! There are three possibilities:

combinatorial is True: sequential is ignored; column indices are generated as combinations across the array resources, and iteration raises StopIteration once all combinations are exhausted.

combinatorial is False, sequential is True: columns are used in order, starting from column zero and wrapping back to zero when the end of an array is reached.

combinatorial is False, sequential is False (the defaults): columns are chosen at random, with replacement, using the data package seed value.
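As a rough illustration of these three modes, here is a sketch of how column indices could be generated. This is not bw_processing's internal indexer, the function names are invented, and combinations across multiple arrays are sketched with itertools.product (every column of one array paired with every column of the others).

import itertools
import numpy as np

def sequential_indices(ncols):
    # sequential=True: 0, 1, ..., ncols - 1, then wrap back to zero
    return itertools.cycle(range(ncols))

def random_indices(ncols, seed=None):
    # Default behaviour: random choice with replacement, reproducible via the datapackage seed
    rng = np.random.default_rng(seed)
    while True:
        yield int(rng.integers(0, ncols))

def combinatorial_indices(ncols_per_array):
    # combinatorial=True: every combination of columns across the array resources;
    # raises StopIteration once all combinations are exhausted
    return itertools.product(*(range(n) for n in ncols_per_array))

seq = sequential_indices(3)
print([next(seq) for _ in range(5)])        # [0, 1, 2, 0, 1]

rnd = random_indices(3, seed=42)
print([next(rnd) for _ in range(5)])        # five reproducible indices in [0, 3)

print(list(combinatorial_indices([2, 2])))  # [(0, 0), (0, 1), (1, 0), (1, 1)]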

Install

Install using pip or conda (channel cmutel). Depends on numpy and pandas (for reading and writing CSVs).

Has no explicit or implicit dependence on any other part of Brightway.

Usage

The main interface for using this library is the Datapackage class. However, instead of creating an instance of this class directly, you should use the utility functions create_datapackage and load_datapackage.

A datapackage is a set of file objects (either in-memory or on disk) that includes a metadata file object and one or more data resource file objects. The metadata file object includes both generic metadata (e.g. when it was created, the data license) and metadata specific to each data resource (how it can be used in calculations, its relationship to other data resources). Datapackages follow the data package standard.
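For orientation, the metadata file object looks roughly like the following, shown here as a Python dict. The field names are modelled on the data package standard; the exact fields bw_processing writes, and bw_processing-specific resource keys such as "matrix", may differ from this sketch.

datapackage_metadata = {
    # Generic metadata for the whole package
    "profile": "data-package",
    "name": "example-datapackage",
    "created": "2024-01-01T00:00:00Z",
    "licenses": [{"name": "ODC-PDDL-1.0"}],
    # Per-resource metadata: how each file object is used in calculations
    "resources": [
        {
            "name": "example-vector.data",
            "mediatype": "application/octet-stream",
            "matrix": "technosphere_matrix",  # illustrative: which matrix this resource feeds
        },
    ],
}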

Creating datapackages

Datapackages are created using create_datapackage; see the documentation for the full list of arguments it accepts.

Calling this function returns an instance of Datapackage. You still need to add data, as in the sketch below.
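A minimal sketch of that workflow, assuming the create_datapackage / add_persistent_vector / finalize_serialization API and the INDICES_DTYPE structured dtype described in the documentation (check the docs for exact signatures; the file and resource names here are invented):

import numpy as np
from fsspec.implementations.zip import ZipFileSystem
from bw_processing import create_datapackage, INDICES_DTYPE

# Filesystem the serialized datapackage will be written to
fs = ZipFileSystem("my-datapackage.zip", mode="w")
dp = create_datapackage(fs=fs, name="example", seed=42)

# Row and column indices of the matrix elements this resource fills
indices = np.array([(0, 0), (1, 1), (2, 1)], dtype=INDICES_DTYPE)
# One value per (row, col) pair; a vector always produces the same matrix
data = np.array([1.0, 2.5, -0.3])

dp.add_persistent_vector(
    matrix="technosphere_matrix",  # label of the matrix this resource contributes to
    name="example-vector",
    indices_array=indices,
    data_array=data,
)

# Write the data and metadata file objects to the zip archive
dp.finalize_serialization()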

Contributing

Your contribution is welcome! Please follow the pull request workflow, even for minor changes.

When contributing to this repository with a major change, please first discuss the change you wish to make via issue, email, or any other method with the owners of this repository.

Please note we have a code of conduct; please follow it in all your interactions with the project.

Documentation and coding standards

Maintainers

License

BSD-3-Clause. Copyright 2020 Chris Mutel.