brightway-lca / brightway2-data

Tools for the management of inventory databases and impact assessment methods. Part of the Brightway LCA framework.
https://docs.brightway.dev/
BSD 3-Clause "New" or "Revised" License
8 stars 21 forks source link

Convenience method to get database as a DataFrame #106

Closed cmutel closed 2 years ago

cmutel commented 2 years ago

premise has a convenience method to get a database as a dataframe of edges (exchanges), and this would be nice to have for the base Brightway as well.

wurst has an efficient method to get data from Brightway databases, and we will reuse this. wurst will then become a dependency of bw2data.

We won't use the premise schema - it is a bit too opinionated, though we will try to minimize differences. The column schema draft is:

target_id: int,
target_database: str,
target_code: str,
target_activity: Optional[str],
target_reference_product: Optional[str],
target_location: Optional[str],
target_unit: Optional[str],
target_type: Optional[str]
source_id: int,
source_database: str,
source_code: str,
source_activity: Optional[str],
source_product: Optional[str],  # Note different label
source_location: Optional[str],
source_unit: Optional[str],
source_type: Optional[str]
source_categories: Optional[str]  # Tuple concatenated with "::" as in `bw2io`
edge_amount: float,
edge_type: str,

Uncertainty fields can be added with a input flag.

The method should have the following properties:

Open questions:

cmutel commented 2 years ago

Instead of using wurst, we could retrieve data using a single SQL query. However, my implementation causes a memory leak and doesn't see faster in any case:

from bw2data.backends import ActivityDataset as AD, ExchangeDataset as ED
from time import time

Consumer = AD.alias()
Producer = AD.alias()

query = (
    ED.select(
        ED.data.alias('exchange_data'), 
        ED.type.alias('exchange_type'), 
        Consumer.data.alias('consumer_data'), 
        Consumer.code.alias('consumer_code'), 
        Consumer.database.alias('consumer_database'), 
        Consumer.location.alias('consumer_location'), 
        Consumer.name.alias('consumer_name'), 
        Consumer.product.alias('consumer_product'), 
        Consumer.type.alias('consumer_type'), 
        Producer.data.alias('producer_data'), 
        Producer.code.alias('producer_code'), 
        Producer.database.alias('producer_database'), 
        Producer.location.alias('producer_location'), 
        Producer.name.alias('producer_name'), 
        Producer.product.alias('producer_product'), 
        Producer.type.alias('producer_type'), 
    )    
    .join_from(ED, Consumer, on=((ED.output_code == Consumer.code) & (ED.output_database == Consumer.database)))
    .join_from(ED, Producer, on=((ED.input_code == Producer.code) & (ED.input_database == Producer.database)))
    .where(ED.output_database == 'ecoinvent 3.8 cutoff')
).dicts()

start = time()
for row in query:
    pass

end = time()
print(end - start)
cmutel commented 2 years ago

Can't only use wurst functions, as these won't work for the IOTable backend, which don't store Edge objects but only write processed arrays.

cmutel commented 2 years ago

Need wurst 0.3.3 for add_identifiers flag.

cmutel commented 2 years ago

Following suggestion from @michaelweinold, switch to target and source from consumer and producer.

cmutel commented 2 years ago

To use wurst extraction:

from wurst import extract_brightway2_databases
list_of_dicts = extract_brightway2_databases("<database name>", add_identifiers=True)