Convenience method to get database as a DataFrame

cmutel commented 2 years ago

premise has a convenience method to get a database as a dataframe of edges (exchanges), and this would be nice to have for the base Brightway as well.

wurst has an efficient method to get data from Brightway databases, and we will reuse this. wurst will then become a dependency of bw2data.

We won't use the premise schema - it is a bit too opinionated, though we will try to minimize differences. The column schema draft is:

target_id: int,
target_database: str,
target_code: str,
target_activity: Optional[str],
target_reference_product: Optional[str],
target_location: Optional[str],
target_unit: Optional[str],
target_type: Optional[str]
source_id: int,
source_database: str,
source_code: str,
source_activity: Optional[str],
source_product: Optional[str],  # Note different label
source_location: Optional[str],
source_unit: Optional[str],
source_type: Optional[str]
source_categories: Optional[str]  # Tuple concatenated with "::" as in `bw2io`
edge_amount: float,
edge_type: str,

Uncertainty fields can be added with a input flag.

The method should have the following properties:

Easy to add additional common columns
Can pass custom formatter to create different DataFrame schema
Compatible with pola.rs

Open questions:

~~MultiIndex instead of prefixes?~~
~~Handling of elementary edges is weird (producer), but creating a section just for the is also weird. This is why we have used input and output in the past, but this isn't perfect either.~~
~~Edge properties (e.g. carbon content) and classifications (e.g. ISIC)?~~ Can be added later if needed.

cmutel commented 2 years ago

Instead of using wurst, we could retrieve data using a single SQL query. However, my implementation causes a memory leak and doesn't see faster in any case:

from bw2data.backends import ActivityDataset as AD, ExchangeDataset as ED
from time import time

Consumer = AD.alias()
Producer = AD.alias()

query = (
    ED.select(
        ED.data.alias('exchange_data'), 
        ED.type.alias('exchange_type'), 
        Consumer.data.alias('consumer_data'), 
        Consumer.code.alias('consumer_code'), 
        Consumer.database.alias('consumer_database'), 
        Consumer.location.alias('consumer_location'), 
        Consumer.name.alias('consumer_name'), 
        Consumer.product.alias('consumer_product'), 
        Consumer.type.alias('consumer_type'), 
        Producer.data.alias('producer_data'), 
        Producer.code.alias('producer_code'), 
        Producer.database.alias('producer_database'), 
        Producer.location.alias('producer_location'), 
        Producer.name.alias('producer_name'), 
        Producer.product.alias('producer_product'), 
        Producer.type.alias('producer_type'), 
    )    
    .join_from(ED, Consumer, on=((ED.output_code == Consumer.code) & (ED.output_database == Consumer.database)))
    .join_from(ED, Producer, on=((ED.input_code == Producer.code) & (ED.input_database == Producer.database)))
    .where(ED.output_database == 'ecoinvent 3.8 cutoff')
).dicts()

start = time()
for row in query:
    pass

end = time()
print(end - start)

cmutel commented 2 years ago

Can't only use wurst functions, as these won't work for the IOTable backend, which don't store Edge objects but only write processed arrays.

cmutel commented 2 years ago

Need wurst 0.3.3 for add_identifiers flag.

cmutel commented 2 years ago

Following suggestion from @michaelweinold, switch to target and source from consumer and producer.

cmutel commented 2 years ago

To use wurst extraction:

from wurst import extract_brightway2_databases
list_of_dicts = extract_brightway2_databases("<database name>", add_identifiers=True)

brightway-lca / brightway2-data

Convenience method to get database as a DataFrame #106