brightway-lca / brightway2-data

Tools for the management of inventory databases and impact assessment methods. Part of the Brightway LCA framework.
https://docs.brightway.dev/
BSD 3-Clause "New" or "Revised" License
11 stars 24 forks source link

db.search location filter does not work for locations with punctuation characters (spaces or hyphens) #63

Open aleksandra-kim opened 5 years ago

aleksandra-kim commented 5 years ago

Original report by Tomas Navarrete Gutierrez (Bitbucket: tomas_navarrete, ).


Observed state

When you search using the db.searchfacility, you can add filters for location. If the location value contains hyphens ('CA-SK', ‘UN-OCEANIA', ‘IN-North-eastern grid’) or spaces ('Europe without Switzerland', ‘loco location’, 'RER w/o DE+NL+RU') the search does not yield the expected results.

# Imagine mydb is ei35
db = Database('mydb')
db.search('production', filter={'location’:'CA-SK'})
> Excluding 9359 filtered results
> []

but only using the :flag_ca: ‘CA’ part of the location works:

db.search('production', filter={'location’:'CA-SK'})
> Excluding 6313 filtered results                                                                                              
['brick production facility construction' (unit, CA-QC, None),                                                               
 'treatment, sludge from pulp and paper production, landfarming' (kilogram, CA-QC, None),                                    
 'carton board box production service, with gravure printing' (kilogram, CA-QC, None)
...
 'electricity production, hydro, pumped storage' (kilowatt hour, CA-NB, None),
 'electricity production, hydro, pumped storage' (kilowatt hour, CA-AB, None),
 'electricity production, hydro, pumped storage' (kilowatt hour, CA-ON, None),
 'electricity production, hydro, pumped storage' (kilowatt hour, CA-SK, None)]

I tested this using bw2data 3.5 and whoosh 2.7.4

Suggestion

My suggestion is to change the schema, and use ID for the location field instead of TEXT.

Whoosh documentation states that:

Use ID for fields like url or path (the URL or file path of a document), date, category – fields where the value must be treated as a whole, and each document only has one value for the field.

I believe that there is only one location for an activity document, and that it is more useful to treat the location as a whole.

Here is what may need to change in file schema.py :

bw2_schema = Schema(
    name=TEXT(stored=True, sortable=True),
    comment=TEXT(stored=True),
    product=TEXT(stored=True, sortable=True),
    categories=TEXT(stored=True),
    location=ID(stored=True, sortable=True),
    database=TEXT(stored=True),
    code=ID(unique=True, stored=True),
)
aleksandra-kim commented 5 years ago

Original comment by Chris Mutel (Bitbucket: cmutel, GitHub: cmutel).


This is a reasonable request. The ultimate decision on whether it is better to use id or text probably depends on how important you value partial matches versus exact matches. I don’t really have an opinion, and in general just iterating over the activities or using the search index manually (without the Brightway wrapping) might be a good alternative.

Unfortunately, this change would require migrations to the new search index schema, so the tests become complicated. Any volunteers to write these tests?

aleksandra-kim commented 5 years ago

Original comment by Tomas Navarrete Gutierrez (Bitbucket: tomas_navarrete, ).


using the search index manually (without the Brightway wrapping) might be a good alternative

I’ll go for that one I think.
I thought about the migrations + testing and well, that certainly takes quite some time for not much gain.

I suppose the issue can be closed.

CHarpprecht commented 1 year ago

I am experiencing the same problem with the location filter.

Additionally, the following searches involving the name of the reference product do not yield any results even though the processes exist:

db.search('reference_product', filter={'name’:process_name})
db.search('process_name', filter={'product’:reference_product})

Just searching for the process_name works, e.g.:

db.search('process_name')

I am using ecoinvent3.8 cutoff and: brightway2=2.4 bw2data=3.6.5

BenPortner commented 1 year ago

Hi @CHarpprecht,

thanks for using Brightway! I'll try to help you with this one but I need a little more info from your side.

db.search('reference_product', filter={'name’:process_name})

I don't quite understand your code yet. Above line will look for the literal string 'reference_product' in the search index. Do you have an activity named like that in your database? Or perhaps you meant to write db.search(reference_product, filter={'name':process_name}) to look for the string stored in the variable process_name (note the omitted single quotes). Also please make sure to use quotes consistently in your code (right now you are mixing ' and ). Can you confirm that these two things are correct in your code and let me know if it works?

Thanks Ben

CHarpprecht commented 1 year ago

Hi @BenPortner, thank you for pointing out the syntax errors. In the original code, these are variables and I made some mistakes when trying to generalize it for github.

You are right, it should be:

db.search(reference_product, filter={'name':process_name})
db.search(process_name, filter={'product':reference_product})

with reference_product and process_name being variables for reference products and process names I would like to search for. I double-checked again the original code and it is correct there. The 2 searches above do not yield any results, even though they should, since the activities are in the database.

If I run the searches without the filter, the respective processes are in the yielded search results. So the following searches are working:

db.search(reference_product)
db.search(process_name)

But then it is necessary to do some additional filtering afterwards outside of the db.search() function of brightway in order to identify the process with the correct reference product and location.

BenPortner commented 1 year ago

Hi @CHarpprecht,

Thanks for your description. I was able to re-create the issue on my system. I can confirm that this is a problem with Brightway (or rather with Whoosh, the library Brightway uses for searching) and not a user error.

Why does this error occur

Taking a concrete example: db.search("*", filter={"product": "concrete, 25mpa"}) fails because it will try to find the string "concrete, 25mpa" in the search index for the field "product". However, there is no such string in the search index. Because the product field is processed during creation of the search index, "concrete, 25mpa" is split into "concrete" and "25mpa". This means, db.search("*", filter={"product": "concrete"}) yields results and db.search("*", filter={"product": "25mpa"}) yields results but db.search("*", filter={"product": "concrete, 25mpa"}) yields no results.

Work-around (for users)

Please use Database.get(product=..., name=...) for now as this is more reliable

Fixes (for developers)

To fix this issue, we need to either a) add the original string to the search index in addition to the processed parts or b) process the filter strings in the same way the original dataset fields were processed (i.e. tokenization, filtering stop words...)

CHarpprecht commented 1 year ago

Hi @BenPortner , thank you very much for looking into this and for providing a work-around!

dlr-bf commented 1 year ago

Hi,

I am working together with @CHarpprecht and have some update concerning the issue.

According to our experiments, the Database.get() interface accepts only the bw code. Hence the proposed workaround with Database.get(product=..., name=...) does not work for our usecase. We found a solution for us that works (basically using the search to narrow it down and then filter the candidates ourselves) but wanted to document, that the workaround did not support our usecase.

Best,

Ben

BenPortner commented 1 year ago

Hi @dlr-bf,

I forgot to mention: You need Brightway 2.5 for the workaround to work.

Best regards Ben

tngTUDOR commented 1 year ago

@CHarpprecht & @dlr-bf Here is another option to use, instead of "search": Use the bw2data.query classes Query and Filter (compatible with brightway 2)


import bw2data as bd
import bw2data.query as bdq # for the Query and Filter classes

bd.projects.set_current('ecoinvent391')
# get an instance of the database to query
db = bd.Database('ecoinvent391-cutoff')
# The query and filter classes work on the full data from your database
# The line below might take a few seconds: it's loading all the database in memory, but search is performant!
db_data = db.load()
# load the data

# Create an empty query
a_query = bdq.Query()
# Create the filters
# checkout bdq.Filter? to get an idea of what is possible
name_filter = bdq.Filter("name", "has", "production")
location_filter = bdq.Filter("location", "is", "CA-SK")
# build the query by adding the filters we want
a_query.add(name_filter)
a_query.add(location_filter)
# and now, search!
results = a_query(db_data)
# What did we get?
for r in results:
    print(r)
('ecoinvent391-cutoff', '0102c396062fc5981b49b509986d269d')
('ecoinvent391-cutoff', '6da28f05f8aaf33398983c97387f9f55')
('ecoinvent391-cutoff', '01071faa13220743af5d2d574b7a34d0')
('ecoinvent391-cutoff', 'dc0150633520f4460198e83a77ac137c')
('ecoinvent391-cutoff', '03a69791062b9296fecc70f88867d394')
('ecoinvent391-cutoff', '55b91a957f80ed050b40c676dc1b8684')
('ecoinvent391-cutoff', '93d82d569504776269bd6df428c5e06e')
('ecoinvent391-cutoff', 'f783e0a8e3436e2efcf0fc1dd858016b')
('ecoinvent391-cutoff', '0153a1ab7fb39f5bd163721e3bd4a109')
('ecoinvent391-cutoff', '17c9670f75f1b10a3ea8850c4bffdccb')
('ecoinvent391-cutoff', 'd039918b419a0ef61e37a32a690611bf')
('ecoinvent391-cutoff', '4716338521f6ce23fab7af3e19bbcb91')
('ecoinvent391-cutoff', 'dd48ad41380660ab781b66f3c4925da2')
('ecoinvent391-cutoff', 'fa107d754aada1573bbdaaf5fbe3e34d')
('ecoinvent391-cutoff', '870a2c52aec6235e38893b83473c5c14')
('ecoinvent391-cutoff', '59d0508f8108714e1d9da150ea33ebd4')
('ecoinvent391-cutoff', '7c061eee793d60eb6f9ad7e1084ca7dd')
# tuples with db name and activity id
# Now, let's make it useful:
for db_name, db_key in results:
    a = db.get(db_key)
    print(a)
'electricity production, hydro, pumped storage' (kilowatt hour, CA-SK, None)
'electricity production, photovoltaic, 3kWp slanted-roof installation, single-Si, panel, mounted' (kilowatt hour, CA-SK, None)
'electricity production, wind, 1-3MW turbine, onshore' (kilowatt hour, CA-SK, None)
'electricity production, hydro, run-of-river' (kilowatt hour, CA-SK, None)
'electricity production, natural gas, combined cycle power plant' (kilowatt hour, CA-SK, None)
'fava bean production' (kilogram, CA-SK, None)
'electricity production, hydro, reservoir, non-alpine region' (kilowatt hour, CA-SK, None)
'electricity production, lignite' (kilowatt hour, CA-SK, None)
'lentil production' (kilogram, CA-SK, None)
'electricity production, photovoltaic, 570kWp open ground installation, multi-Si' (kilowatt hour, CA-SK, None)
'electricity production, natural gas, conventional power plant' (kilowatt hour, CA-SK, None)
'electricity production, wind, >3MW turbine, onshore' (kilowatt hour, CA-SK, None)
'electricity, high voltage, production mix' (kilowatt hour, CA-SK, None)
'electricity production, photovoltaic, 3kWp slanted-roof installation, multi-Si, panel, mounted' (kilowatt hour, CA-SK, None)
'electricity production, oil' (kilowatt hour, CA-SK, None)
'protein pea production' (kilogram, CA-SK, None)
'electricity production, wind, <1MW turbine, onshore' (kilowatt hour, CA-SK, None)
tngTUDOR commented 1 year ago

@CHarpprecht & @dlr-bf Here's an example regarding your example ref product (concrete 25MPa)

another_filter = bdq.Filter("reference product", "has", "concrete, 25MPa")
concrete_query = bdq.Query()
concrete_query.add(another_filter)
results = concrete_query(db_data)
for db_name, db_key in results:
    a = db.get(db_key)
    print(a)

'concrete production, 25MPa, for building construction, for interior use, with cement ZN/D, with 50% RC-C aggregates' (cubic meter, CH, None)
'market for concrete, 25MPa' (cubic meter, CH, None)
'concrete production, 25MPa, for building construction, with cement, CP III' (cubic meter, BR, None)
'concrete production, 25MPa, for building construction, with cement, CEM II/A' (cubic meter, CH, None)
'market for concrete, 25MPa' (cubic meter, RNA, None)
'concrete production, 25MPa, for building construction, with cement, CEM II/B' (cubic meter, CH, None)
'concrete production, 25MPa, for building construction, with cement, CEM II/B' (cubic meter, RoW, None)
'concrete production, 25MPa, for building construction, for interior use, with cement ZN/D' (cubic meter, CH, None)
'concrete production, 25MPa, for building construction, with cement, CEM III/A' (cubic meter, GLO, None)
'concrete production, 25MPa, for building construction, for interior use, with cement ZN/D, with <30% RC-M aggregates' (cubic meter, CH, None)
'concrete production, 25MPa, for building construction, with cement, CP-II-F' (cubic meter, BR, None)
'concrete production, 25MPa, for building construction, for interior use, with cement, Portland' (cubic meter, RoW, None)
'concrete production, 25MPa, for building construction, for interior use, with cement, Portland' (cubic meter, North America without Quebec, None)
'market for concrete, 25MPa' (cubic meter, ZA, None)
'concrete production, 25MPa, for building construction, exposure class XC1, with cement, unspecified' (cubic meter, AT, None)
'market for concrete, 25MPa' (cubic meter, RoW, None)
'concrete production, 25MPa, for building construction, with cement, CEM II/A' (cubic meter, RoW, None)
'concrete production, 25MPa, for building construction, exposure class XC2, with cement, unspecified' (cubic meter, AT, None)
'concrete production, 25MPa, for building construction, with cement, CP-II-E' (cubic meter, BR, None)
'concrete production, 25MPa, for building construction, with cement, CEM II/B-V' (cubic meter, ZA, None)
'concrete production, 25MPa, for building construction, for interior use, with cement, Portland' (cubic meter, CA-QC, None)
'market for concrete, 25MPa' (cubic meter, BR, None)
'market for concrete, 25MPa' (cubic meter, AT, None)

and now extened for location:

another_filter = bdq.Filter("reference product", "has", "concrete, 25MPa")
another_location_filter = bdq.Filter("location", "is", "North America without Quebec")
concrete_query = bdq.Query()
concrete_query.add(another_filter)
concrete_query.add(another_location_filter)
results = concrete_query(db_data)
for db_name, db_key in results:
    a = db.get(db_key)
    print(a)

'concrete production, 25MPa, for building construction, for interior use, with cement, Portland' (cubic meter, North America without Quebec, None)