Open aleksandra-kim opened 5 years ago
Original comment by Chris Mutel (Bitbucket: cmutel, GitHub: cmutel).
This is a reasonable request. The ultimate decision on whether it is better to use id
or text
probably depends on how important you value partial matches versus exact matches. I don’t really have an opinion, and in general just iterating over the activities or using the search index manually (without the Brightway wrapping) might be a good alternative.
Unfortunately, this change would require migrations to the new search index schema, so the tests become complicated. Any volunteers to write these tests?
Original comment by Tomas Navarrete Gutierrez (Bitbucket: tomas_navarrete, ).
using the search index manually (without the Brightway wrapping) might be a good alternative
I’ll go for that one I think.
I thought about the migrations + testing and well, that certainly takes quite some time for not much gain.
I suppose the issue can be closed.
I am experiencing the same problem with the location filter.
Additionally, the following searches involving the name of the reference product do not yield any results even though the processes exist:
db.search('reference_product', filter={'name’:process_name})
db.search('process_name', filter={'product’:reference_product})
Just searching for the process_name works, e.g.:
db.search('process_name')
I am using ecoinvent3.8 cutoff and:
brightway2=2.4
bw2data=3.6.5
Hi @CHarpprecht,
thanks for using Brightway! I'll try to help you with this one but I need a little more info from your side.
db.search('reference_product', filter={'name’:process_name})
I don't quite understand your code yet. Above line will look for the literal string 'reference_product'
in the search index. Do you have an activity named like that in your database? Or perhaps you meant to write db.search(reference_product, filter={'name':process_name})
to look for the string stored in the variable process_name
(note the omitted single quotes). Also please make sure to use quotes consistently in your code (right now you are mixing '
and ’
). Can you confirm that these two things are correct in your code and let me know if it works?
Thanks Ben
Hi @BenPortner, thank you for pointing out the syntax errors. In the original code, these are variables and I made some mistakes when trying to generalize it for github.
You are right, it should be:
db.search(reference_product, filter={'name':process_name})
db.search(process_name, filter={'product':reference_product})
with reference_product
and process_name
being variables for reference products and process names I would like to search for.
I double-checked again the original code and it is correct there. The 2 searches above do not yield any results, even though they should, since the activities are in the database.
If I run the searches without the filter, the respective processes are in the yielded search results. So the following searches are working:
db.search(reference_product)
db.search(process_name)
But then it is necessary to do some additional filtering afterwards outside of the db.search() function of brightway in order to identify the process with the correct reference product and location.
Hi @CHarpprecht,
Thanks for your description. I was able to re-create the issue on my system. I can confirm that this is a problem with Brightway (or rather with Whoosh, the library Brightway uses for searching) and not a user error.
Taking a concrete example: db.search("*", filter={"product": "concrete, 25mpa"})
fails because it will try to find the string "concrete, 25mpa" in the search index for the field "product". However, there is no such string in the search index. Because the product field is processed during creation of the search index, "concrete, 25mpa" is split into "concrete" and "25mpa". This means, db.search("*", filter={"product": "concrete"})
yields results and db.search("*", filter={"product": "25mpa"})
yields results but db.search("*", filter={"product": "concrete, 25mpa"})
yields no results.
Please use Database.get(product=..., name=...)
for now as this is more reliable
To fix this issue, we need to either a) add the original string to the search index in addition to the processed parts or b) process the filter strings in the same way the original dataset fields were processed (i.e. tokenization, filtering stop words...)
Hi @BenPortner , thank you very much for looking into this and for providing a work-around!
Hi,
I am working together with @CHarpprecht and have some update concerning the issue.
According to our experiments, the Database.get() interface accepts only the bw code. Hence the proposed workaround with
Database.get(product=..., name=...)
does not work for our usecase.
We found a solution for us that works (basically using the search to narrow it down and then filter the candidates ourselves) but wanted to document, that the workaround did not support our usecase.
Best,
Ben
Hi @dlr-bf,
I forgot to mention: You need Brightway 2.5 for the workaround to work.
Best regards Ben
@CHarpprecht & @dlr-bf Here is another option to use, instead of "search": Use the bw2data.query classes Query
and Filter
(compatible with brightway 2)
import bw2data as bd
import bw2data.query as bdq # for the Query and Filter classes
bd.projects.set_current('ecoinvent391')
# get an instance of the database to query
db = bd.Database('ecoinvent391-cutoff')
# The query and filter classes work on the full data from your database
# The line below might take a few seconds: it's loading all the database in memory, but search is performant!
db_data = db.load()
# load the data
# Create an empty query
a_query = bdq.Query()
# Create the filters
# checkout bdq.Filter? to get an idea of what is possible
name_filter = bdq.Filter("name", "has", "production")
location_filter = bdq.Filter("location", "is", "CA-SK")
# build the query by adding the filters we want
a_query.add(name_filter)
a_query.add(location_filter)
# and now, search!
results = a_query(db_data)
# What did we get?
for r in results:
print(r)
('ecoinvent391-cutoff', '0102c396062fc5981b49b509986d269d')
('ecoinvent391-cutoff', '6da28f05f8aaf33398983c97387f9f55')
('ecoinvent391-cutoff', '01071faa13220743af5d2d574b7a34d0')
('ecoinvent391-cutoff', 'dc0150633520f4460198e83a77ac137c')
('ecoinvent391-cutoff', '03a69791062b9296fecc70f88867d394')
('ecoinvent391-cutoff', '55b91a957f80ed050b40c676dc1b8684')
('ecoinvent391-cutoff', '93d82d569504776269bd6df428c5e06e')
('ecoinvent391-cutoff', 'f783e0a8e3436e2efcf0fc1dd858016b')
('ecoinvent391-cutoff', '0153a1ab7fb39f5bd163721e3bd4a109')
('ecoinvent391-cutoff', '17c9670f75f1b10a3ea8850c4bffdccb')
('ecoinvent391-cutoff', 'd039918b419a0ef61e37a32a690611bf')
('ecoinvent391-cutoff', '4716338521f6ce23fab7af3e19bbcb91')
('ecoinvent391-cutoff', 'dd48ad41380660ab781b66f3c4925da2')
('ecoinvent391-cutoff', 'fa107d754aada1573bbdaaf5fbe3e34d')
('ecoinvent391-cutoff', '870a2c52aec6235e38893b83473c5c14')
('ecoinvent391-cutoff', '59d0508f8108714e1d9da150ea33ebd4')
('ecoinvent391-cutoff', '7c061eee793d60eb6f9ad7e1084ca7dd')
# tuples with db name and activity id
# Now, let's make it useful:
for db_name, db_key in results:
a = db.get(db_key)
print(a)
'electricity production, hydro, pumped storage' (kilowatt hour, CA-SK, None)
'electricity production, photovoltaic, 3kWp slanted-roof installation, single-Si, panel, mounted' (kilowatt hour, CA-SK, None)
'electricity production, wind, 1-3MW turbine, onshore' (kilowatt hour, CA-SK, None)
'electricity production, hydro, run-of-river' (kilowatt hour, CA-SK, None)
'electricity production, natural gas, combined cycle power plant' (kilowatt hour, CA-SK, None)
'fava bean production' (kilogram, CA-SK, None)
'electricity production, hydro, reservoir, non-alpine region' (kilowatt hour, CA-SK, None)
'electricity production, lignite' (kilowatt hour, CA-SK, None)
'lentil production' (kilogram, CA-SK, None)
'electricity production, photovoltaic, 570kWp open ground installation, multi-Si' (kilowatt hour, CA-SK, None)
'electricity production, natural gas, conventional power plant' (kilowatt hour, CA-SK, None)
'electricity production, wind, >3MW turbine, onshore' (kilowatt hour, CA-SK, None)
'electricity, high voltage, production mix' (kilowatt hour, CA-SK, None)
'electricity production, photovoltaic, 3kWp slanted-roof installation, multi-Si, panel, mounted' (kilowatt hour, CA-SK, None)
'electricity production, oil' (kilowatt hour, CA-SK, None)
'protein pea production' (kilogram, CA-SK, None)
'electricity production, wind, <1MW turbine, onshore' (kilowatt hour, CA-SK, None)
@CHarpprecht & @dlr-bf Here's an example regarding your example ref product (concrete 25MPa)
another_filter = bdq.Filter("reference product", "has", "concrete, 25MPa")
concrete_query = bdq.Query()
concrete_query.add(another_filter)
results = concrete_query(db_data)
for db_name, db_key in results:
a = db.get(db_key)
print(a)
'concrete production, 25MPa, for building construction, for interior use, with cement ZN/D, with 50% RC-C aggregates' (cubic meter, CH, None)
'market for concrete, 25MPa' (cubic meter, CH, None)
'concrete production, 25MPa, for building construction, with cement, CP III' (cubic meter, BR, None)
'concrete production, 25MPa, for building construction, with cement, CEM II/A' (cubic meter, CH, None)
'market for concrete, 25MPa' (cubic meter, RNA, None)
'concrete production, 25MPa, for building construction, with cement, CEM II/B' (cubic meter, CH, None)
'concrete production, 25MPa, for building construction, with cement, CEM II/B' (cubic meter, RoW, None)
'concrete production, 25MPa, for building construction, for interior use, with cement ZN/D' (cubic meter, CH, None)
'concrete production, 25MPa, for building construction, with cement, CEM III/A' (cubic meter, GLO, None)
'concrete production, 25MPa, for building construction, for interior use, with cement ZN/D, with <30% RC-M aggregates' (cubic meter, CH, None)
'concrete production, 25MPa, for building construction, with cement, CP-II-F' (cubic meter, BR, None)
'concrete production, 25MPa, for building construction, for interior use, with cement, Portland' (cubic meter, RoW, None)
'concrete production, 25MPa, for building construction, for interior use, with cement, Portland' (cubic meter, North America without Quebec, None)
'market for concrete, 25MPa' (cubic meter, ZA, None)
'concrete production, 25MPa, for building construction, exposure class XC1, with cement, unspecified' (cubic meter, AT, None)
'market for concrete, 25MPa' (cubic meter, RoW, None)
'concrete production, 25MPa, for building construction, with cement, CEM II/A' (cubic meter, RoW, None)
'concrete production, 25MPa, for building construction, exposure class XC2, with cement, unspecified' (cubic meter, AT, None)
'concrete production, 25MPa, for building construction, with cement, CP-II-E' (cubic meter, BR, None)
'concrete production, 25MPa, for building construction, with cement, CEM II/B-V' (cubic meter, ZA, None)
'concrete production, 25MPa, for building construction, for interior use, with cement, Portland' (cubic meter, CA-QC, None)
'market for concrete, 25MPa' (cubic meter, BR, None)
'market for concrete, 25MPa' (cubic meter, AT, None)
and now extened for location:
another_filter = bdq.Filter("reference product", "has", "concrete, 25MPa")
another_location_filter = bdq.Filter("location", "is", "North America without Quebec")
concrete_query = bdq.Query()
concrete_query.add(another_filter)
concrete_query.add(another_location_filter)
results = concrete_query(db_data)
for db_name, db_key in results:
a = db.get(db_key)
print(a)
'concrete production, 25MPa, for building construction, for interior use, with cement, Portland' (cubic meter, North America without Quebec, None)
Original report by Tomas Navarrete Gutierrez (Bitbucket: tomas_navarrete, ).
Observed state
When you search using the
db.search
facility, you can add filters for location. If the location value contains hyphens ('CA-SK', ‘UN-OCEANIA', ‘IN-North-eastern grid’) or spaces ('Europe without Switzerland', ‘loco location’, 'RER w/o DE+NL+RU') the search does not yield the expected results.but only using the :flag_ca: ‘CA’ part of the location works:
I tested this using
bw2data 3.5
andwhoosh 2.7.4
Suggestion
My suggestion is to change the schema, and use ID for the location field instead of TEXT.
Whoosh documentation states that:
I believe that there is only one location for an activity document, and that it is more useful to treat the location as a whole.
Here is what may need to change in file
schema.py
: