brightway-lca / brightway2-data

Tools for the management of inventory databases and impact assessment methods. Part of the Brightway LCA framework.
https://docs.brightway.dev/
BSD 3-Clause "New" or "Revised" License

Fine grained search of an activity #134

Closed (ccomb closed this issue 1 year ago)

ccomb commented 1 year ago

In Agribalyse we have activity variants whose names differ by just a single character:

>>> pprint(bw2data.Database("Agribalyse 3.1.1").search("Sunflower grain organic"))
['Sunflower grain, organic, system number 4, at farm gate {FR} U' (kilogram, None, None),
 'Sunflower grain, organic, system number 1, at farm gate {FR} U' (kilogram, None, None),
 'Sunflower grain, organic, system number 2, at farm gate {FR} U' (kilogram, None, None),
 'Sunflower grain, organic, system number 5, at farm gate {FR} U' (kilogram, None, None),
 'Sunflower grain, organic, system number 3, at farm gate {FR} U' (kilogram, None, None),
 'Sunflower grain, organic, system number 1, at farm gate {FR} U' (kilogram, None, ('Materials/fuels',)),
 'Sunflower grain, organic, system number 4, at farm gate {FR} U' (kilogram, None, ('Materials/fuels',)),
 'Sunflower grain, organic, system number 3, at farm gate {FR} U' (kilogram, None, ('Materials/fuels',)),
 'Sunflower grain, organic, system number 2, at farm gate {FR} U' (kilogram, None, ('Materials/fuels',)),
 'Sunflower grain, organic, system number 5, at farm gate {FR} U' (kilogram, None, ('Materials/fuels',))]

For Ecobalyse, we need to select the right activity without depending on the activity identifier, which may differ between software, databases, and even database versions. This is also more future-proof, because we won't depend on a code that changes when the database is upgraded. The idea is to keep a search term, rather than the identifier/code, as the reference to an activity. So, from the list above, say we want to select "Sunflower grain, organic, system number 3, at farm gate".

With Brightway's default setup, it is not possible to find this exact activity: the underlying search engine ignores single-character terms by default, and putting the phrase in quotes does not help either:

>>> pprint(db.search('Sunflower grain organic system number 3'))
['Sunflower grain, organic, system number 5, at farm gate {FR} U' (kilogram, None, None),
 'Sunflower grain, organic, system number 4, at farm gate {FR} U' (kilogram, None, None),
 'Sunflower grain, organic, system number 1, at farm gate {FR} U' (kilogram, None, None),
 'Sunflower grain, organic, system number 3, at farm gate {FR} U' (kilogram, None, None),
 'Sunflower grain, organic, system number 2, at farm gate {FR} U' (kilogram, None, None)]
>>> len(db.search('"Sunflower grain organic system number 3"'))
5
>>> len(db.search("'Sunflower grain organic system number 3'"))
5
>>> len(db.search("name:'Sunflower grain organic system number 3'"))
5

Whoosh has a default StopFilter that drops single-character words from searches. For the name field of activities, I think it would make sense to remove the StopFilter and its minimum size entirely, because every single character may be relevant in an activity name.
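To illustrate the effect, here is a minimal pure-Python stand-in for an analyzer chain (tokenize, lowercase, stop filter). It is not Whoosh code: the `analyze` and `search` helpers, the small stoplist, and the AND matching are all assumptions made for the sketch. It only shows why a minimum token size of 2 makes the five "system number N" variants indistinguishable.

```python
import re

# Simplified stand-in for an analyzer; NOT the Whoosh implementation.
TOKEN_RE = re.compile(r"\w+")
# Assumed, illustrative stoplist (the real default list is longer).
ASSUMED_STOPLIST = frozenset({"a", "an", "and", "at", "of", "the"})


def analyze(text, stoplist=ASSUMED_STOPLIST, minsize=2):
    """Tokenize, lowercase, then drop stop words and too-short tokens."""
    stoplist = stoplist or frozenset()
    tokens = (t.lower() for t in TOKEN_RE.findall(text))
    return [t for t in tokens if len(t) >= minsize and t not in stoplist]


def search(names, query, **analyzer_kwargs):
    """Return names containing every analyzed query token (AND semantics)."""
    wanted = set(analyze(query, **analyzer_kwargs))
    return [n for n in names if wanted <= set(analyze(n, **analyzer_kwargs))]


names = [
    f"Sunflower grain, organic, system number {i}, at farm gate {{FR}} U"
    for i in range(1, 6)
]
query = "Sunflower grain organic system number 3"

# With minsize=2 the "3" is discarded, so all five variants match.
default_hits = search(names, query)
# Without a stoplist and with minsize=1 the "3" survives and is decisive.
relaxed_hits = search(names, query, stoplist=None, minsize=1)

print(len(default_hits))  # 5
print(relaxed_hits)       # only the "system number 3" variant
```

In this sketch the digit never reaches the matching step under the default settings, which is consistent with the five results returned above whatever quoting is used.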

What do you think?

I've tried the following. In the bw2schema, just replace the name field with:

name=TEXT(stored=True, sortable=True, analyzer=StandardAnalyzer(stoplist=None, minsize=1))

Then it's possible to search the exact activity:

>>> db.search("Sunflower grain organic system 3")
['Sunflower grain, organic, system number 3, at farm gate {FR} U' (kilogram, None, None),
 'Sunflower grain, organic, system number 3, at farm gate {FR} U' (kilogram, None, ('Materials/fuels',))]

(If needed, the two remaining results can be disambiguated, as a last resort, by adding the category or the code to the search term.)
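As a rough sketch of that disambiguation step, the remaining results could also be narrowed down in Python after the search. The `pick_unique` helper is hypothetical, and the plain dicts below are stand-ins for the activities returned by the search, with a `categories` key assumed to mirror the third value shown in the output above.

```python
def pick_unique(results, categories=None):
    """Return the single result whose 'categories' equals `categories`.

    Hypothetical helper: raises LookupError unless exactly one result matches,
    so an ambiguous search term is detected instead of silently resolved.
    """
    matches = [r for r in results if r.get("categories") == categories]
    if len(matches) != 1:
        raise LookupError(f"expected exactly one match, found {len(matches)}")
    return matches[0]


# Stand-ins for the two activities left after the refined search.
name = "Sunflower grain, organic, system number 3, at farm gate {FR} U"
results = [
    {"name": name, "unit": "kilogram", "categories": None},
    {"name": name, "unit": "kilogram", "categories": ("Materials/fuels",)},
]

chosen = pick_unique(results, categories=("Materials/fuels",))
print(chosen["categories"])  # ('Materials/fuels',)
```

Failing loudly when zero or several candidates remain keeps a stored search term usable as a reference: if a database upgrade makes the term ambiguous, the lookup raises rather than returning the wrong activity.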

cmutel commented 1 year ago

@ccomb Fantastic issue report. Please submit a PR with the fix you have already done, and a test; I will port it to whatever branch you don't use and release a new bw2data version.