ThreeSixtyGiving / grantnav

This is a web based search tool for data in the 360 giving data format.
http://grantnav.threesixtygiving.org/

Recipient searching in filters doesn't understand alternative names like the Recipient search page does #965

Open mariongalley opened 1 year ago

mariongalley commented 1 year ago

Describe the bug

If I search for an organisation that has several legitimate names, e.g. "360Giving" and "360 Giving", in the Recipient search, it returns the relevant results for both forms of the name. If I search in the Recipient filter on the main search page, it only recognises the primary FTC (FindThatCharity) name.

To Reproduce

  1. Go to main search
  2. Type "360Giving" in the Recipient filter
  3. There are no results
  4. Go to Recipient search page
  5. Type "360Giving" in the search
  6. "360 Giving" comes up

Expected behavior

I'd expect the Recipient filter to return similar results to the Recipient search page.

Screenshots image

image

mariongalley commented 1 year ago

I think #903 might be fixed though, as they are deduplicated in the same way

mariongalley commented 1 year ago

Related: searching in the Funder filter is sensitive to diacritics, so “Esmee” doesn't match “Esmée”, whereas it does in the Funder Search and Main Search.
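For illustration, a minimal sketch of diacritic-insensitive matching using Python's standard library. The function names are hypothetical and this is not how GrantNav's filter is implemented; it only shows the kind of folding that would let "Esmee" match "Esmée".

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def name_matches(query: str, name: str) -> bool:
    """Compare query and name with both diacritics and case folded away."""
    return fold_diacritics(query).casefold() in fold_diacritics(name).casefold()
```

With this folding applied to both sides, `name_matches("Esmee", "Esmée Fairbairn Foundation")` is true.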

mariongalley commented 1 year ago

Objective: Use best available name in all places where Recipients data is aggregated across multiple grants/funders, namely: Recipients table on As Funder section of Org page & Recipients filter on main Grants search

mariongalley commented 1 year ago

I think this has been made worse by the fact that FTC often provides capitalised versions of names @michaelwood, e.g. searching Just for Kids Law now doesn't return the main relevant recipient because its canonical name is JUST FOR KIDS LAW

Image

Image

mariongalley commented 1 year ago

I also can no longer find 360Giving, whether I include the space or not

Image

Image

michaelwood commented 10 months ago

Summary

This is a product of the Canonical org name work, where we chose to no longer show incorrect names for recipients/funders. The best improvement we can make without significant reworking is to make sure the search is case insensitive. I have a fix for this awaiting code review: https://github.com/ThreeSixtyGiving/grantnav/commit/1af5af1890c0d23daf52b2066d6149be840919db .

Further explanation

The alternative names and spellings work on https://grantnav.threesixtygiving.org/recipients because this is a "search" across all of the fields in the organisations database, which includes all the alternative names we know about, rather than a "filter" on the grant data.

For illustration:

[{ 
"name": "360 GIVING", 
"alternative_names": "360Giving",
"etc": "..."
}]

So the result on the recipients page comes up because it matches one of the fields ^

However in the grants data on https://grantnav.threesixtygiving.org/search the filters filter the returned grants:

[{
"id" : "grant1",
"canonical_recipient_name": "360 GIVING",
"etc": "..."
},
{
"id": "grant2",
"canonical_recipient_name": "360 GIVING",
"etc": "..."
}]

which results in the filter item:

"360 GIVING (2)"

There's currently no way to combine the recipient canonical name and the recipient name given in the grant without creating duplicates in the lists, which would undo the work we did to make the canonical names filterable (e.g. via the FindThatCharity data) in the grant data.
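To make the search/filter difference concrete, here is a rough Python sketch using the illustrative records above. It is an in-memory stand-in, not the real ElasticSearch queries, and the function names are made up for the example.

```python
from collections import Counter

# Illustrative records copied from the examples above (not real data).
organisations = [
    {"name": "360 GIVING", "alternative_names": "360Giving"},
]
grants = [
    {"id": "grant1", "canonical_recipient_name": "360 GIVING"},
    {"id": "grant2", "canonical_recipient_name": "360 GIVING"},
]


def search_organisations(term):
    """'Search': look for the term in every field of each organisation."""
    return [
        org for org in organisations
        if any(term in str(value) for value in org.values())
    ]


def filter_items(term):
    """'Filter': aggregate grants by canonical name, then match on that name only."""
    counts = Counter(grant["canonical_recipient_name"] for grant in grants)
    return [f"{name} ({count})" for name, count in counts.items() if term in name]
```

Here `search_organisations("360Giving")` finds the organisation through its alternative name, while `filter_items("360Giving")` returns nothing because the only value it can match against is "360 GIVING".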

The only possible way I can think to do this is to create a special field on all the grants such as "ordered_recipient_names" that is an ordered list such as

[{
"id": "grant1",
"ordered_recipient_names": "360 GIVING; 360Giving;",
"etc": "..."
}]

and then have a way in the code to split the list and display the first one whilst still matching on the other text. This would mean refactoring the id_and_name mechanism, which is currently used in 31 places in GrantNav. I estimate a minimum of 5 dev days of work.
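A minimal sketch of the split-and-display idea, assuming the "ordered_recipient_names" field uses ";" as a separator as in the example. The helper names are hypothetical and nothing like this exists in GrantNav yet.

```python
def split_names(ordered_recipient_names):
    """Split the ';'-separated field into a clean list, canonical name first."""
    return [
        name.strip()
        for name in ordered_recipient_names.split(";")
        if name.strip()
    ]


def display_name(ordered_recipient_names):
    """The first entry is the canonical name shown in the filter list."""
    return split_names(ordered_recipient_names)[0]


def matches_any_name(term, ordered_recipient_names):
    """Match the typed term against the canonical name and every alternative."""
    needle = term.casefold()
    return any(
        needle in name.casefold()
        for name in split_names(ordered_recipient_names)
    )
```

With the value "360 GIVING; 360Giving;", the filter would still display "360 GIVING" but typing "360Giving" would now match.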

@TaniaCohen @mariongalley

TaniaCohen commented 10 months ago

Hi Michael

Thanks for this. I have been experimenting with this and I am not sure this is a case-sensitivity issue on the recipient filters, and I don't think it is explained by your feedback on the canonical names.

As an example, I'm currently finding no way to bring up 360Giving in the recipient field box, and the search results appear to be just all the recipients, even when I start typing the exact case-sensitive canonical name.

[image: image.png]

I have found some examples where case-insensitive matching will improve things, but it doesn't seem to be working for all. Are there some cases where the Canonical name field is not being populated?

Tania


michaelwood commented 10 months ago

Dev notes: A potential improvement using ElasticSearch's text matching features is not going to be a quick fix here, as the id_and_name field for the recipients/funders cannot be aggregated as a text field without potential performance problems, and doing so is not recommended:

"RequestError(400, 'search_phase_execution_exception', 'Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [fundingOrganization.id_and_name] in order to load field data by uninverting the inverted index. Note that this can use significant memory.')"

This is back to requiring a refactoring of the use of the id_and_name field, or adding a feature to search across multiple fields (canonical name and other-names-known-by-in-the-data).
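For reference, one general ElasticSearch pattern for this kind of problem is a multi-field: keep the field as a keyword for aggregation and add a text sub-field for matching. The sketch below is written as plain Python dicts. The field name "recipient_name" and the sub-field name "text" are assumptions for illustration, not GrantNav's actual mapping, and this has not been tried against the real index.

```python
# Hypothetical mapping: a keyword field (aggregatable) with a text sub-field.
mapping = {
    "properties": {
        "recipient_name": {
            "type": "keyword",
            "fields": {
                "text": {"type": "text"},
            },
        }
    }
}

# Hypothetical request: match on the analysed text sub-field,
# then bucket the matching grants on the keyword field.
request_body = {
    "size": 0,
    "query": {"match": {"recipient_name.text": "giving"}},
    "aggs": {
        "recipients": {
            "terms": {"field": "recipient_name", "size": 10},
        }
    },
}
```

This avoids setting fielddata=true on a text field, but it would still need the id_and_name handling to be reworked, so it does not change the estimate.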

michaelwood commented 10 months ago

@TaniaCohen unfortunately the screenshot wasn't attached in the GitHub issue. Just to clarify: the case-insensitive matching isn't going to implement the multi-field text searching feature that I think is being described. As there is formally no such organisation as "360Giving", this will remain an issue while we're only searching the canonical names, unless we change that feature significantly.

However adding case insensitive matching should help in many cases as well: image

michaelwood commented 10 months ago

For prioritisation info: total spent time 4.5 hours, estimated refactoring to add multi field / fall-back searching 4 dev days (slightly less than original as I've now tried a few of the simpler options)

michaelwood commented 10 months ago

dev notes: Another possible option is to double down on the slight technical debt of id_and_name field and add another value(s) to it that contains all the other known names which may then match on the keyword index field type

michaelwood commented 10 months ago

To add a bit more context & summary, the change that has brought this issue to light has two parts:

Canonical names

The previous matching mechanism matched duplicated entries where multiple versions of the same name were in use. We have removed and consolidated these using the de-duplication mechanisms. This results in the canonical/registered name being the name used in the filter list which provides a better user experience.

We never intentionally matched against alternative names, it was a product of erroneous data creating duplicates using alternative names.

Capitalised names

The canonical names data is mostly in upper case from the source (unless we're using publisher names from Salesforce). Up until now the mechanism that matched the searched text was based on the rule:

{part} OR {part.capitalize()}, where "part" is each part of the search term split on a space, e.g. "360 giving" is searched for using "360" OR "giving" OR "Giving".

The new addition that will go live imminently also adds the all-lower-case and all-upper-case versions of each part, e.g. "giving" OR "GIVING". This improves matching against the new de-duplicated canonical names.
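Roughly, the combined rule can be sketched like this in Python. This is a simplified illustration of the rule as described above, not the code from the commit, and the function name is made up.

```python
def query_variants(search_term):
    """Return each part of the search term plus its capitalised,
    lower-case and upper-case forms, without duplicates."""
    variants = []
    # split() with no argument also tolerates repeated spaces.
    for part in search_term.split():
        for candidate in (part, part.capitalize(), part.lower(), part.upper()):
            if candidate not in variants:
                variants.append(candidate)
    return variants


def build_query(search_term):
    """Join the variants with OR, as in the rule described above."""
    return " OR ".join(query_variants(search_term))
```

For "360 giving" this gives "360 OR giving OR Giving OR GIVING", so the upper-case canonical name "360 GIVING" can now match.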

michaelwood commented 10 months ago

Case insensitive matching deployed to live

mariongalley commented 10 months ago

We never intentionally matched against alternative names, it was a product of erroneous data creating duplicates using alternative names.

@michaelwood I discussed this with the team and we do still think that there are ways in which the recipient filter used to search "better" that aren't just to do with case sensitivity or the existence of more duplicates. For example, it used to be that when you typed "360", 360Giving would come up, and now you have to type "360 Giv" before it comes up. In general it seems to take typing a lot more of the recognised name before you get relevant results starting to appear. Is it possible that the configuration has changed so that it doesn't start looking up results until further into the prompt?

mariongalley commented 10 months ago

@michaelwood Thank you for applying the case-sensitivity fix, this is a great improvement.

We do want to pursue the alternative names fix but it's not as high a priority. FindThatCharity and the Charity Commission have good alternative names which we could use rather than looking at other names in the data, which could sometimes be misleading. Let's look at this when planning for next FY.

michaelwood commented 10 months ago

The thing about 360 G is that lots of org-ids also begin like that. The filter supports searching by org-id, so what is probably happening is that, since 360 Giving is using a real org-id (GB-CHC-1164883) in the data, its matching score is much lower down the order and it is therefore not being displayed. This is also why searching 360G still gives lots of results even though those names don't appear to match.

image

michaelwood commented 10 months ago

360 Giving is using a real org-id

By which I mean this may have changed recently either via the canonical orgs work or someone has fixed their data

michaelwood commented 10 months ago

I also checked, and no one had changed the code which provides the filter search results for 3 years before I added the case-insensitive matching. I can say with fair certainty that the software hasn't changed in that regard (never 100% certainty, because software development is affected by more factors than just the specific bit of code).