gbif / registry

Search ranking - what fields are included #427

Open MortenHofft opened 2 years ago

MortenHofft commented 2 years ago

https://api.gbif.org/v1/dataset/search?publishing_country=CO&q=Universidad%20del%20Atl%C3%A1ntico

The phrase "Universidad del Atlántico" doesn't seem to appear in any of the top results. Yet it is part of the text of one dataset, https://api.gbif.org/v1/dataset/search?publishing_country=CO&q=Cerambycidae%20from%20the%20Caribbean%20region%20of%20Colombia, in both the description and the publisher title:

"description": "This resource presents the checklist of the species from the family Cerambycidae (Coleoptera: Cerambycoidea) collected in the Caribbean region of Colombia, founded during the revision of entomological collections in Colombia. In this checklist one subfamily, 34 tribes, 90 genera and 132 species are reported. This information was gathered from the following entomological collections in Colombia: Universidad del Atlántico, Puerto Colombia (UARC), Museo Javeriano historia natural Lorenzo Uribe, S.J., Pontificia Universidad Javeriana, Bogotá (MPUJ), Instituto de Ciencias Naturales, Universidad Nacional, Bogotá (ICN), Colección Taxonómica Nacional de Insectos Luis María Murillo, Corporación Colombiana de Investigación agropecuaria, Mosquera (CTNI). For all species the distribution in the Caribbean departments was indicated.",
"publishingOrganizationKey": "69ad50d2-2560-42a4-b522-171f6eca4fa3",
"publishingOrganizationTitle": "Universidad del Atlántico",

I notice that if I go straight to Elasticsearch and run the query below, I get that dataset as the first result:

{
    "size": 1,
    "from": 0,
    "query": {
        "bool": {
            "must": [
                {
                    "match": {
                        "all": {
                            "query": "Universidad del Atlántico"
                        }
                    }
                }
            ],
            "filter": [
                {
                    "term": {
                        "publishingCountry": "CO"
                    }
                }
            ]
        }
    }
}

Is it a matter of us boosting some fields too much (for this use case)?

marcos-lg commented 2 years ago

@MortenHofft yes, we boost some fields. The ES query that is executed for that request is this:

{
    "from": 0,
    "size": 20,
    "query": {
        "bool": {
            "must": [
                {
                    "function_score": {
                        "query": {
                            "multi_match": {
                                "query": "Universidad del Atlántico",
                                "fields": [
                                    "all^1.0",
                                    "description^8.0",
                                    "doi^25.0",
                                    "hostingOrganizationTitle^5.0",
                                    "keyword^10.0",
                                    "metadata^3.0",
                                    "networkTitle^4.0",
                                    "projectId^2.0",
                                    "publishingOrganizationTitle^5.0",
                                    "title^20.0"
                                ],
                                "type": "best_fields",
                                "operator": "OR",
                                "slop": 100,
                                "prefix_length": 0,
                                "max_expansions": 50,
                                "minimum_should_match": "25%",
                                "tie_breaker": 0.2,
                                "zero_terms_query": "NONE",
                                "auto_generate_synonyms_phrase_query": true,
                                "fuzzy_transpositions": true,
                                "boost": 1.0
                            }
                        },
                        "functions": [
                            {
                                "filter": {
                                    "match_all": {
                                        "boost": 1.0
                                    }
                                },
                                "field_value_factor": {
                                    "field": "dataScore",
                                    "factor": 1.0,
                                    "missing": 0.0,
                                    "modifier": "ln2p"
                                }
                            }
                        ],
                        "score_mode": "multiply",
                        "boost_mode": "multiply",
                        "max_boost": 3.4028235E38,
                        "boost": 1.0
                    }
                }
            ],
            "filter": [
                {
                    "term": {
                        "publishingCountry": {
                            "value": "CO",
                            "boost": 1.0
                        }
                    }
                }
            ],
            "adjust_pure_negative": true,
            "boost": 1.0
        }
    },
    "_source": {
        "includes": [
            "title",
            "type",
            "subtype",
            "description",
            "publishingOrganizationKey",
            "publishingOrganizationTitle",
            "publishingCountry",
            "endorsingNodeKey",
            "hostingOrganizationKey",
            "hostingOrganizationTitle",
            "hostingCountry",
            "license",
            "project.identifier",
            "nameUsagesCount",
            "occurrenceCount",
            "keyword",
            "decade",
            "countryCoverage",
            "doi",
            "networkKeys",
            "networkTitle"
        ],
        "excludes": [
            "all"
        ]
    },
    "sort": [
        {
            "_score": {
                "order": "desc"
            }
        }
    ],
    "track_total_hits": 2147483647
}
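
To make the effect of the boosts concrete, here is a minimal sketch of how a `best_fields` query with a `tie_breaker` combines per-field scores, and how the `ln2p` modifier on `dataScore` then scales the result. The boosts and the tie breaker are copied from the query above; the raw per-field scores and the two example datasets are hypothetical, chosen only to illustrate the arithmetic, and are not real relevance scores from the index.

```python
import math

# Boosts copied from the multi_match "fields" list in the query above.
FIELD_BOOSTS = {
    "all": 1.0, "description": 8.0, "doi": 25.0,
    "hostingOrganizationTitle": 5.0, "keyword": 10.0, "metadata": 3.0,
    "networkTitle": 4.0, "projectId": 2.0,
    "publishingOrganizationTitle": 5.0, "title": 20.0,
}
TIE_BREAKER = 0.2  # "tie_breaker" from the query above


def best_fields_score(raw_scores):
    """Best boosted field score plus tie_breaker times all other field scores."""
    boosted = [FIELD_BOOSTS[field] * score for field, score in raw_scores.items()]
    if not boosted:
        return 0.0
    best = max(boosted)
    return best + TIE_BREAKER * (sum(boosted) - best)


def final_score(raw_scores, data_score=0.0):
    """Multiply the text score by ln(2 + dataScore), i.e. the ln2p modifier."""
    return best_fields_score(raw_scores) * math.log(2.0 + data_score)


# Hypothetical raw scores (assumptions, not measured values):
# dataset A matches a single query word in its title only;
# dataset B matches the full phrase in description and publisher title.
dataset_a = {"title": 1.0}
dataset_b = {"description": 2.0, "publishingOrganizationTitle": 2.0}

print("A:", final_score(dataset_a))
print("B:", final_score(dataset_b))
```

Under these assumed numbers, the single-word title match in A outranks the full-phrase match in B, because the title boost of 20 outweighs the description boost of 8 plus the tie-broken publisher title contribution. A higher `dataScore` on either dataset would shift the balance further.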

MortenHofft commented 2 years ago

Is OR the operator users expect? I wonder whether AND would make more sense. With OR, the result set grows as the user adds more words. With AND, the results narrow as you type more.

For reference: if I search on Amazon, they seem to use AND: chocolate: 20K+; chocolate milk: 8K. Same for Google: chocolate: 3B; chocolate milk: 800M.

For GBIF it is, obviously, the opposite: chocolate: 1; chocolate milk: 4
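
The difference between the two operators can be illustrated with a toy in-memory search. This is only a sketch: the `search` function and the five-document corpus are hypothetical and have nothing to do with the actual GBIF index or with how Elasticsearch analyses text.

```python
def search(docs, query, operator="OR"):
    """Return the docs containing any (OR) or all (AND) of the query words."""
    terms = query.lower().split()
    combine = all if operator == "AND" else any
    return [
        doc for doc in docs
        if combine(term in doc.lower().split() for term in terms)
    ]


# Hypothetical corpus, for illustration only.
DOCS = ["chocolate cake", "chocolate milk", "oat milk", "goat milk", "milk powder"]

one_word = search(DOCS, "chocolate")
two_words_or = search(DOCS, "chocolate milk", operator="OR")
two_words_and = search(DOCS, "chocolate milk", operator="AND")

print(len(one_word), len(two_words_or), len(two_words_and))
```

With OR, adding "milk" pulls in every document that mentions milk, so the count goes up. With AND, only documents containing both words remain, so the count goes down.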

I wonder if there is at least a way to score AND matches higher than OR matches.
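
One common way to do this in Elasticsearch is to keep the OR query as the required clause and add the same query with `"operator": "AND"` as an optional `should` clause, so documents matching every word get extra score without excluding the rest. The sketch below builds such a request body as a Python dict. The `build_query` helper and the `and_boost` value of 2.0 are assumptions for illustration, not what the registry does today, and the field list is shortened.

```python
import json


def build_query(text, fields, and_boost=2.0):
    """Build a bool query: OR match is required, AND match only adds score."""
    base = {"query": text, "fields": fields, "type": "best_fields"}
    return {
        "bool": {
            # Required clause: any of the words may match (current behaviour).
            "must": [{"multi_match": {**base, "operator": "OR"}}],
            # Optional clause: boosts documents where all words match.
            "should": [
                {"multi_match": {**base, "operator": "AND", "boost": and_boost}}
            ],
        }
    }


# Shortened field list taken from the boosts in the query earlier in the thread.
fields = ["title^20.0", "description^8.0", "publishingOrganizationTitle^5.0", "all^1.0"]
query = build_query("Universidad del Atlántico", fields)

print(json.dumps({"query": query}, indent=4, ensure_ascii=False))
```

Note that with `best_fields` the operator is applied per field, so the AND clause only rewards documents where all words occur in the same field. The right `and_boost` would need to be tuned against real queries.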