SocialGouv / recherche-entreprises

Search API for French companies
https://recherche-entreprises.fabrique.social.gouv.fr

feat(api): add ES score #197

Closed revolunet closed 11 months ago

revolunet commented 2 years ago

fix #196

github-actions[bot] commented 2 years ago

🎉 Deployment for commit 4a09f90da32149700a0c803674eb12e4fd8d447f :

Ingresses
- 🚀 [https://api-recherche-entreprises-issue-196.dev.fabrique.social.gouv.fr/](https://api-recherche-entreprises-issue-196.dev.fabrique.social.gouv.fr/)
- 🚀 [https://app-front-recherche-entreprises-issue-196.dev.fabrique.social.gouv.fr/](https://app-front-recherche-entreprises-issue-196.dev.fabrique.social.gouv.fr/)

Docker images
- 📦 docker pull ghcr.io/socialgouv/recherche-entreprises/api:sha-4a09f90da32149700a0c803674eb12e4fd8d447f
- 📦 docker pull ghcr.io/socialgouv/recherche-entreprises/front:sha-4a09f90da32149700a0c803674eb12e4fd8d447f

Debug
- [📕 Loki logs for namespace recherche-entreprises-issue-196](https://grafana.fabrique.social.gouv.fr/explore?orgId=1&left=%5B%22now-6h%22,%22now%22,%22Loki%22,%7B%22expr%22:%22%7Bnamespace%3D%5C%22recherche-entreprises-issue-196%5C%22%7D%22%7D%5D)
- [📈 Pods monitoring for namespace recherche-entreprises-issue-196](https://grafana.fabrique.social.gouv.fr/d/85a562078cdf77779eaa1add43ccec1e/kubernetes-compute-resources-namespace-pods?orgId=1&refresh=10s&var-datasource=default&var-cluster=dev2&var-namespace=recherche-entreprises-issue-196)
- [📈 Workloads monitoring for namespace recherche-entreprises-issue-196](https://grafana.fabrique.social.gouv.fr/d/a87fb0d919ec0ea5f6543124e16c42a5/kubernetes-compute-resources-namespace-workloads?orgId=1&refresh=10s&var-datasource=default&var-cluster=dev2&var-namespace=recherche-entreprises-issue-196&var-type=deployment)
- [👮‍♂️ Namespace rancher recherche-entreprises-issue-196](https://rancher.fabrique.social.gouv.fr/dashboard/c/c-gjtkk/explorer/namespace/recherche-entreprises-issue-196)
- [👮‍♂️ Deployment app-api](https://rancher.fabrique.social.gouv.fr/dashboard/c/c-gjtkk/explorer/apps.deployment/recherche-entreprises-issue-196/app-api)
- [👮‍♂️ Deployment app-front](https://rancher.fabrique.social.gouv.fr/dashboard/c/c-gjtkk/explorer/apps.deployment/recherche-entreprises-issue-196/app-front)

yohanboniface commented 2 years ago

Cool, thanks!

Do you think it would be possible to get an order of magnitude for the score? Ideally it would fall between 0 and 1 (to tell whether a score is good "in absolute terms"), or else could we have the maxScore alongside it?

revolunet commented 2 years ago

Changing the score is actually not trivial; it is computed from the query and is not normalized to [0,1] :/

The maxScore is the score of the first item in the list, if I'm not mistaken
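
To illustrate the maxScore idea, here is a minimal sketch that divides each hit's score by the score of the best hit in the same response. The `score` and `relativeScore` field names and the `with_relative_score` helper are assumptions made for this example, not the API's actual response shape. Note that this only ranks hits relative to the best one for a given query; it does not say whether the best hit is good in absolute terms.

```python
from typing import Any


def with_relative_score(hits: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return copies of the hits with a hypothetical `relativeScore` in [0, 1].

    Assumes every hit carries a non-negative numeric `score` field
    (an assumed name for the raw Elasticsearch score).
    """
    if not hits:
        return []
    # The best hit is normally first, but taking max() avoids relying on order.
    max_score = max(hit["score"] for hit in hits)
    return [
        {**hit, "relativeScore": hit["score"] / max_score if max_score > 0 else 0.0}
        for hit in hits
    ]


# Made-up scores, for illustration only.
sample_hits = [
    {"label": "first hit", "score": 20.0},
    {"label": "second hit", "score": 10.0},
    {"label": "third hit", "score": 5.0},
]
ranked = with_relative_score(sample_hits)
```

With this approach the top hit always gets 1.0, even for a poor match, which is why it cannot replace an absolute quality measure.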

An example "explain" output for the score computation:

{
  "value": 21.91652,
  "description": "sum of:",
  "details": [
    {
      "value": 17.90895,
      "description": "sum of:",
      "details": [
        {
          "value": 9.977191,
          "description": "weight(namingMain:michelin in 9444530) [PerFieldSimilarity], result of:",
          "details": [
            {
              "value": 9.977191,
              "description": "score(freq=1.0), computed as boost * idf * tf from:",
              "details": [
                {
                  "value": 2.2,
                  "description": "boost",
                  "details": []
                },
                {
                  "value": 9.977191,
                  "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                  "details": [
                    {
                      "value": 1515,
                      "description": "n, number of documents containing term",
                      "details": []
                    },
                    {
                      "value": 32628324,
                      "description": "N, total number of documents with field",
                      "details": []
                    }
                  ]
                },
                {
                  "value": 0.45454544,
                  "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                  "details": [
                    {
                      "value": 1,
                      "description": "freq, occurrences of term within document",
                      "details": []
                    },
                    {
                      "value": 1.2,
                      "description": "k1, term saturation parameter",
                      "details": []
                    },
                    {
                      "value": 0,
                      "description": "b, length normalization parameter",
                      "details": []
                    },
                    {
                      "value": 4,
                      "description": "dl, length of field",
                      "details": []
                    },
                    {
                      "value": 1.7728646,
                      "description": "avgdl, average length of field",
                      "details": []
                    }
                  ]
                }
              ]
            }
          ]
        },
        {
          "value": 7.931761,
          "description": "weight(naming:michelin in 9444530) [PerFieldSimilarity], result of:",
          "details": [
            {
              "value": 7.931761,
              "description": "score(freq=1.0), computed as boost * idf * tf from:",
              "details": [
                {
                  "value": 2.2,
                  "description": "boost",
                  "details": []
                },
                {
                  "value": 7.9317613,
                  "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                  "details": [
                    {
                      "value": 11722,
                      "description": "n, number of documents containing term",
                      "details": []
                    },
                    {
                      "value": 32639261,
                      "description": "N, total number of documents with field",
                      "details": []
                    }
                  ]
                },
                {
                  "value": 0.45454544,
                  "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                  "details": [
                    {
                      "value": 1,
                      "description": "freq, occurrences of term within document",
                      "details": []
                    },
                    {
                      "value": 1.2,
                      "description": "k1, term saturation parameter",
                      "details": []
                    },
                    {
                      "value": 0,
                      "description": "b, length normalization parameter",
                      "details": []
                    },
                    {
                      "value": 4,
                      "description": "dl, length of field",
                      "details": []
                    },
                    {
                      "value": 2.6909916,
                      "description": "avgdl, average length of field",
                      "details": []
                    }
                  ]
                }
              ]
            }
          ]
        }
      ]
    },
    {
      "value": 3.945307,
      "description": "Saturation function on the _feature field for the etablissements feature, computed as w * S / (S + k) from:",
      "details": [
        {
          "value": 4,
          "description": "w, weight of this function",
          "details": []
        },
        {
          "value": 1.84375,
          "description": "k, pivot feature value that would give a score contribution equal to w/2",
          "details": []
        },
        {
          "value": 133,
          "description": "S, feature value",
          "details": []
        }
      ]
    },
    {
      "value": 0.062262263,
      "description": "Saturation function on the _feature field for the siretRank feature, computed as w * S / (S + k) from:",
      "details": [
        {
          "value": 0.1,
          "description": "w, weight of this function",
          "details": []
        },
        {
          "value": 51814485000000,
          "description": "k, pivot feature value that would give a score contribution equal to w/2",
          "details": []
        },
        {
          "value": 85487029000000,
          "description": "S, feature value",
          "details": []
        }
      ]
    },
    {
      "value": 0,
      "description": "match on required clause, product of:",
      "details": [
        {
          "value": 0,
          "description": "# clause",
          "details": []
        },
        {
          "value": 1,
          "description": "etatAdministratifUniteLegale:A",
          "details": []
        }
      ]
    },
    {
      "value": 0,
      "description": "match on required clause, product of:",
      "details": [
        {
          "value": 0,
          "description": "# clause",
          "details": []
        },
        {
          "value": 1,
          "description": "etatAdministratifEtablissement:A",
          "details": []
        }
      ]
    }
  ]
}
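
The arithmetic in the explain output above can be re-derived by hand. The sketch below plugs the values from the output into the three formulas quoted in its `description` fields (BM25 idf, BM25 tf, and the saturation function). The function and variable names are chosen for this example only, and `log` is taken to be the natural logarithm, which is what reproduces the reported idf values.

```python
import math


def bm25_idf(n: int, N: int) -> float:
    # idf = log(1 + (N - n + 0.5) / (n + 0.5))
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq: float, k1: float, b: float, dl: float, avgdl: float) -> float:
    # tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


def saturation(w: float, k: float, S: float) -> float:
    # w * S / (S + k): tends towards w as the feature value S grows
    return w * S / (S + k)


# Values copied from the explain output.
naming_main = 2.2 * bm25_idf(1515, 32628324) * bm25_tf(1, 1.2, 0, 4, 1.7728646)
naming = 2.2 * bm25_idf(11722, 32639261) * bm25_tf(1, 1.2, 0, 4, 2.6909916)
etablissements = saturation(4, 1.84375, 133)
siret_rank = saturation(0.1, 51814485000000, 85487029000000)

# The two required filter clauses contribute 0 to the sum.
total = naming_main + naming + etablissements + siret_rank
print(naming_main, naming, etablissements, siret_rank, total)
```

Two things stand out. With b = 0 and freq = 1, tf is 1 / 2.2, so it cancels the 2.2 boost exactly and each text clause contributes its raw idf. And since idf depends on how rare the query term is in the index, the text part of the score has no fixed upper bound across queries, which is why the score is not comparable from one query to the next.
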
yohanboniface commented 2 years ago

Wouldn't there be a way to add, on the fly, a comparison score between each result found and the searched string? Something like a Levenshtein or n-gram comparison. I believe I did that in my wild youth, but it's a distant memory. If you want, I can dig further :)
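
As a rough illustration of that suggestion, the sketch below computes two query-independent similarity scores in [0, 1] between the searched string and a result label: one based on Levenshtein distance, one on character trigram overlap (Jaccard). All function names are hypothetical and nothing here comes from the project's code; it only shows what such a post-processing step could look like.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions), two-row DP."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(query: str, label: str) -> float:
    """1.0 for identical strings (case-insensitive), 0.0 for fully different."""
    q, l = query.casefold().strip(), label.casefold().strip()
    longest = max(len(q), len(l))
    return 1.0 if longest == 0 else 1.0 - levenshtein(q, l) / longest


def trigrams(text: str) -> set[str]:
    # Pad so that word boundaries also produce trigrams.
    padded = f"  {text.casefold().strip()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(query: str, label: str) -> float:
    """Jaccard overlap of the two trigram sets, in [0, 1]."""
    a, b = trigrams(query), trigrams(label)
    return len(a & b) / len(a | b)


# Example: compare a query with two candidate labels.
exact = levenshtein_similarity("michelin", "MICHELIN")
partial = trigram_similarity("michelin", "michelin travel partner")
```

Unlike the Elasticsearch score, both values are bounded and comparable across queries, at the cost of ignoring the index-side signals (term rarity, number of establishments) that the explain output shows.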