artifacthub / hub

Find, install and publish Cloud Native packages
https://artifacthub.io
Apache License 2.0
1.71k stars 234 forks

Irrelevant search by default #2632

Closed (marniks7 closed 1 year ago)

marniks7 commented 1 year ago

Describe the bug
Search results are irrelevant by default.

To Reproduce
Steps to reproduce the behavior:

  1. Go to https://artifacthub.io/packages/search?ts_query_web=postgres&sort=relevance&page=1 (this is the default search).

Actual result: the top results are about a year old and have had no updates.

Expected behavior
Bitnami should be among the first results, as it is when sorting by stars (p.s. I am not affiliated with Bitnami in any way). It is a long-established, trusted chart with the latest version and regular updates over the years, which makes it one of the most relevant results.


tegioz commented 1 year ago

Hi @marniks7

For each package, AH builds a text search document from some pieces of information, like the package name, description or keywords, among others. Each of those pieces is assigned a different weight. When sorting the search results by relevance, we rank higher matches in the pieces with more weight, like the package name, as otherwise often some "less relevant" packages with many stars get easily to the top.

In this case, when you searched for postgres, some exact matches in package names were found, so they were displayed first (please note that Bitnami's package is named postgresql). I wouldn't call an exact match on the package name an irrelevant result, but I understand it's not what users may expect in this particular case, which makes sense. In situations like this, switching the sorting criteria to stars may be handy, but this may not always be intuitive. According to the view stats of each of those packages, most users seem to be finding their way to Bitnami's version (one of the most viewed packages in AH!), but maybe we can do better anyway :)

I was thinking that maybe we could introduce some sort of package alternative name or alias, and index it with the same weight as if it was the package name. In this case, one of the problems is that we use both postgres and postgresql to refer to the same "thing". This can also apply to other cases like mongo and mongodb, for example. So this may help to teach AH about those variants and improve search results. This information could be easily provided by the publishers, using a special annotation in the case of Helm charts.
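As a rough illustration of the weighting and alias idea described above, here is a minimal Python sketch. It is not Artifact Hub's actual implementation: the weights, the field names, the example/postgres package, the star counts and the use of stars as a tie-breaker are all assumptions made up for the example.

```python
# Illustrative weights only; the real values are not given in this thread.
FIELD_WEIGHTS = {
    "name": 1.0,
    "alternative_name": 1.0,  # the proposed alias, weighted like the name
    "keywords": 0.4,
    "description": 0.2,
}


def tokens(value):
    """Split a field value (string or list of strings) into lowercase tokens."""
    if value is None:
        return set()
    if isinstance(value, str):
        value = [value]
    return {tok for item in value for tok in item.lower().split()}


def relevance(package, query):
    """Return the highest weight among the fields that contain the query term."""
    term = query.lower()
    return max(
        (w for field, w in FIELD_WEIGHTS.items() if term in tokens(package.get(field))),
        default=0.0,
    )


def search(packages, query):
    """Return ids of matching packages, most relevant first (stars break ties)."""
    scored = [(relevance(p, query), p["stars"], p["id"]) for p in packages]
    matches = [s for s in scored if s[0] > 0]
    return [pid for _, _, pid in sorted(matches, key=lambda s: (-s[0], -s[1]))]


# Hypothetical data: an exact-name match with few stars and a popular chart
# that only matches "postgres" through its keywords.
packages = [
    {"id": "example/postgres", "name": "postgres", "keywords": [],
     "description": "postgres chart", "stars": 2},
    {"id": "bitnami/postgresql", "name": "postgresql",
     "keywords": ["postgresql", "postgres", "database"],
     "description": "PostgreSQL database", "stars": 300},
]

print(search(packages, "postgres"))  # exact name match ranks first

packages[1]["alternative_name"] = "postgres"
print(search(packages, "postgres"))  # the alias now matches at name weight
```

In this sketch the keyword match alone leaves the popular chart below the exact name match; once the alias is indexed with the same weight as the name, both packages tie on relevance and the tie-breaker decides the order.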

tegioz commented 1 year ago

We've implemented the alternative name idea I suggested in my last comment, plus some minor adjustments to the search results ranking that should favor packages with more stars. I think the combination of both should help with the concrete case you shared (hopefully even move it to the top). These adjustments are experimental, so we now need to see their impact in the wild.

It'd be great if you could send a PR to add the new alternativeName annotation to the Bitnami's PostgreSQL chart 🙂

Thanks!

marniks7 commented 1 year ago

Hi @tegioz, thanks for this improvement!

"It'd be great if you could send a PR to add the new alternativeName annotation to the Bitnami's PostgreSQL chart"

You mean here: https://github.com/bitnami/charts/blob/main/bitnami/postgresql/Chart.yaml (and for the HA version)... and the same for mongo... Hm, I'm not sure about adding those annotations for such a task.


Helm charts already have a keywords section (https://helm.sh/docs/topics/charts/) and...

If we use the keywords together with the rule specified above, we get the alternative names listed below. They are not exactly alternative names, but they may provide some value.

grafana-loki and grafana match
haproxy and proxy match
postgresql and postgres match
postgresql and sql match
postgresql-ha and postgresql match
postgresql-ha and postgres match
postgresql-ha and sql match
external-dns and dns match
nginx-ingress-controller and ingress match
nginx-ingress-controller and nginx match
ejbca and ca match
grafana-tempo and grafana match
kube-prometheus and prometheus match
dokuwiki and wiki match
redis-cluster and redis match
contour-operator and contour match
contour-operator and operator match
rabbitmq-cluster-operator and rabbitmq match
rabbitmq-cluster-operator and operator match
mariadb-galera and mariadb match
mariadb-galera and galera match
wavefront-prometheus-storage-adapter and adapter match
wavefront-prometheus-storage-adapter and wavefront match
tensorflow-resnet and tensorflow match
tensorflow-resnet and resnet match
spring-cloud-dataflow and spring-cloud match
spring-cloud-dataflow and dataflow match
spring-cloud-dataflow and spring match
oauth2-proxy and oauth match
oauth2-proxy and oauth2 match
grafana-operator and grafana match
grafana-operator and operator match
mongodb-sharded and mongodb match
metallb and lb match
kiam and iam match
mysql and sql match
suitecrm and crm match
mediawiki and wiki match
jasperreports and jasper match
sealed-secrets and secrets match
phpbb and php match
memcached and cache match
wavefront-hpa-adapter and adapter match
wavefront-hpa-adapter and wavefront match
metrics-server and metrics match

p.s.

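import os
import yaml  # PyYAML

# Run from the directory that contains one subdirectory per chart.
current_dir = os.getcwd()

subdirectories = [
    d for d in os.listdir(current_dir)
    if os.path.isdir(os.path.join(current_dir, d))
]

for subdir in subdirectories:
    chart_yaml_file = os.path.join(current_dir, subdir, 'Chart.yaml')
    if not os.path.isfile(chart_yaml_file):
        continue
    with open(chart_yaml_file, 'r') as stream:
        data = yaml.safe_load(stream) or {}
    name = data.get('name')
    if not name:
        continue
    # Not every chart declares keywords; treat a missing section as empty.
    keywords = data.get('keywords') or []
    for keyword in keywords:
        # Report keywords contained in the chart name, or containing it.
        if name != keyword and (keyword in name or name in keyword):
            print(f"{name} and {keyword} match")

tegioz commented 1 year ago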

Hi @marniks7

I meant just one 🙂 That would be helpful to introduce this new feature to the Bitnami charts maintainers, encourage others to contribute and do the same on other charts, and improve search results for this particular case.

Please note that Artifact Hub is already indexing the keywords available in the Chart.yaml file, and it's possible to search by them. However, when sorting by relevance, matches in the keywords are considered to be less relevant than a match on the package (chart in this case) name. That's the reason you get Bitnami's PostgreSQL chart when searching for postgres but not at the position you'd expect.

If the keywords had the same weight as the package name, they could easily be misused to alter the ordering of results. This is why we introduced that rule about the alternative name (it must be contained in the name, or the name in it) when adding this new feature. We wanted to ensure that, say, mongodb cannot be used as an alternative name for postgresql, because the alternative name is given the same weight as the name and could easily be used to manipulate the ordering.
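The containment rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this comment, not Artifact Hub's code: the function name is hypothetical, and the case-insensitive comparison and the rejection of empty values are assumptions.

```python
def is_acceptable_alternative_name(name, alternative_name):
    """Check the rule: the alternative name is contained in the name, or the name in it."""
    name = name.strip().lower()
    alt = alternative_name.strip().lower()
    # Assumption: empty values are rejected, since an empty string is
    # trivially "contained" in any other string.
    if not name or not alt:
        return False
    return alt in name or name in alt


for name, alt in [
    ("postgresql", "postgres"),
    ("mongodb", "mongo"),
    ("postgresql", "mongodb"),
]:
    verdict = "accepted" if is_acceptable_alternative_name(name, alt) else "rejected"
    print(f"{name} / {alt}: {verdict}")
```

With a check like this, postgres is accepted for postgresql and mongo for mongodb, while mongodb is rejected as an alternative name for postgresql.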