eclipse / openvsx

An open-source registry for VS Code extensions
https://open-vsx.org/
Eclipse Public License 2.0

Searching for exact ID is not reliable #727

Open filiptronicek opened 1 year ago

filiptronicek commented 1 year ago

When you search Open VSX for the Jupyter extension by its ID (ms-toolsai.jupyter), the first result is CodeStream.codeStream. I believe this is because we treat extension namespaces and extension names separately, and the dot in the middle prevents a better match.

Maybe we can add the extension id (namespace.extension) to the search criteria or try resolving ID-looking search queries directly.
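
The second idea, resolving ID-looking queries directly, could be sketched roughly as follows. This is only an illustration: the class `ExtensionIdQuery`, its methods, and the assumed ID shape (two segments of letters, digits, `-` or `_`, separated by a single `.` or `/`) are hypothetical and not part of the Open VSX code base.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: recognizes queries shaped like "namespace.extension". */
final class ExtensionIdQuery {
    // Assumed ID shape: two non-empty segments of letters, digits, '-' or '_',
    // separated by exactly one '.' or '/'. The registry's real naming rules may differ.
    private static final Pattern ID_LIKE =
            Pattern.compile("^([A-Za-z0-9][A-Za-z0-9_-]*)[./]([A-Za-z0-9][A-Za-z0-9_-]*)$");

    final String namespace;
    final String name;

    private ExtensionIdQuery(String namespace, String name) {
        this.namespace = namespace;
        this.name = name;
    }

    /** Returns the parsed ID if the query looks like one, otherwise an empty Optional. */
    static Optional<ExtensionIdQuery> parse(String queryString) {
        if (queryString == null) {
            return Optional.empty();
        }
        Matcher matcher = ID_LIKE.matcher(queryString.trim());
        if (!matcher.matches()) {
            return Optional.empty();
        }
        return Optional.of(new ExtensionIdQuery(matcher.group(1), matcher.group(2)));
    }

    /** Canonical "namespace.extension" form, whichever separator was typed. */
    String toExtensionId() {
        return namespace + "." + name;
    }

    public static void main(String[] args) {
        String[] queries = {"ms-toolsai.jupyter", "ms-toolsai/jupyter", "jupyter", "java decompiler"};
        for (String query : queries) {
            String result = parse(query).map(ExtensionIdQuery::toExtensionId).orElse("(not ID-like)");
            System.out.println(query + " -> " + result);
        }
    }
}
```

A caller would presumably try a direct lookup only when `parse` returns a value, and fall back to the regular search when no such extension exists, since ordinary queries such as "node.js" also match this shape.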

amvanbaren commented 1 year ago

> Maybe we can add the extension id (namespace.extension) to the search criteria

extensionId is part of the search criteria and has the highest boost.

            boolQuery.should(QueryBuilders.termQuery("extensionId.keyword", options.queryString).caseInsensitive(true)).boost(10);

            // Fuzzy matching of search query in multiple fields
            var multiMatchQuery = QueryBuilders.multiMatchQuery(options.queryString)
                    .field("name").boost(5)
                    .field("displayName").boost(5)
                    .field("tags").boost(3)
                    .field("namespace").boost(2)
                    .field("description")
                    .fuzziness(Fuzziness.AUTO)
                    .prefixLength(2);
            boolQuery.should(multiMatchQuery).boost(5);

            // Prefix matching of search query in display name and namespace
            var prefixString = options.queryString.trim().toLowerCase();
            var namePrefixQuery = QueryBuilders.prefixQuery("displayName", prefixString);
            boolQuery.should(namePrefixQuery).boost(2);
            var namespacePrefixQuery = QueryBuilders.prefixQuery("namespace", prefixString);
            boolQuery.should(namespacePrefixQuery);

Using #684 as a starting point, I think this happens because ms-toolsai.jupyter was updated less recently (2023-03-10T04:05:53.638673Z) than codestream.codestream (2023-03-24T15:36:43.527142Z), which may make it score as less relevant.

        var relevance = ratingRelevance * limit(ratingValue) + downloadsRelevance * limit(downloadsValue)
                + timestampRelevance * limit(timestampValue);

@filiptronicek Do you want me to check if this is a common issue for all exact ID searches?

filiptronicek commented 1 year ago

That's really interesting. CodeStream has about 5K downloads, while Jupyter has about 800K. Maybe download counts could be taken into account as well, since people are more likely to search for popular extensions.

I also found that searching for ms-toolsai/jupyter (note the / instead of .) returns the correct result. Maybe CodeStream is just odd with its metadata. I think we can keep this issue open in case we bump into any other examples.

amvanbaren commented 1 year ago

It is taken into account, but freshness (timestamp) is prioritized over downloads. From https://github.com/EclipseFdn/open-vsx.org/blob/production/configuration/application.yml:

    relevance:
      rating: 0.2
      downloads: 1.0
      timestamp: 3.0
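
To see how these weights interact, here is a small sketch that plugs them into the relevance formula quoted earlier in the thread. Two things are assumptions for illustration: that `limit(...)` clamps its argument to the range [0, 1], and the normalized input values, which are made up and are not the real numbers for Jupyter or CodeStream.

```java
/** Illustrative only: mirrors the relevance formula with the production weights. */
final class RelevanceSketch {
    // Weights from the application.yml excerpt above.
    static final double RATING_RELEVANCE = 0.2;
    static final double DOWNLOADS_RELEVANCE = 1.0;
    static final double TIMESTAMP_RELEVANCE = 3.0;

    // Assumption: limit(...) clamps a normalized value to [0, 1].
    static double limit(double value) {
        return Math.max(0.0, Math.min(1.0, value));
    }

    static double relevance(double ratingValue, double downloadsValue, double timestampValue) {
        return RATING_RELEVANCE * limit(ratingValue)
                + DOWNLOADS_RELEVANCE * limit(downloadsValue)
                + TIMESTAMP_RELEVANCE * limit(timestampValue);
    }

    public static void main(String[] args) {
        // Hypothetical normalized inputs: same rating, very different downloads and freshness.
        double popularButOlder = relevance(0.5, 1.0, 0.5);
        double nicheButFresher = relevance(0.5, 0.1, 0.9);
        System.out.printf("popular but older:  %.2f%n", popularButOlder);
        System.out.printf("niche but fresher:  %.2f%n", nicheButFresher);
        System.out.println("fresher ranks higher: " + (nicheButFresher > popularButOlder));
    }
}
```

Under these assumptions, the maximum downloads contribution is 1.0, while a difference of 0.4 in the timestamp value is already worth 1.2, so a fresher extension can outrank a far more popular one.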

Mdnou commented 1 year ago

> It is taken into account, but freshness (timestamp) is prioritized over downloads. From https://github.com/EclipseFdn/open-vsx.org/blob/production/configuration/application.yml:
>
>     relevance:
>       rating: 0.2
>       downloads: 1.0
>       timestamp: 3.0

https://github.com/dgileadi/vscode-java-decompiler/issues/17#issue-1489959120