algolia / docsearch-configs

DocSearch - Configurations
https://docsearch.algolia.com/
MIT License

Prioritise Specific URLs in the Configs JSON #5000

Open praneesha opened 2 years ago

praneesha commented 2 years ago

We have implemented DocSearch v3. We need to prioritise a URL (i.e., https://lib.ballerina.io/ballerina/grpc/latest/**) for a particular search term (i.e., grpc), and we have edited the JSON config in the new Crawler web interface as follows.

    {
      indexName: "ballerina",
      pathsToMatch: ["https://lib.ballerina.io/ballerina/grpc/latest/**"],
      recordExtractor: ({ $, helpers }) => {
        return helpers.docsearch({
          recordProps: {
            lvl1: ".content h1",
            content: ".content p, .content li",
            lvl0: {
              selectors: "",
              defaultValue: "Ballerina gRPC",
            },
            lvl2: ".content h2",
            lvl3: ".content h3",
            lvl4: ".content h4",
            lvl5: ".content h5, .content h6",
            site: {
              defaultValue: ["ballerina_api_docs_grpc"],
            },
            tags: {
              defaultValue: ["ballerina_api_docs_grpc"],
            },
            pageRank: "4",
          },
          indexHeadings: true,
        });
      },
    },

Although we scheduled and ran a manual crawl, the search results haven't been updated. What are we missing here?


shortcuts commented 2 years ago

Hey,

Is there another action with a broader pathsToMatch that also matches https://lib.ballerina.io/ballerina/grpc/latest/?

If so, the URL will be crawled in both actions, and one set of records might override the other. You can exclude the URL by adding !https://lib.ballerina.io/ballerina/grpc/latest/** to the pathsToMatch of that other action.
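For illustration, the pair of actions could look like the sketch below. Only the gRPC URLs and the pageRank value come from this thread; the broader action's pathsToMatch and the selectors are illustrative assumptions.

```javascript
// Sketch only: two actions in the Crawler config. The negative pattern
// ("!...") keeps the broad action from also indexing the gRPC pages, so
// only the dedicated action (with its boosted pageRank) creates records
// for them. Selectors are placeholders, not the real config.
{
  indexName: "ballerina",
  pathsToMatch: [
    "https://lib.ballerina.io**/**",
    "!https://lib.ballerina.io/ballerina/grpc/latest/**", // excluded here
  ],
  recordExtractor: ({ $, helpers }) =>
    helpers.docsearch({
      recordProps: {
        lvl1: ".content h1",
        content: ".content p, .content li",
      },
    }),
},
{
  indexName: "ballerina",
  // These pages are now matched only by this action.
  pathsToMatch: ["https://lib.ballerina.io/ballerina/grpc/latest/**"],
  recordExtractor: ({ $, helpers }) =>
    helpers.docsearch({
      recordProps: {
        lvl1: ".content h1",
        content: ".content p, .content li",
        pageRank: "4", // boosts records created by this action
      },
    }),
},
```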

praneesha commented 2 years ago

@shortcuts - Thanks a lot for the quick response.

Yes, we do have another action, which uses this URL. I excluded it as follows and re-ran the crawler.

    indexName: "ballerina",
    pathsToMatch: [
      "https://lib.ballerina.io**/**",
      "!https://lib.ballerina.io/ballerina/grpc/latest/**",
    ],

However, the search results are still not updated as expected. Anything else that we need to do?

shortcuts commented 2 years ago

Looking at your index, records are populated with the correct weight and pageRank. Maybe you need to provide a higher value?

praneesha commented 2 years ago

@shortcuts - I think we will have to exclude the old versions of a particular URL entirely to stop them from appearing in the search results?

For example, do we have to exclude the old versions by adding an entry for each of them as follows?

    indexName: "ballerina",
    pathsToMatch: [
      "https://lib.ballerina.io/ballerina/http/latest/**",
      "!https://lib.ballerina.io/ballerina/http/2.0.1/**",
      "!https://lib.ballerina.io/ballerina/http/2.0.0/**",
    ],

shortcuts commented 2 years ago

Exactly, as long as a URL matches a pathsToMatch, records will be created! You can define URLs you don't want to crawl at all as exclusionPatterns.

praneesha commented 2 years ago

@shortcuts - Now, we have updated the exclusionPatterns as follows.

  startUrls: [
    "https://ballerina.io/",
    "https://lib.ballerina.io/",
    "https://blog.ballerina.io/",
    "https://central.ballerina.io/",
  ],
  renderJavaScript: false,
  sitemaps: ["https://ballerina.io/sitemap.xml"],
  exclusionPatterns: [
    "https://lib.ballerina.io/ballerina/http/2.0.1/**",
    "https://lib.ballerina.io/ballerina/http/2.0.0/**",
    "https://lib.ballerina.io/ballerina/http/1.1.0-beta.2/**",
    "https://lib.ballerina.io/ballerina/http/1.1.0-beta.1/**",
    "https://lib.ballerina.io/ballerina/http/1.1.0-alpha8/**",
  ],

However, we still get these excluded URLs in the search results as shown below. Anything we have missed here?

[Screenshot 2022-01-04 at 16:17:26: excluded URLs still appearing in the search results]

shortcuts commented 2 years ago

You can test them directly in the URL tester (crawler -> editor -> right side tab URL tester) to see if they are excluded. If not, they are crawled.

What are the URLs?

praneesha commented 2 years ago

@shortcuts - I have deprioritised the URL pattern https://ballerina.io**/** relative to https://lib.ballerina.io/ballerina/grpc/latest/**.

However, results from https://ballerina.io**/** still appear in the search results for the search term grpc, and results from the prioritised https://lib.ballerina.io/ballerina/grpc/latest/** are not appearing at all, as shown below.

[Screenshot 2022-01-07 at 12:55:49: search results for the term grpc]

It says the URL is ignored when tested as shown below.

[Screenshot 2022-01-07 at 12:57:33: URL tester reporting the URL as ignored]

What is wrong here?

shortcuts commented 2 years ago

The ranking seems fine in your screenshot; pages with a pageRank of 4 rank higher than pages with a pageRank of 1. A higher pageRank will place results before pages with a lower or no pageRank, see https://docsearch.algolia.com/docs/record-extractor#boosting-search-results-with-pagerank

It says the URL is ignored when tested as shown below.

You passed the ** glob, so it says 404, since that literal URL does not exist on your website.
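If pages from the broader action keep outranking the gRPC pages, one option following the boosting doc linked above is to raise the dedicated action's pageRank further. The sketch below is hypothetical; "10" is an arbitrary value chosen only to exceed the rank of the competing pages.

```javascript
// Inside the gRPC action's recordProps (fragment, not a full config):
recordProps: {
  lvl1: ".content h1",
  content: ".content p, .content li",
  pageRank: "10", // hypothetical boost; any value above competing pages works
},
```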

praneesha commented 2 years ago

@shortcuts - The ** was used as a wildcard to crawl all the URLs that have path segments appended after .../latest/ in https://lib.ballerina.io/ballerina/grpc/latest/. For example, https://lib.ballerina.io/ballerina/grpc/latest/clients/Caller.

Is it incorrect to use the glob like that? If so, how do we crawl the content in all those nested URLs?

However, the URL is still ignored when tested, even though I removed the glob, as shown below.

[Screenshot 2022-01-10 at 11:03:13: URL tester result]

praneesha commented 2 years ago

@shortcuts - Any update on the above?

shortcuts commented 2 years ago

Is it incorrect to use the glob like that? If so how to crawl content in all those nested URLs?

Hey, this is indeed correct for the config, but the URL tester needs a direct URL (no globs).

(For the screenshot, redirect means that the URL found redirects to another one, so we skipped the crawl.)

If there are any URLs that are not crawled, you should check the Monitoring section to see the reason. This FAQ could also help you!

praneesha commented 2 years ago

@shortcuts - Thanks for the response. So, does that mean we cannot crawl URLs that redirect to another one? In that case, do we need to give the redirect target URL in the config?

praneesha commented 2 years ago

@shortcuts - I tried the direct URL, which does not have any redirection associated with it, but it still gets ignored.

We do have URLs like this on the website that match this pattern, and I am not sure why they get ignored in the crawl.

https://lib.ballerina.io/ballerina/grpc/1.1.1/enums/CertValidationType

[Screenshot 2022-01-12 at 15:57:44: URL tester reporting the URL as ignored]

Can you please help us figure out the reason?

shortcuts commented 2 years ago

So, does that mean we cannot crawl URLs that are being redirected to another?

If we find both URLs, we will only crawl the one that does not redirect.

I tried the direct URL, which does not have any redirection associated with it, but it still gets ignored.

As per https://github.com/algolia/docsearch-configs/issues/5000#issuecomment-1009692137, the URL tester only accepts direct URLs, which means you can't use globs in it. Globs are used in the config for pathsToMatch, etc.

So if you try with the direct URL, you will see this (see screenshot): you have multiple actions/pathsToMatch matching this URL, which creates duplicate records (L37, L94). You need to use negative patterns (see L95) to avoid this issue.

[Screenshot 2022-01-12 at 11:33:41: crawler editor showing multiple actions matching the URL]
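Putting the thread's advice together, a trimmed sketch of the resulting config could look like the following. Only the URLs quoted earlier in this issue come from the thread; the selectors and the shortened lists are placeholders.

```javascript
// Config sketch (fragment). exclusionPatterns stop old-version URLs from
// being crawled at all, while the negative pathsToMatch entry prevents
// the broad action from duplicating records for pages owned by the
// dedicated gRPC action.
{
  startUrls: ["https://lib.ballerina.io/"],
  // Never crawled: no records are ever created for these URLs.
  exclusionPatterns: [
    "https://lib.ballerina.io/ballerina/http/2.0.1/**",
    "https://lib.ballerina.io/ballerina/http/2.0.0/**",
  ],
  actions: [
    {
      indexName: "ballerina",
      // Broad action: the negative pattern hands the gRPC "latest"
      // pages over to the dedicated action below.
      pathsToMatch: [
        "https://lib.ballerina.io**/**",
        "!https://lib.ballerina.io/ballerina/grpc/latest/**",
      ],
      recordExtractor: ({ $, helpers }) =>
        helpers.docsearch({
          recordProps: { lvl1: ".content h1", content: ".content p, .content li" },
        }),
    },
    {
      indexName: "ballerina",
      pathsToMatch: ["https://lib.ballerina.io/ballerina/grpc/latest/**"],
      recordExtractor: ({ $, helpers }) =>
        helpers.docsearch({
          recordProps: {
            lvl1: ".content h1",
            content: ".content p, .content li",
            pageRank: "4", // boosts these records above unranked pages
          },
        }),
    },
  ],
}
```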