Open praneesha opened 2 years ago
Hey,
Is there another action with a broader `pathsToMatch` that would match https://lib.ballerina.io/ballerina/grpc/latest/?
If so, the URL will be crawled by both actions, and the results might be overridden. You can exclude the URL by adding `!https://lib.ballerina.io/ballerina/grpc/latest/**` to the `pathsToMatch` of that other action.
@shortcuts - Thanks a lot for the quick response.
Yes, we do have another action, which uses this URL. I excluded it as follows and re-ran the crawler.
```js
indexName: "ballerina",
pathsToMatch: [
  "https://lib.ballerina.io**/**",
  "!https://lib.ballerina.io/ballerina/grpc/latest/**",
],
```
However, the search results are still not updated as expected. Anything else that we need to do?
Looking at your index, records are populated with the correct `weight` and `pageRank`; maybe you need to provide a higher value?
@shortcuts - I think we will have to completely exclude the old versions of a particular URL to stop them from appearing in the search results?
For example:
- https://lib.ballerina.io/ballerina/http/latest/** should be included
- https://lib.ballerina.io/ballerina/http/2.0.1/** should be excluded
- https://lib.ballerina.io/ballerina/http/2.0.0/** should be excluded

Do we have to exclude the old versions by adding an entry for each of them as follows?
```js
indexName: "ballerina",
pathsToMatch: [
  "https://lib.ballerina.io/ballerina/http/latest/**",
  "!https://lib.ballerina.io/ballerina/http/2.0.1/**",
  "!https://lib.ballerina.io/ballerina/http/2.0.0/**",
],
```
Exactly, as long as a URL matches a `pathsToMatch` entry, records will be created! You can define URLs you don't want to crawl as `exclusionPatterns`.
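For reference, the two mechanisms look like this side by side (a sketch only; the index and URL names are taken from this thread, and the exact behaviour is described in the Crawler documentation):

```js
// Option 1: a negative ("!") pattern inside one action's pathsToMatch,
// so only that action skips the URLs.
pathsToMatch: [
  "https://lib.ballerina.io/**",
  "!https://lib.ballerina.io/ballerina/http/2.0.1/**",
],

// Option 2: exclusionPatterns, which keep the URLs out of the crawl
// entirely, regardless of which action would otherwise match them.
exclusionPatterns: [
  "https://lib.ballerina.io/ballerina/http/2.0.1/**",
],
```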
@shortcuts - Now, we have updated the `exclusionPatterns` as follows.
```js
startUrls: [
  "https://ballerina.io/",
  "https://lib.ballerina.io/",
  "https://blog.ballerina.io/",
  "https://central.ballerina.io/",
],
renderJavaScript: false,
sitemaps: ["https://ballerina.io/sitemap.xml"],
exclusionPatterns: [
  "https://lib.ballerina.io/ballerina/http/2.0.1/**",
  "https://lib.ballerina.io/ballerina/http/2.0.0/**",
  "https://lib.ballerina.io/ballerina/http/1.1.0-beta.2/**",
  "https://lib.ballerina.io/ballerina/http/1.1.0-beta.1/**",
  "https://lib.ballerina.io/ballerina/http/1.1.0-alpha8/**",
],
```
However, we still get these excluded URLs in the search results as shown below. Anything we have missed here?
You can test them directly in the URL tester (crawler -> editor -> right side tab URL tester) to see if they are excluded. If not, they are crawled.
What are the URLs?
@shortcuts - I have given the URL https://ballerina.io**/** a lower priority than https://lib.ballerina.io/ballerina/grpc/latest/**.
However, results from https://ballerina.io**/** still appear in the search results for the search term `grpc`, and results from the prioritised https://lib.ballerina.io/ballerina/grpc/latest/** are not appearing at all, as shown below.
It says the URL is ignored when tested as shown below.
What is wrong here?
The ranking seems fine in your screenshot: pages with a rank of 4 are placed higher than pages with a rank of 1. A higher page rank will place results before pages with a lower (or no) page rank, see https://docsearch.algolia.com/docs/record-extractor#boosting-search-results-with-pagerank
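Following the linked documentation, the `pageRank` boost is set per action inside the `recordExtractor`. A minimal sketch (selectors and the value `10` are illustrative, not taken from this config):

```js
// Every record created by this action carries the given pageRank,
// placing its pages ahead of records with a lower (or no) pageRank.
{
  indexName: "ballerina",
  pathsToMatch: ["https://lib.ballerina.io/ballerina/grpc/latest/**"],
  recordExtractor: ({ $, helpers }) =>
    helpers.docsearch({
      recordProps: {
        lvl0: { selectors: "h1" },
        lvl1: "h2",
        content: "p, li",
        pageRank: "10", // higher than the broader action's rank
      },
    }),
}
```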
> It says the URL is ignored when tested as shown below.

You passed the `**` glob, so it says 404, since that URL does not exist on your website.
@shortcuts - The `**` was used as a wildcard to crawl all the URLs with elements appended after .../latest/ in https://lib.ballerina.io/ballerina/grpc/latest/, for example https://lib.ballerina.io/ballerina/grpc/latest/clients/Caller.
Is it incorrect to use the glob like that? If so, how do we crawl the content in all those nested URLs?
However, the URL is still ignored when tested even though I removed the glob, as shown below.
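For context, the trailing `**` behaves roughly like the simplified matcher below (this is an illustration, not the crawler's actual pattern engine): it matches any characters, including further path segments, which is why nested pages are crawled but the glob itself is not a fetchable page.

```js
// Simplified sketch: expand "**" into "match anything" and test the URL.
function matchesGlob(url, pattern) {
  const escaped = pattern
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters
    .replace(/\*\*/g, ".*");               // "**" matches across segments
  return new RegExp(`^${escaped}$`).test(url);
}

// A nested page under /latest/ is matched by the trailing "**":
matchesGlob(
  "https://lib.ballerina.io/ballerina/grpc/latest/clients/Caller",
  "https://lib.ballerina.io/ballerina/grpc/latest/**"
); // true

// A page from another version is not:
matchesGlob(
  "https://lib.ballerina.io/ballerina/grpc/1.1.1/enums/CertValidationType",
  "https://lib.ballerina.io/ballerina/grpc/latest/**"
); // false
```

So the glob is correct in `pathsToMatch`, but pasting it into the URL tester asks the crawler to fetch a literal URL containing `**`, which does not exist.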
@shortcuts - Any update on the above?
> Is it incorrect to use the glob like that? If so how to crawl content in all those nested URLs?

Hey, this is indeed correct in the config, but the URL tester needs a direct URL.
(In the screenshot, "redirect" means that the URL found redirects to another one, so we skipped the crawl.)
If any URLs are not crawled, you should check the Monitoring section to see the reason. This FAQ could also help you!
@shortcuts - Thanks for the response. So, does that mean we cannot crawl URLs that redirect to another one? In that case, do we need to give the redirected URL in the config?
@shortcuts - I tried the direct URL, which does not have any redirection associated with it, but it still gets ignored.
We do have URLs like this on the website, which match this pattern, and I am not sure why they get ignored in the crawl.
https://lib.ballerina.io/ballerina/grpc/1.1.1/enums/CertValidationType
Can you please help figure out the reason?
> So, does that mean we cannot crawl URLs that are being redirected to another?

If we find both URLs, we will only crawl the one that does not redirect.

> tried the direct URL, which does not have any redirection associated with it but still that also gets ignored.

As per https://github.com/algolia/docsearch-configs/issues/5000#issuecomment-1009692137, the URL tester only accepts direct URLs, which means you can't use globs in it. Globs are used in the config, for `pathsToMatch`, etc.
So if you try with the direct URL, you will see this (see screenshot), which means that you have multiple `actions`/`pathsToMatch` matching this URL, which creates duplicate records (L37, L94). You need to use negative patterns (see L95) to avoid this issue.
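In other words, the duplicate records come from two overlapping actions, and the fix is a negative pattern on the broader one. A sketch, reusing the names from this thread:

```js
// Both actions would otherwise create records for the same grpc pages;
// the "!" pattern makes the broad action skip them.
actions: [
  {
    indexName: "ballerina",
    pathsToMatch: [
      "https://lib.ballerina.io/**",
      "!https://lib.ballerina.io/ballerina/grpc/latest/**",
    ],
  },
  {
    indexName: "ballerina",
    pathsToMatch: ["https://lib.ballerina.io/ballerina/grpc/latest/**"],
  },
],
```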
We have implemented DocSearch v3. We need to prioritize a URL (i.e., https://lib.ballerina.io/ballerina/grpc/latest/**) for a particular search term (i.e., `grpc`), and we have edited the JSON config in the new Crawler web interface as follows. Although we scheduled and ran a manual crawl, the search results haven't been updated. What are we missing here?