Hi,
(I've tried to make your index work again, which is why you can see an "empty" index.)
Our crawler returns 503 errors when trying to reach your website. Are there any restrictions on your side? Do you know for how long it has been down?
Unfortunately, I don't know how long it hasn't been working.
I haven't implemented any restrictions since the time it was working.
The only non-content change was updating the Docusaurus version from alpha to beta, but from what I can see, their page has a working search. The only difference I can see in their config is the `contextualSearch: true` option, which I do not have in my config at the moment.
On our side, the crawl job is up and running, but it returns 503 errors on every page of your website, which should be related to some configuration on your server side.
(I haven't found a similar issue on any other DocSearch index yet.)
For hosting, I'm using GitHub Pages and haven't changed anything related to it. If no other GitHub Pages users are reporting similar issues, it's probably not a backend change on that hosting.
The only change that I proactively made was the Docusaurus version bump, but I don't recall any breaking changes related to Algolia, nor do I see any new info in the documentation regarding the integration between those tools.
Do you recall any such issues, @slorber?
@shortcuts I just found out that not only did my index disappear and get re-added today - there is also a completely new API Key displayed within the Algolia Dashboard. It seems that whatever deleted my index also reset the API configuration. I also see that an Application ID is available - should I also use it with DocSearch (it wasn't needed in the past)?
The API key you see in the dashboard is the Analytics one; your search API key didn't change, and they were both generated around the same time: September 9th, 2020, at 11:07.
DocSearch users share the same application ID, which is BH4D9OD16A and doesn't need to be set in the frontend library, as it is the default value.
Update: I found another index with a 503 issue; both crawls work locally but not in production. I'll reach out to them to see if they are also using GitHub Pages.
As far as I understand, the prod site looks up and running; I can't tell why the crawler sees 503.
Note: `trailingSlash: true` has been added recently, so maybe it's the cause of the crawling issues? The canonical URLs have changed and now include a trailing slash.
Thank you @slorber!
Indeed, `trailingSlash` was added recently, as per the recommendation in the Docusaurus documentation. The `true` setting was selected to work correctly with GitHub Pages (`false` was breaking all old links).
I suppose it could be an issue for historic data, but it rather shouldn't be the reason for the current 503 error.
FYI - the Algolia Dashboard now shows the searches (so the Events are sent correctly again), but no search results are displayed.
Hi @MateuszDabrowski,
> FYI - the Algolia Dashboard now shows the searches (so the Events are sent correctly again), but no search results are displayed.
This is related to my first message:
> (I've tried to make your index work again, which is why you can see an "empty" index.)
Regarding the 503 error, the other index that was facing this issue got crawled yesterday without any changes on either side. 🤔
The crawl job is still up for your website and ran around an hour ago, but failed for the same reason.
Ahoj @shortcuts,
I have commented out the `trailingSlash` config to bring it back to the original setup. Could you trigger a crawl to check whether it somehow solves the 503?
A crawl was triggered half an hour ago, and the index did not rebuild even without the `trailingSlash` configuration.
Hi @MateuszDabrowski, I'm not sure where you've found this information but the last crawl was triggered 19 hours ago.
If it can't wait, I can indeed trigger a new crawl; please let me know.
Ahoj @shortcuts,
I checked the Dashboard, but I suppose that in this case the Analytics index is separate from the crawl index.
It indeed is!
A new crawl ran 5 minutes ago and it still returns error 503.
So it shouldn't be related to Docusaurus, as there were no other changes on my side, and per your and @slorber's information, other Docusaurus pages are not impacted.
Could it be due to Cloudflare? Again, I haven't changed anything in its configuration for a long time, but maybe there is an option to allowlist the crawler in the firewall settings?
What else could be done - is there any sense in deleting the Algolia account and creating a new one?
Hey @MateuszDabrowski,
> Could it be due to Cloudflare? Again, I haven't changed anything in its configuration for a long time, but maybe there is an option to allowlist the crawler in the firewall settings?
It definitely could, but I assume some of our users have a configuration similar to yours. In case you need to whitelist a user agent, we use `Algolia DocSearch Crawler` for our scraper; you can also override it by adding `user_agent: "foo"` to your config file.
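In case it helps, here is a minimal sketch of where that option would sit in a DocSearch config file; the values below are illustrative, not the exact config for this site:

```json
{
  "index_name": "mateuszdabrowski",
  "start_urls": ["https://mateuszdabrowski.pl/docs/"],
  "user_agent": "foo"
}
```

Whatever value ends up in `user_agent` is also the string you would allowlist on the Cloudflare side.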
> What else could be done - is there any sense in deleting the Algolia account and creating a new one?
The Algolia account only receives the retrieved records, so recreating it won't have any effect on this.
You could try hosting your website on Netlify and updating your config file with the new `start_urls`, to see if it's related to your hosting solution.
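For illustration, that test would only mean pointing `start_urls` at the temporary deployment; the `.netlify.app` URL below is hypothetical:

```json
{
  "index_name": "mateuszdabrowski",
  "start_urls": ["https://mateuszdabrowski.netlify.app/docs/"]
}
```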
Allowlisted the crawler in Cloudflare and asked GitHub Support to validate whether there was any block deployed from their side.
If this doesn't help, I will deploy to Netlify and check whether it changes anything.
Hey @MateuszDabrowski, a new crawl was started 20 hours ago and failed for the same reason :( I'm not really sure what the issue is, but I guess deploying on Netlify will fix it
Ahoj @shortcuts,
Sorry for the long time without a response, but I was checking a few things.
Based on those checks, it seems that the issue is on either the Algolia or the Cloudflare side.
For now, I have updated the Cloudflare firewall rule from the previous one to a broader one; the previous configuration wasn't being triggered at all.
Hey @MateuszDabrowski
Here's a snippet of the last error returned. This is for one of the URLs, but it also happened on all the `start_urls`:
{
"textPayload": "2021-07-29 10:09:56 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://mateuszdabrowski.pl/sitemap.xml> (failed 3 times): 503 Service Unavailable\n",
"timestamp": "2021-07-29T10:09:56.832469069Z",
"severity": "ERROR"
}
Edit: I've deployed a new instance for your index; please check your email for the new credentials.
Deployed the website with the new `apiKey`. Fingers crossed :)
As for now, in the Dashboard I see: No indices yet (Search) and Access Restricted (Overview).
Hi @MateuszDabrowski,
Same 503 error; please try to deploy it on something like Netlify so we can identify the problem. Thanks!
(No indices etc. will be created if we aren't able to crawl your website; that's why you don't see anything in the Dashboard.)
EDIT: I checked Cloudflare again, and now I can see requests coming through (from a different IP and with a different User Agent than the ones mentioned in your documentation).
Can you provide any more details on the 503 error? A screenshot or other data that could help understand which system is returning it?
The user agent seems to be the same as the one in the documentation: https://docsearch.algolia.com/docs/config-file/#user_agent-optional
Snippet of the errors; the same goes for all the `start_urls` in your config:
2021-07-30 10:26:39.109 CEST [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET http://mateuszdabrowski.pl/docs/> (failed 3 times): 503 Service Unavailable
2021-07-30 10:26:39.109 CEST [mateuszdabrowski] ERROR: Http Status:503 on http://mateuszdabrowski.pl/docs/
2021-07-30 10:26:39.109 CEST [mateuszdabrowski] ERROR: Alternative link: https://mateuszdabrowski.pl/docs/
2021-07-30 10:26:39.292 CEST [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://mateuszdabrowski.pl/docs/> (failed 3 times): 503 Service Unavailable
2021-07-30 10:26:39.292 CEST [mateuszdabrowski] ERROR: Http Status:503 on https://mateuszdabrowski.pl/docs/
Indeed, it seems I was looking at the standard Algolia User-Agent and got them mixed up. Either way, both are whitelisted with the current configuration.
I also got a response from the GitHub Pages Engineering team:
> Our engineers double-checked on this and verified that they are not seeing any requests from Algolia hitting our CDN, though we do see traffic from the Bing crawler.
> Please ask Algolia for more details about the error, such as logs for the response headers + payload.
Do you have any more detailed logs than the ones above? Especially an HTML response body that could help identify the error source, as per this document: https://support.cloudflare.com/hc/en-us/articles/115003011431-Troubleshooting-Cloudflare-5XX-errors#503error ?
In the meantime, I will also ask Cloudflare for logs to see whether, despite the allowlist, there might be some other issue on that side. But for that, I will need the above response body.
@shortcuts Good news: it seems that the index is now available and the search works again. Bad news: results seem quite inconsistent.
For example: searching for `TOP` doesn't return results from the SQL Select article that describes it (with TOP in an h2).
Searching for `SELECT` also doesn't return the SQL Select page, only other SQL documentation pages.
`DISTINCT` doesn't return any results, despite having its own h2.
On the other hand, `Template` correctly returns the SSJS Script Template article that was added just yesterday.
Similarly, `AMPScript in SSJS` doesn't return the page with that exact title, only other pages that link to it.
Hey @MateuszDabrowski, it looks like adding the `js_render` option to your config solved your issue. We now start the crawl with a Selenium instance, which seems to bypass the protection.
For the missing content, in case it is already indexed (see the `selectors` in your config), see this part of the doc: https://docsearch.algolia.com/docs/config-file/#js_wait-optional
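For reference, a sketch of how those options sit together in the same config file; the `js_wait` value below is an arbitrary illustrative number of seconds, not taken from the actual config:

```json
{
  "index_name": "mateuszdabrowski",
  "start_urls": ["https://mateuszdabrowski.pl/docs/"],
  "js_render": true,
  "js_wait": 2
}
```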
Feel free to update your config and open a pull request, I'll review it!
Ok, fingers crossed the next run will pick up the missing pages, as the general config looks good. I added some elements based on the documentation that might help improve the results and stability with Cloudflare: https://github.com/algolia/docsearch-configs/pull/4431
Cool! I'm closing this issue then; feel free to let me know if anything else comes up.
I see the crawl was triggered after the merge, and it seems it went well for all linked files but was blocked with a JS Challenge for all page views.
It seems like Bot Fight Mode is overzealous and overrides even the allowlist.
I'm switching it off to check whether this will finally solve all the issues (perhaps even without `js_render` then).
I can confirm that in the next crawl after Bot Fight Mode was switched off, all data was picked up. So the next test would be to merge the non-`js_render` config (as it was working in the past) to lower the overhead on the Algolia side.
Nice! Just merged it, let me know :) and thanks for investigating
On my side, in Cloudflare, I see all crawls with the Allow state and the search working correctly on the page. Unless you see any 503 on your side, it looks like all is finally good :)
To sum up the issue:
If you are using Algolia and Cloudflare, Bot Fight Mode might block the Algolia Crawler (fully for the standard one and, to some extent, also for the `js_render` one). This happens even if the Algolia Crawler is correctly allowlisted in the Firewall Rules, as Bot Fight Mode is applied after them and blocks even allowed crawlers. Currently, there is no option to have Bot Fight Mode enabled and the Algolia Crawler fully working.
Thank you so much for the support and debugging the issue!
Ahoj,
Yesterday I found out that my Algolia Dashboard got completely reset - all historical data was lost, no new data was available, and the interface was asking me to create an index (which didn't work, as within DocSearch I don't seem to have permission to do it).
I sent a message to support, and while I did not get any response, today I see that the index has been recreated in my Algolia Dashboard (the historical data is still not available).
However, it still doesn't collect data, and the search on the website still doesn't work (it shows past searches, but any new search shows no suggestions).
I'm using Docusaurus v2, and the Algolia configuration hasn't changed.