algolia / docsearch-configs

DocSearch - Configurations
https://docsearch.algolia.com/

[mateuszdabrowski] Search doesn't return any results #4315

Closed MateuszDabrowski closed 3 years ago

MateuszDabrowski commented 3 years ago

Ahoj,

Yesterday I found out that my Algolia Dashboard got completely reset: all historical data was lost, no new data was coming in, and the interface was asking me to create an index (which didn't work, since within DocSearch I don't seem to have permission to do it).

I sent a message to support and, while I did not get a response, today I see that the index has been recreated in my Algolia Dashboard (historical data is still not available).

However, it still doesn't collect data, and the search on the website still doesn't work (it shows past searches, but new queries return no suggestions).

I'm using Docusaurus v2, and the Algolia configuration hasn't changed.

shortcuts commented 3 years ago

Hi,

(I've tried to make your index work again so this is why you can see an "empty" index)

Our crawler gets 503 errors when trying to reach your website. Are there any restrictions on your side? Do you know how long it has been down?

MateuszDabrowski commented 3 years ago

Unfortunately, I don't know how long it hasn't been working.

I haven't implemented any restrictions since the time it was working.

The only non-content change was updating the Docusaurus version from alpha to beta, but from what I can see, their page has a working search. The only difference I can see in their config is the contextualSearch: true option, which I don't have in my config at the moment.
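For context, the relevant Algolia block in my docusaurus.config.js looks roughly like this (the API key is a placeholder here), so adding contextualSearch would be the only change:

  // docusaurus.config.js (illustrative snippet, key redacted)
  module.exports = {
    themeConfig: {
      algolia: {
        apiKey: 'SEARCH_ONLY_API_KEY',
        indexName: 'mateuszdabrowski',
        // contextualSearch: true, // the one option their config has and mine doesn't
      },
    },
  };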

shortcuts commented 3 years ago

On our side, the crawl job is up and running, but it gets 503 errors on every page of your website, which is likely related to some configuration on your server side.

(I haven't found a similar issue on any other DocSearch index yet)

MateuszDabrowski commented 3 years ago

For hosting, I'm using GitHub Pages and haven't changed anything related to it. If no other GitHub Pages users have reported similar issues, there shouldn't be any backend change on that hosting.

The only change that I proactively made was the Docusaurus version bump, but I don't recall any breaking changes related to Algolia, nor do I see any new info in the documentation regarding integration between those tools.

Do you recall any such issues, @slorber ?

MateuszDabrowski commented 3 years ago

@shortcuts I just found out that not only did my index disappear and get re-added today - there is also a completely new API key displayed in the Algolia Dashboard. It seems that whatever deleted my index also reset the API configuration. I also see there is an Application ID available - should I also use it with DocSearch (it wasn't needed in the past)?

shortcuts commented 3 years ago

The API key you see in the dashboard is the Analytics one; your search API key didn't change, and they were both generated around the same time: September 9th, 2020, at 11:07.

DocSearch users share the same application ID, which is BH4D9OD16A and doesn't need to be set in the frontend library, as it is the default value.

shortcuts commented 3 years ago

Update: I found another index with a 503 issue; both crawls work locally but not in production. I'll reach out to them to see if they are also using GitHub Pages.

slorber commented 3 years ago

As far as I understand:

The prod site looks up and running; I can't tell why the crawler sees 503s.

Note: trailingSlash: true has been added recently - maybe it's the cause of the crawling issues? Canonical URLs have changed and now include a trailing slash.
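For reference, it's a top-level option in docusaurus.config.js, roughly:

  // docusaurus.config.js (illustrative snippet of the relevant top-level options)
  module.exports = {
    url: 'https://mateuszdabrowski.pl',
    trailingSlash: true, // recently added; canonical URLs now end with a slash
  };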

MateuszDabrowski commented 3 years ago

Thank you @slorber!

Indeed, trailingSlash was added recently as per the recommendation in the Docusaurus documentation. The true setting was selected to work correctly with GitHub Pages (false was breaking all old links).

I suppose it could be an issue for historical data, but it shouldn't be the reason for the current 503 errors.

MateuszDabrowski commented 3 years ago

FYI - Algolia Dashboard now shows the searches (so the Events are sent correctly again), but no search results are displayed.

shortcuts commented 3 years ago

Hi @MateuszDabrowski,

FYI - Algolia Dashboard now shows the searches (so the Events are sent correctly again), but no search results are displayed.

This is related to my first message

(I've tried to make your index work again so this is why you can see an "empty" index)

Regarding the 503 error, the other index that was facing this issue got crawled yesterday without any changes from either side. 🤔

The crawl job is still up for your website and ran around an hour ago, but failed for the same reason.

MateuszDabrowski commented 3 years ago

Ahoj @shortcuts,

I have commented out the trailingSlash config to bring it back to the original setup. Could you trigger a crawl to check whether it solves the 503?

MateuszDabrowski commented 3 years ago

Crawl was triggered half an hour ago, and the index did not rebuild even without the trailingSlash configuration.

shortcuts commented 3 years ago

Hi @MateuszDabrowski, I'm not sure where you've found this information but the last crawl was triggered 19 hours ago.

If it can't wait, I can indeed trigger a new crawl - please let me know.

MateuszDabrowski commented 3 years ago

Ahoj @shortcuts,

I checked the Dashboard, but I suppose that in this case the Analytics index is separate from the crawl index.

[Screenshot: the Analytics index in the Algolia Dashboard]

shortcuts commented 3 years ago

It indeed is!

A new crawl ran 5 minutes ago and still got a 503 error.

MateuszDabrowski commented 3 years ago

So it shouldn't be related to Docusaurus, as there were no other changes from my side, and per your and @slorber's information, other Docusaurus sites are not impacted.

Can it be due to Cloudflare? Again, I haven't changed anything in its configuration for a long time, but maybe there is an option to allowlist the crawler in the firewall settings?

What else could be done - is there any sense in deleting the Algolia account and creating a new one?

shortcuts commented 3 years ago

Hey @MateuszDabrowski,

Can it be due to Cloudflare? Again, I haven't changed anything in its configuration for a long time, but maybe there is an option to allowlist the crawler in the firewall settings?

It definitely could, but I assume some of our users might have a configuration similar to yours. In case you need to whitelist a user agent: we use Algolia DocSearch Crawler for our scraper, and you can also override it by adding user_agent: "foo" to your config file.

What else could be done - is there any sense in deleting the Algolia account and creating a new one?

The Algolia account only receives the retrieved records, so recreating it won't have any effect on this.

You could try hosting your website on Netlify and updating your config file with the new start_urls, to see whether it's related to your hosting solution.
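Both of those are plain keys in the DocSearch config file; a minimal sketch of the relevant part (values shown are only examples - start_urls would point at whatever host you want to test, and the user agent shown is our default):

  {
    "index_name": "mateuszdabrowski",
    "start_urls": ["https://mateuszdabrowski.pl/docs/"],
    "user_agent": "Algolia DocSearch Crawler"
  }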

MateuszDabrowski commented 3 years ago

Allowlisted the crawler in Cloudflare and asked GitHub Support to validate whether there was any block deployed from their side.

[Screenshot: Cloudflare firewall rule allowlisting the Algolia crawler]
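The rule is essentially a user-agent match with an Allow action - roughly this (the expression is an approximation, not a verbatim copy of my rule):

  Expression: (http.user_agent contains "Algolia DocSearch Crawler")
  Action: Allow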

If this doesn't help, I will deploy to Netlify and check whether it changes anything.

shortcuts commented 3 years ago

Hey @MateuszDabrowski, a new crawl was started 20 hours ago and failed for the same reason :( I'm not really sure what the issue is, but I guess deploying on Netlify will fix it

MateuszDabrowski commented 3 years ago

Ahoj @shortcuts,

Sorry for the long time without a response, but I was checking a few things:

  1. Update to Docusaurus Beta 4 did not solve the issue
  2. GitHub Pages engineers searched the logs for requests made to mateuszdabrowski.pl and "did not find any instances where we returned a 503 error. Additionally, we were unable to see any requests in the past 14 days coming from the Algolia crawler (identified using their documented user agent)". Their summary is: "Based on these findings, we would recommend verifying with Algolia that their integration is working. If it is, please provide us with a date/time when their crawler received a 503 status from us and we can investigate further."

Based on the above, it seems that the issue is either on the Algolia or the Cloudflare side.

For now, I have updated the Cloudflare firewall rule from the previous one to a broader one:

[Screenshot: the broader Cloudflare firewall rule]

The previous configuration wasn't triggered:

[Screenshot: Cloudflare firewall events showing the previous rule was never triggered]

shortcuts commented 3 years ago

Hey @MateuszDabrowski

Here's a snippet of the last error returned; this is for one of the URLs, but the same happened for all the start_urls:

  {
    "textPayload": "2021-07-29 10:09:56 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://mateuszdabrowski.pl/sitemap.xml> (failed 3 times): 503 Service Unavailable\n",
    "timestamp": "2021-07-29T10:09:56.832469069Z",
    "severity": "ERROR"
  }

Edit: I've deployed a new instance for your index; please check your email for the new credentials.

MateuszDabrowski commented 3 years ago

Deployed the website with the new apiKey. Fingers crossed :)

As of now, in the Dashboard, I see: No indices yet (Search) and Access Restricted (Overview).

MateuszDabrowski commented 3 years ago

As of now:

  1. Search is not working at all (previously it was showing no results, now it's just spinning infinitely)
  2. There is still no index found in Algolia Dashboard
  3. Cloudflare still hasn't seen any requests coming through the broadened allowlist rule.

shortcuts commented 3 years ago

Hi @MateuszDabrowski,

Same 503 error, please try to deploy it on something like Netlify so we can identify the problem. Thanks!

(no indices etc. will be created if we aren't able to crawl your website, that's why you don't see anything in the Dashboard)

MateuszDabrowski commented 3 years ago

EDIT: Checked Cloudflare again, and now I can see a request coming through Cloudflare (from a different IP and with a different User-Agent than the ones mentioned in your documentation):

[Screenshot: Cloudflare firewall events showing the crawler request with a different IP and User-Agent]

Can you provide any more details on the 503 error? A screenshot or other data that could help identify which system is returning it?

shortcuts commented 3 years ago

The user agent seems to be the same as the one in the documentation: https://docsearch.algolia.com/docs/config-file/#user_agent-optional

Snippet of the errors, same goes for all the start_urls in your config:

2021-07-30 10:26:39.109 CEST [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET http://mateuszdabrowski.pl/docs/> (failed 3 times): 503 Service Unavailable
2021-07-30 10:26:39.109 CEST [mateuszdabrowski] ERROR: Http Status:503 on http://mateuszdabrowski.pl/docs/
2021-07-30 10:26:39.109 CEST [mateuszdabrowski] ERROR: Alternative link: https://mateuszdabrowski.pl/docs/
2021-07-30 10:26:39.292 CEST [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://mateuszdabrowski.pl/docs/> (failed 3 times): 503 Service Unavailable
2021-07-30 10:26:39.292 CEST [mateuszdabrowski] ERROR: Http Status:503 on https://mateuszdabrowski.pl/docs/

Crawling issue: nbHits 0 for mateuszdabrowski

MateuszDabrowski commented 3 years ago

Indeed, it seems I was looking at the standard Algolia User-Agent and got that mixed up. Either way, both are whitelisted with the current configuration.

I also got a response from the GitHub Pages Engineering team:

Our engineers double-checked on this and verified that they are not seeing any requests from Algolia hitting our CDN, though we do see traffic from the Bing crawler.

Please ask Algolia for more details about the error, such as logs for the response headers + payload.

Do you have any more detailed logs than the ones above?

In particular, do you have the HTML response body? It could help identify the error source, as per this document: https://support.cloudflare.com/hc/en-us/articles/115003011431-Troubleshooting-Cloudflare-5XX-errors#503error

In the meantime, I will also ask Cloudflare for logs to see whether, despite the allowlist, there might be some other issue on that side. But for that, I will need the above response body.

MateuszDabrowski commented 3 years ago

@shortcuts Good news: it seems that the index is now available and the search works again. Bad news: results seem quite inconsistent.

For example: searching for TOP doesn't return results from the SQL Select article that describes it (with TOP in an h2). Searching for SELECT also doesn't return the SQL Select page, only other SQL documentation pages. DISTINCT has no results at all despite having its own h2.

On the other hand, Template correctly returns the SSJS Script Template page that was added just yesterday.

Similarly, AMPScript in SSJS doesn't return a page with the same title, but only other pages that link to it.

shortcuts commented 3 years ago

Hey @MateuszDabrowski, it looks like adding the js_render option to your config solved your issue. We now start the crawl with a Selenium instance, which seems to bypass the protection.

For the missing content, in case it is already indexed (see the selectors in your config), see this part of the doc: https://docsearch.algolia.com/docs/config-file/#js_wait-optional
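For reference, both options go directly into the config file - a rough sketch (the js_wait value here is just an example):

  {
    "index_name": "mateuszdabrowski",
    "js_render": true,
    "js_wait": 2
  }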

Feel free to update your config and open a pull request, I'll review it!

MateuszDabrowski commented 3 years ago

Ok, fingers crossed the next run will pick up the missing pages, as the general config looks good. I added some elements based on the documentation that might help improve the results and stability with Cloudflare: https://github.com/algolia/docsearch-configs/pull/4431

shortcuts commented 3 years ago

Cool! I'm closing this issue then, feel free to let me know if there's any issue.

MateuszDabrowski commented 3 years ago

I see the crawl was triggered after the merge, and it seems it went well for all linked files but was blocked with a JS Challenge for all page views.

[Screenshot: Cloudflare firewall events showing JS Challenge blocks on page views]

Seems like Bot Fight Mode is overzealous and overrides even the allowlist.

Switching it off to check if this will finally solve all the issues (perhaps even without js_render then).

MateuszDabrowski commented 3 years ago

I can confirm that after Bot Fight Mode was switched off, the next crawl picked up all the data. So the next test would be to merge the non-js_render config (as it was working in the past) to lower the overhead on the Algolia side.

shortcuts commented 3 years ago

Nice! Just merged it, let me know :) and thanks for investigating

MateuszDabrowski commented 3 years ago

On my side, in Cloudflare, I see all crawls with the Allow state, and search is working correctly on the page. Unless you see any 503s on your side, it looks like everything is finally good :)

To sum up the issue:

If you are using Algolia and Cloudflare, Bot Fight Mode might block the Algolia crawler (fully for the standard one and, to some extent, also the js_render one). This happens even if the Algolia crawler is correctly allowlisted in Firewall Rules, as Bot Fight Mode is applied after them and blocks even allowed crawlers. Currently, there is no way to have Bot Fight Mode enabled and the Algolia crawler fully working.

Thank you so much for the support and debugging the issue!