Unable to crawl the new release of the website

algolia / docsearch

:blue_book: The easiest way to add search to your documentation.

https://docsearch.algolia.com

MIT License

3.93k stars, 381 forks

Unable to crawl the new release of the website #1858

Closed Prarthanav closed 2 months ago

Prarthanav commented 1 year ago

Description

We implemented DocSearch for our open source website, hyperswitch.io/docs. The initial indexing was done with the default settings. We recently deployed a new version of the website with a new UI and some additional content, and we also customised the recordProps. When we tried to restart the crawler, no records were indexed. The error reported was: startUrl is being ignored. We are unable to crawl the new version of the website, and we cannot work out what is causing the issue in the first place.

Environment

OS: [e.g. Windows / Linux / macOS / iOS / Android]
Browser: [e.g. Chrome, Safari]
DocSearch version: [e.g. 3.0.0]

shaneafsar commented 1 year ago

Hi there! We received your support request through email and will get back to you soon.

sbellone commented 1 year ago

Hi, your website is blocking requests coming from some specific IPs. Here is the result when trying to access it from an OVH IP, for example:

$ curl -I https://hyperswitch.io/docs/
HTTP/2 404 
content-type: text/html
content-length: 13951
date: Tue, 25 Apr 2023 08:11:06 GMT
last-modified: Fri, 21 Apr 2023 10:06:18 GMT
etag: "793d1556f4ef67145603870aefb1fca7"
x-amz-server-side-encryption: AES256
x-amz-meta-deployment-id: 2023-04-21-8ff5ce9f2b840649f248d405ab4157ff8cc9515e
accept-ranges: bytes
server: AmazonS3
vary: Accept-Encoding
x-cache: Error from cloudfront
via: 1.1 c2015c52d38ccde0fdca03737208f710.cloudfront.net (CloudFront)
x-amz-cf-pop: MXP64-C1
x-amz-cf-id: ICL4hwz3ORqKjVEwh4x6I9eM-dgG-LMDkJnqwhNKO3WexI6EWZkK5g==

You should allow the crawler IP to access your website.
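
For anyone repeating the check above, here is a minimal sketch that pulls the status code out of `curl -I` output so it can be compared across machines. `status_from_headers` is a hypothetical helper name, not part of curl or DocSearch, and the result only reflects the IP you run it from, so a 200 from your own network does not prove the crawler gets the same response.

```shell
#!/bin/sh
# Hypothetical helper: read `curl -I` output on stdin and print the
# HTTP status code found on the first line (e.g. "HTTP/2 404").
status_from_headers() {
  # The status code is the second whitespace-separated field of the
  # status line; strip the carriage return that HTTP headers carry.
  awk 'NR == 1 { print $2 }' | tr -d '\r'
}

# Example usage (needs network access; the result depends on your IP):
# curl -sI https://hyperswitch.io/docs/ | status_from_headers
```

Running this from a server in the same network as the crawler, if you have access to one, would be closer to what the crawler actually sees.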

randombeeper commented 2 months ago

Closing this issue as the website in question no longer exists.

sbellone commented 2 months ago

It's been moved to a subdomain but it still exists:

$ curl https://hyperswitch.io/docs/ -I
HTTP/2 301 
server: CloudFront
date: Thu, 11 Jul 2024 07:20:58 GMT
content-length: 0
location: https://docs.hyperswitch.io/

But it's good to close nevertheless, since there was never any follow-up.
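
For reference, a small sketch that extracts the redirect target from `curl -I` output such as the 301 response above. `location_from_headers` is a hypothetical helper name used only for illustration. It assumes a single `location` header and matches the header name case-insensitively.

```shell
#!/bin/sh
# Hypothetical helper: read `curl -I` output on stdin and print the
# value of the Location header, if one is present.
location_from_headers() {
  # Header names are case-insensitive, so compare in lowercase;
  # strip the carriage return that HTTP headers carry.
  awk 'tolower($1) == "location:" { print $2 }' | tr -d '\r'
}

# Example usage (needs network access):
# curl -sI https://hyperswitch.io/docs/ | location_from_headers
```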