algolia / docsearch

:blue_book: The easiest way to add search to your documentation.
https://docsearch.algolia.com

Anchors are being stripped out (using `sitemaps`, `linkExtractor` and `externalData`) #1831

Open bojanrajh opened 1 year ago

bojanrajh commented 1 year ago

Description

We are using the Algolia Crawler UI to parse our mixed static HTML & SPA website (using a hash router). All URLs are provided through the `sitemaps` Crawler config.

new Crawler({
  startUrls: [],
  sitemaps: ["https://example.com/sitemap.xml"],
  // ...
})

Steps to reproduce

Use a sitemap with the following content:

<!-- ... -->
<url>
  <loc>https://example.com/page.html</loc>
  <changefreq>monthly</changefreq>
  <priority>0.6</priority>
</url>
<url>
  <loc>https://example.com/subpage.html#/foo</loc>
  <changefreq>monthly</changefreq>
  <priority>0.6</priority>
</url>
<url>
  <loc>https://example.com/subpage.html#/bar</loc>
  <changefreq>monthly</changefreq>
  <priority>0.6</priority>
</url>
<!-- ... -->

... or use a static `linkExtractor`:

new Crawler({
  // ...
  linkExtractor: () => {
    return [
      "https://example.com/page.html",
      "https://example.com/subpage.html#/foo",
      "https://example.com/subpage.html#/bar",
    ];
  },
  // ...
})

Then run the URL Tester.

Result:

LINKS
Found 2 links matching your configuration 
 - https://example.com/page.html
 - https://example.com/subpage.html

Expected behavior

Expected result:

LINKS
Found 3 links matching your configuration 
 - https://example.com/page.html
 - https://example.com/subpage.html#/foo
 - https://example.com/subpage.html#/bar

Note that those are not section anchors: they are actual pages, correctly parsed by the URL Tester with the `renderJavaScript: true` option when the full URL including the anchor is passed.
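
For reference, a minimal sketch of the kind of config we are testing with; only `renderJavaScript` matters here, the other options are illustrative:

new Crawler({
  startUrls: [],
  sitemaps: ["https://example.com/sitemap.xml"],
  // Render the SPA so the hash-routed pages (#/foo, #/bar) produce content.
  renderJavaScript: true,
  // ...
})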

Environment


shortcuts commented 1 year ago

Hey, thanks for opening the issue. https://github.com/algolia/docsearch/issues/1823 seems related.

I'll investigate whether there's a way for us to differentiate hash-routed pages from anchored sections.

bojanrajh commented 1 year ago

Thank you for the quick response! Just for clarity: we don't mind adding or implementing a custom `linkExtractor` or `recordExtractor` with a custom `objectID`. We just need those URLs to be accepted (crawling works as intended when the crawl is run manually from the UI).
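
To make that concrete, this is the kind of sketch we would happily maintain on our side, assuming a `recordExtractor` may set `objectID` to the full URL including the hash (the selectors and record shape below are illustrative, not our real config):

new Crawler({
  // ...
  recordExtractor: ({ url, $ }) => {
    // Keep the hash fragment so each SPA route becomes its own record.
    return [
      {
        objectID: url.href,
        url: url.href,
        title: $("h1").first().text(),
        content: $("main").text(),
      },
    ];
  },
  // ...
})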

bojanrajh commented 1 year ago

Hey @shortcuts, any news on this one?

Somewhat related: I tried to provide anchored URLs to the Crawler with `externalData: ['myCSV']`, as described in your docs, and those URLs were again stripped down to one.

Example CSV:

url;title;content
"https://example.com/subpage.html#/foo";"Foo";"Foo content"
"https://example.com/subpage.html#/bar";"Bar";"Bar content"

The result is a single URL under Crawler admin > External Data: https://example.com/subpage.html
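
For reference, this is roughly how the CSV is wired up (a sketch: registering the file follows the docs, while the `dataSources` usage in the extractor is my assumption about how the rows would be consumed):

new Crawler({
  // ...
  externalData: ["myCSV"],
  recordExtractor: ({ url, dataSources }) => {
    // Assumption: dataSources.myCSV holds the CSV row matched to the current
    // URL. With the hash stripped, #/foo and #/bar collapse onto subpage.html.
    const row = dataSources.myCSV;
    return row ? [{ objectID: url.href, ...row }] : [];
  },
  // ...
})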

I would have expected the same issue to appear with your JS API client, but I've just successfully created two objects containing URLs with anchors in our demo app (free plan, app ID BZSKX72NEG). However, I was not able to create an admin API key for our app (DOCSEARCH plan, app ID J1Y01X9HGM) because the "All API Keys" section/tab is missing. Using the Admin API key, I received error 400 - `Not enough rights to update an object near line:1`.
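
For completeness, this is roughly what I ran against the demo app (a sketch with the v4-style `algoliasearch` JS client; the index name and credentials are placeholders, and the record shape just mirrors the CSV columns above):

const algoliasearch = require("algoliasearch");

// Placeholder credentials, not the real demo app keys.
const client = algoliasearch("YourAppID", "YourAdminAPIKey");
const index = client.initIndex("anchored_urls_test");

index
  .saveObjects([
    {
      objectID: "https://example.com/subpage.html#/foo",
      url: "https://example.com/subpage.html#/foo",
      title: "Foo",
      content: "Foo content",
    },
    {
      objectID: "https://example.com/subpage.html#/bar",
      url: "https://example.com/subpage.html#/bar",
      title: "Bar",
      content: "Bar content",
    },
  ])
  .then(({ objectIDs }) => console.log(objectIDs));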

So, technically, my wild guess would be that your system supports anchored URLs and they are just not supported by the Crawler?

bojanrajh commented 10 months ago

Hey @shortcuts, any news about this one?