[🐞] 404 in Search Console reported for /~partytown/partytown-sandbox-sw.html?{timestamp}

MTheProgrammer commented 8 months ago

Describe the bug

Hello, other people mentioned this problem, but they couldn't reproduce the bug:

In Google Search Console, every crawl reports new 404 pages for the /~partytown/partytown-sandbox-sw.html?XXX URL:

Reproduction

https://lucidmodules.com/~partytown/partytown-sandbox-sw.html?1706003192708

Steps to reproduce

This might vary depending on whether you've already visited this page and the worker has been installed. However, after you clear the browser cache, it should go as follows.

go to page https://lucidmodules.com/~partytown/partytown-sandbox-sw.html?1706003192708
open the dev console and hard reload (in Chrome on Linux this is Ctrl+Shift+R)
page returns 404
soft reload - this time page shows the worker content

404 page on hard reload/first time download:

The correct page returned after the worker has been installed: partytown worker content

Browser Info

Chrome

Additional Information

Maybe adding <meta name="robots" content="noindex,nofollow"> to the head would stop Googlebot from trying to index the partytown-sandbox-sw.html page.

gioboa commented 8 months ago

This solution makes sense to me. I don't think there is any specific reason to index this HTML. Are you able to change your HTML file and verify that it works correctly?

MTheProgrammer commented 8 months ago

I've forked the repo and updated the code that generates the html: https://github.com/BuilderIO/partytown/commit/506199790630557bfab403399ebd3f258ab641e5

In Gatsby, these files are copied from the Partytown package to the static directory:

function setupPartytown() {
  const path = require("path");
  const { copyLibFiles } = require("@builder.io/partytown/utils");

  exports.onPreBuild = async () => {
    await copyLibFiles(path.join(__dirname, "static", "~partytown"));
  };
}

I'll check GSC after a few days to verify whether this page is still being indexed.

MTheProgrammer commented 7 months ago

It doesn't seem to help:

gioboa commented 7 months ago

@MTheProgrammer I see 🤔 maybe it's the ~partytown folder. Can you try removing the ~ from the folder name, please?

MTheProgrammer commented 7 months ago

The official documentation uses the ~partytown directory: https://partytown.builder.io/gatsby#copy-library-files Do you mean changing the folder name everywhere it is used?

The page ~partytown/partytown-sandbox-sw.html is dynamic; the static folder contains only .js files:

My guess is that Googlebot crawls the page without any cache, and without the cache the request returns 404 because the Partytown worker has not been installed yet.
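One way to check this guess outside the browser is to request the URL from Node, where no worker is ever installed. This is only a sketch: statusWithoutWorker is a hypothetical helper name, and it assumes Node 18+ for the built-in fetch.

```javascript
// Hypothetical helper: returns the HTTP status the origin server sends
// when no worker can intercept the request (roughly what a crawler sees).
// fetchImpl is injectable so the helper can also be tried offline with a stub.
async function statusWithoutWorker(url, fetchImpl = fetch) {
  const response = await fetchImpl(url);
  return response.status;
}
```

Calling statusWithoutWorker("https://lucidmodules.com/~partytown/partytown-sandbox-sw.html?1706003192708").then(console.log) prints whatever status the server itself returns for that URL.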

Every page includes an iframe with a link to Partytown. However, rel="nofollow" is not a valid attribute on an iframe, unlike on an anchor tag such as <a href="www.example.com" rel="nofollow">

EDIT: I'm testing a hack: an otherwise empty physical ~partytown/partytown-sandbox-sw.html file containing a noindex,nofollow directive. Once the worker is ready, it returns the correct dynamic page.
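For anyone who wants to try the same hack, a minimal sketch of generating that placeholder file could look like the following. The helper names buildPlaceholderHtml and writePlaceholder are hypothetical and not part of Partytown; the file name is taken from the URL in this issue.

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical helper: minimal HTML that asks crawlers not to index
// or follow this URL. It contains nothing else on purpose.
function buildPlaceholderHtml() {
  return [
    "<!DOCTYPE html>",
    "<html>",
    "<head>",
    '<meta name="robots" content="noindex,nofollow">',
    "</head>",
    "<body></body>",
    "</html>",
    "",
  ].join("\n");
}

// Hypothetical helper: writes the placeholder into the directory that
// holds the copied Partytown lib files and returns the file path.
function writePlaceholder(partytownDir) {
  fs.mkdirSync(partytownDir, { recursive: true });
  const file = path.join(partytownDir, "partytown-sandbox-sw.html");
  fs.writeFileSync(file, buildPlaceholderHtml(), "utf8");
  return file;
}
```

In a Gatsby setup, writePlaceholder could be called in onPreBuild right after copyLibFiles, with the same static/~partytown directory. Whether this actually clears the 404s in GSC is still untested.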

gioboa commented 7 months ago

I see, great research. So I'm wondering how to serve a different, valid HTML page to the crawler while preserving the Partytown code in the HTML 🤔

f33w commented 5 months ago

Did adding noindex, nofollow for the script folder help? I'm facing the same issue in GSC @MTheProgrammer
