gatsbyjs / gatsby

The best React-based framework with performance, scalability and security built in.
https://www.gatsbyjs.com
MIT License
55.26k stars 10.3k forks source link

Request: Clear out public folder on build #529

Closed jlyman closed 6 years ago

jlyman commented 8 years ago

Seems like it would be nice if Gatsby would clear out the public folder before generating new assets so that we are guaranteed a fresh and accurate copy of the site on each gatsby build.

The specific case where I ran into this and wished it worked this way was when I looked and found an extraneous manifest/index.html folder/file in the site, but couldn't figure out why. I realized I had added a manifest.json file for Android icon support, so I then moved it out but the manifest folder remained until I blew it away manually.

Following the philosophy of React's model of rebuilding structures from scratch every time, I think this would lead to less surprises and old/orphaned pages.

KyleAMathews commented 8 years ago

Curious what other's thoughts are?

This would be a breaking change as some people might have adopted workflows depending on the public folder being there. Adding a "static" directory at the same which has files that get copied over verbatim might be a good plan https://github.com/gatsbyjs/gatsby/issues/450

jlyman commented 8 years ago

FWIW, the suggestions in #450 are exactly what I followed, and where my manifest.json file, amongst others, currently lives.

KyleAMathews commented 8 years ago

@jlyman ah cool! Yeah, it seems like a really common need and a static folder a really common convention so we should bless that in core.

ivanoats commented 8 years ago

I think that's especially important because some people (like me) may like .gitignore public.

stale[bot] commented 7 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

jmknoll commented 7 years ago

Just to weigh in in case this is still under consideration...

I would tend to agree with the original poster - in addition the the benefits of of eliminating orphaned pages, a clean public folder also allows for a one-step build process, e.g. gatsby build && aws s3 cp public/ my-s3-bucket, without pushing an ever-growing bundle to a remote server.

KyleAMathews commented 6 years ago

We ended up deleting only html files on build https://github.com/gatsbyjs/gatsby/issues/1811

@jmknoll perhaps use an s3 client which does a sync upload?

mxlje commented 6 years ago

I just ran into this and think it is worth discussing again.

In my opinion the public folder should contain generated code and generated code only, so that it is safe to add it to .gitignore. Changes to the public folder have to be overwritten during the next build.

As was already mentioned, the static folder exists to copy static files (like maybe robots.txt, or a favicon) to the final build without modification.

Because gatsby doesn’t clean up the entire public folder (only HTML & CSS files), all the generated JS files for components and pages and whatnot stay there and the folder grows and grows over time. Using an upload tool that syncs a local folders to a remote is therefore fairly useless as nothing ever gets deleted. The remote server grows and grows. Yes, technically it is in sync with the local folder, but the local folder contains stale files.

When running this on a CI system this is not an issue as you start with a clean build every time, but it is an issue when running builds locally and then deploying (syncing) from there.

baba43 commented 6 years ago

I agree, that this feature would be nice to have, but it is also very simple for deverlopers to add:

// gatsby-node.js
const rimraf = require('rimraf');

const PUBLIC_FOLDER = `${__dirname}/public`;

exports.onPreBuild = () => {
  // empty /public folder
  rimraf.sync(PUBLIC_FOLDER + '/*');
};
jlyman commented 6 years ago

I'm kind of thinking of it from the perspective of the principle of least surprise. There's already starting to be an established pattern for this (only HTML files are deleted), but it seems like, without consulting the docs, this might be a "surprising" outcome.

Seems like the principle of least surprise would just be that a static site generator would generate a fresh and whole copy of the static site on each run, without selectively removing some files and not others?

I agree with @baba43 that it's an easy feature to add, but it still was a surprise to learn it didn't operate that way to start. My two cents.

pieh commented 6 years ago

Keep in mind that deleting public directory will cause regenerating responsive images when using gatsby-ransformer-sharp with every build. Right now plugin can skip that if image is already there.

jlyman commented 6 years ago

Totally shooting from the hip here, but maybe @pieh's case of cached resources would be a good situation where a separate .cache folder might be appropriate, similar to Webpack's caching strategy? Downside would be increased storage requirements for the building computer, but you might be able to resist regenerating all images or other resource-intensive operations while still maintaining a prod-ready public directory.

mxlje commented 6 years ago

I agree with @jlyman. While it certainly is easy to implement, it’s not how code generation tools are generally expected to behave.

The case of image transformation brought up by @pieh is certainly interesting. Maybe gatsby could delete everything by default but expose an API for plugins to mark certain output folders as „sticky“ during the build? These plugins would then have to implement a better algorithm to remove files that are no longer needed though.

I‘m not necessarily arguing for a „dumb“ rm -rf, but instead would mainly like builds to be more predictable. After a successful build the public folder should mirror the site exactly, not more and not less.

pieh commented 6 years ago

We could also add some logic of removing stale files (we would need to keep track what files are created during the build). But there are also some edge cases where having stale files is not a bad thing - consider situation: User visits your page and is browsing through it, while he is on your page you push update with new files and deleting old ones - now user who keeps browsing will have broken site, because files referenced in previous build that he has loaded are not there.

jlyman commented 6 years ago

Interesting, I hadn't thought about the publish change mid-browsing. I suppose it depends on if the hash of the file changes or not (haven't looked, don't know). If the page is gone because it was purposefully removed, we wouldn't want the user to be able to continue browsing to it. But if the page is legitimately still there but the user can't navigate to it anymore because the hash on the filename has changed that's less optimal.

But I guess so is keeping around what amounts to any version of the site you've ever built, in an ever expanding folder. Trade-offs both ways.

pieh commented 6 years ago

Hash changes when content changes - that's cache busting mechanism to avoid using cached stale data. Just so we are clear - I'm not against doing any work on this, but we have to be smart about it. There are for example ways to reload users browser when we push update (simplest would be periodically polling very small file with version of current site, and if it changes we can trigger browser reload).

zimmi commented 5 years ago

I just ran into this and agree with all the reasons already stated why the public folder should be cleared.

From what I read, the reasons to keep previous files around are backward compatibility with workflows that depend on this behavior and the mid-browsing-deploy issue.

I would argue that a workflow that relies on local copies on the machine of a single developer is inherently error prone: switching machines, version control issues that require re-cloning or having colleagues will break this anyway. Migration is simple: don't delete the previous files on the server if that is important.

The current solution to the mid-browsing-deploy issue (keeping old files around) isn't used consistently in practice, as can be seen by starters deleting public before each build. So I would see this as a separate issue that requires a more targeted solution, like the one @pieh suggested. If that's a problem worth solving, then that should be it's own issue.

DSchau commented 5 years ago

@zimmi could you share your workflow and why you need this folder cleared?

We recently introduced a gatsby clean command that when invoked, will clear out the .cache and public folders.

Generally speaking, keeping the public folder around (for local builds) and adding to .gitignore is a good practice. It will make your local builds faster, because images and other content is cached there so that we don't regenerate files unnecessarily, which can be a heavy operation.

Upon deploying, it's reasonable that most solutions (e.g. a CI service, Netlify, etc.) will and should be given a clean slate environment (i.e. no cache, public, etc. folders) and so the issue is a moot point. Even working from a non-clean slate, I'm not entirely sure why you need the public directory cleared? Are you running into issues?

From Gatsby's perspective, we don't really have any plans to clear out the public folder on build. Happy to hear your opinions as to why this practice should change, though.

Thanks for chiming in!

zimmi commented 5 years ago

@DSchau thank you for the explanation, much appreciated. I also didn't know about gatsby clean, nice!

To be honest, I didn't run into any actual problems with this yet, I was just surprised by the folder growing after deleting code and running build again. For me it's just another thing to remember when building for production, as I have no CI system in place yet. So my workflow would be:

  1. Run develop and write code until site looks good
  2. Run build
  3. Run serve
  4. See if the built site still looks good
    1. If not, go back to 1
    2. If yes, remember to delete public and run build again (possibly also serve out of paranoia)
  5. deploy

So I find it easier to just do a clean build than trying to remember if I deleted public or not.

Keeping the files around in public for performance reasons makes sense when this is done often, but isn't that what develop is for? I usually only run build when I'm reasonably sure that the site is done, but maybe there are use cases I'm not aware of? Like I see it, build could also be called release, then it would be obvious that the folder should always be clean.

I hope that clarifies my standpoint a little.

Edit: I just noticed that running develop also creates the public folder. I would have expected all development artifacts to be inside .cache and only the actual final build output to be inside public. Does that mean I have to delete public for a clean release every time, or will those "development-files" be cleaned up before running build?

DSchau commented 5 years ago

@zimmi I think I'd recommend just shifting your perception on this a little bit, if you're able!

For me it's just another thing to remember when building for production, as I have no CI system in place yet

That's the thing--you don't have to remember to do this, and there's really no reason to--especially because you haven't/likely won't run into issues.

Static content is cheap (you can host gigabytes of data for pennies), so a growing public folder (from old builds) is generally not a huge concern.

Additionally, to respond to this question:

Keeping the files around in public for performance reasons makes sense when this is done often, but isn't that what develop is for

Well - sorta. But you also get the benefits of caching if the public folder is already populated from develop. In other words, if you've already been using gatsby develop a build will be faster.

Like I see it, build could also be called release, then it would be obvious that the folder should always be clean.

Maybe! If this is a huge concern for you, you can do something like this, but I'll note that I wouldn't really recommend doing this

{
  "scripts": {
    "prerelease": "gatsby clean",
    "release": "npm run build"
  }
}

Hopefully this is helpful! Gatsby is flexible enough here to do whatever you want--but I'd encourage you to try and avoid removing the public folder and manually deleting things when you don't have to. Let us give you faster builds!

zimmi commented 5 years ago

Well, I guess it's just a philosophical difference then. I would rather have reproducible builds that only depend on the inputs than save a couple seconds / minutes on build time. I still believe that's the better default. That can always be optimized when the page grows so large that it matters, Gatsby is really fast (for me at least).

One potential issue is deleting sensitive information that was accidentally published. After a rebuild it wouldn't be accessible anymore by browsing, but it would still be accessible online. Much the same situation as API-keys in repositories on GitHub: you need to purge the history.

muescha commented 5 years ago

I also vote for a principle of least surprise.

public should be always fresh generated

gatsby clean not solve this because it clears also the .cache

I vote for an opt-out of the public clean if some like to preserve some files.

odinho commented 4 years ago

This caused me some issues, as we have several static site generators (moving to gatsby) and they all think they own public folder. So gatsby was deleting files it didn't own. This is one workaround:

gatsby-node.js

const fs = require("fs-extra");

const tmpPublic = "public.prevent_gatsby_delete";
exports.onPreInit = () => {
    if (fs.pathExistsSync("public")) {
        fs.moveSync("public", tmpPublic, { overwrite: true });
    }
};

exports.onPreBootstrap = () => {
    if (fs.pathExistsSync(tmpPublic)) {
        fs.moveSync(tmpPublic, "public", { overwrite: true });
    }
};
muescha commented 4 years ago

maybe then it is better if every generator has it own public dir and in the deploy step you copy over all the files to an final public folder?

odinho commented 4 years ago

maybe then it is better if every generator has it own public dir and in the deploy step you copy over all the files to an final public folder?

Indeed. But there's no way to configure output-dir in Gatsby as far as I saw by reading the source. Could have the entire projects inside different folders, but that's not A+ either. It'd also be possible to run mv public public.gatsby after each "step" so to say as well. That does feel nicer, but on the other hand it could lead to some gotchas if you don't use that npm script.

Anyway, we're aiming for this to be temporary.

pieh commented 4 years ago

I agree, that this feature would be nice to have, but it is also very simple for deverlopers to add:

// gatsby-node.js
const rimraf = require('rimraf');

const PUBLIC_FOLDER = `${__dirname}/public`;

exports.onPreBuild = () => {
  // empty /public folder
  rimraf.sync(PUBLIC_FOLDER + '/*');
};

For anyone reading this currently - please don't use this solution - it leads to problems like https://github.com/gatsbyjs/gatsby/issues/25289

If you absolutely must clear public - you also need to clear .cache directory as well (we have now gatsby clean command for doing that) - and you need to do this before running gatsby build, because onPreBuild runs too late (some files are already outputted to public dir and those get deleted) - so command like gatsby clean && gatsby build is best bet

Aditya94A commented 4 years ago

Gatsby should ensure it's never in a situation where such errors can occur.

I just deleted the public folder because it was overflowing with junk, thinking it'd just get rebuilt. But nope, a completely unrelated error and hours of googling leads to this issue that says that .cache has to be deleted too if one is going to delete public folder.

If empty public folder + non-empty cache doesn't work in any situation, why can't gatsby detect that and rebuild cache as well? And why not just do some cleaning before/after every build so the user never has to bother in the first place?

roffelsaurus commented 4 years ago

The dependency between .cache/ and public/ is an incredibly annoying one. If one attempt to set up CI for this the choice seems to be between fast builds (keep .cache and public) or a public/ folder that does not increase into eternity.

I had to go for slow builds because the size requirement of the artifact is a stringent one. And this unfortunately also leaves the clever EXPERIMENTAL_PAGE_BUILD thing out of reach because that obviously depends on caching.

My wish was that there was a kind of selective "post"-cleaning of public/ which only left the most recent build. Since public/ either has hashed js or replaced static filenames, that seems possible.

quantizor commented 4 years ago

Literally all they need to do is store the JSON and such in .cache and copy things over to public as needed. It's crazy how long this issue has been open.