gatsbyjs / gatsby

The best React-based framework with performance, scalability and security built in.
https://www.gatsbyjs.com
MIT License
55.26k stars 10.31k forks source link

Guide plugin authors to write source plugins that correctly interact with the cache #11785

Closed johno closed 3 years ago

johno commented 5 years ago

This is related to #11747 as an isolated, reproducible case of caching issues. cc/ @pieh @DSchau

Description

Numerous plugins in the ecosystem aren't interlinking a node and the parent on onCreateNode which causes nodes in the cache to suddenly go missing. I stumbled across this issue working on a proof of concept with gatsby-source-instagram-all but I'm sure this affects plenty of other plugins.

It's not currently intuitive how to properly set up data sourcing in ways that interact well with the cache.

These issues are unfortunate because they create distrust in the cache (rightfully so). Logging warnings and helping to guide plugin authors towards the right path will help alleviate caching woes.

Steps to reproduce

Use a plugin like the reproduction and stop, then start, the server.

Expected result

Restarting the server and running the build should work.

Actual result

Restarting the server and running the build require clearing the cache.

Steps to fix

Currently, the steps to fix involves rm -rf .cache 🙀.

To solve the fundamental issue here will require two steps (I think)

• We should issue a warning in onCreateNode if there's no node passed as a parent and link to a doc that elaborates on the issue • We should have a doc for plugin authors that describe the Gatsby caching system and how nodes need to be interlinked to be properly cached and avoid errors

I'm anticipating that the doc should be in the style of a guide for plugin authors that might be sourcing nodes since the existing docs (that I can find) are more low-level but I'll defer to @marcysutton here.

Related

KyleAMathews commented 5 years ago

Oh interesting! Yeah, not setting a parent in onCreateNode when creating a node is a definite no no. I'm struggling to think of when that'd ever actually make sense... if we can't think of any, we should just hard error there (or maybe warn initially to not break people's sites).

KyleAMathews commented 5 years ago

Another thing we could check for is if all (or most, say 80%+) or a plugin's nodes are deleted startup due to being stale. That'd be easy to calculate and again should be extremely uncommon so we could issue a strong warning that the plugin probably has a bug.

pieh commented 5 years ago

Yeah, not setting a parent in onCreateNode when creating a node is a definite no no.

Screenshot transformer does that - in very similar way as linked plugin (because both use createRemoteFileNode that doesn't set parent). The difference is screenshot plugin also touches those file nodes so they don't get garbage collected. We definitely need to add ability to set parent to createRemoteFileNode helper.

KyleAMathews commented 5 years ago

Ah yeah, ok so it's a weakness in the helper. Oh and we also shouldn't then delete nodes that are linked to from other nodes 🤔

That's the real problem. File nodes don't need a parent. They just need not deleted. If I'm remembering the stale node algorithm correctly, we only check for parents.

KyleAMathews commented 5 years ago

This is a pretty major bug. The stale collection stuff is our GC step but since we're not collecting all references, we're deleting too much.

https://github.com/gatsbyjs/gatsby/blob/7d31fe75f8d73bb52e7f10384033429bafbc8186/packages/gatsby/src/utils/source-nodes.js

I'm not sure though how to cheaply find ___NODE references since recursing through every node would be very expensive.

pieh commented 5 years ago

I'm not sure though how to cheaply find ___NODE references since recursing through every node would be very expensive.

Not ideal, but we do recurse through each node when we produce example value for schema inference. But using that has potential to easily introduce regressions when changing seemingly unrelated code.

I honestly am not sure we should do that (shield ___NODE linked nodes from being garbage collected) - especially with onCreateNode callback that is really uniquely fitting for transformer plugins. The file node that is created and linked there is really derived from "parent" node so it makes sense to enforce that this File node is child.

I guess maybe instead of changing createRemoteFileNode to add support for setting parent, we could adjust createParentChildLink action to set both children on parent, and parent on child? https://github.com/gatsbyjs/gatsby/blob/e651e9c1a4aef644801305fdc7ff4843c18d8267/packages/gatsby/src/redux/actions.js#L760-L773

That + warning/error containing instructions to use that action seems like it would solve this without adding overhead of recursing nodes to find ___NODE fields

KyleAMathews commented 5 years ago

Hmm thought of a cheap way to do the fix. Just track which nodes are created in onCreateNode from which node. This would be an implicit parent/child relationship even if the data wasn't modeled that way. We could wrap the createNode action to save this. Then in the GC step, we could do a deep search on the remaining parent nodes for links.

sidharthachatterjee commented 5 years ago

So I just tried to reproduce this with gatsby-source-instagram-all and can't seem to be able to. It is consistently working for me.

redux-state.json is written in .cache correctly and onCreateNode is being called everytime the server is started.

Any ideas?

sidharthachatterjee commented 5 years ago

Turns out the Instagram API returns tags for an image in random order every time. So content digest is different every time! That's why they were getting marked as dirty and couldn't reproduce.

sidharthachatterjee commented 5 years ago

Hmm thought of a cheap way to do the fix. Just track which nodes are created in onCreateNode from which node. This would be an implicit parent/child relationship even if the data wasn't modeled that way. We could wrap the createNode action to save this. Then in the GC step, we could do a deep search on the remaining parent nodes for links.

This was published in

- gatsby-remark-graphviz@1.0.7
 - gatsby-source-drupal@3.0.24
 - gatsby-source-filesystem@2.0.21
 - gatsby-source-shopify@2.0.15
 - gatsby-source-wordpress@3.0.37
 - gatsby-transformer-screenshot@2.0.12
 - gatsby@2.1.11
johno commented 5 years ago

Going to close this since rather than guiding we've been able to fix it with code itself. Thanks @KyleAMathews, @pieh and @sidharthachatterjee!

pieh commented 5 years ago

Reopening this, because this was fixed in single instance (createRemoteFileNode) and not generally, so users can still hit this - as demonstrated in #12981

gatsbot[bot] commented 5 years ago

Hiya!

This issue has gone quiet. Spooky quiet. 👻

We get a lot of issues, so we currently close issues after 30 days of inactivity. It’s been at least 20 days since the last update here.

If we missed this issue or if you want to keep it open, please reply here. You can also add the label "not stale" to keep this issue open!

Thanks for being a part of the Gatsby community! 💪💜

smerth commented 5 years ago

Hi there,

I have an issue and I think it might fit here. I wrote a plugin called gatsby-source-gh-readme, which you can find on NPM gatsby-source-gh-readme and the repo is on github here gatsby-source-gh-readme.

The plugin makes a graphql call to the github graphql api and should pull in the file contents of the README.md file at the root of each project in a github user's account and then create a node (of mediaType: "text/markdown") for each in Gatsby.

The plugin works, for each README.md file Gatsby creates a node of type markdown and then those node are processed by the remark plugins just like the blog post markdown files...

However, frequently (almost every time,) I need to clear the cache before running gatsby develop or the build will fail. If I clear the cache prior to running gatsby develop the build succeeds.

I prepare the data for createNode() with this function:

// Helper function that processes a repository node to match Gatsby's node structure
  const processRepo = repo => {
    const nodeId = createNodeId(`repo-readme-${repo.id}`);
    const readme = repo.readme;
    const nodeData = Object.assign({}, repo, {
      id: `repo-readme-${nodeId}`,
      parent: null,
      children: [],
      internal: {
        mediaType: "text/markdown",
        type: `GithubReadme`,
        content: readme,
        contentDigest: createContentDigest(repo)
      }
    });
    return nodeData;
  };

you can see that parent is null and children is an empty array. I don't understand what parent should be, or indeed what children should be…

Apart from the fact that there is probably a better way to write this plugin, it does seem to be example of what your talking about in this issue.


Apart from flagging this as a potential example of the problem discussed in this issue. I have some general questions about this plugin that are not related:

maiertech commented 4 years ago

Here is an example of a theme that creates tag nodes that disappear from the cache for no obvious reason: https://github.com/reflexjs/reflex/issues/39.