perf: avoid cheerio in favor of regex

GalacticHypernova commented 3 months ago

Types of changes

[ ] Bug fix (a non-breaking change which fixes an issue)
[ ] New feature (a non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to change)

Description

This PR integrates native solution via regex to load and parse HTML, in order to improve runtime performance. Closes #341

Checklist:

[ ] My change requires a change to the documentation.
[ ] I have updated the documentation accordingly.
[ ] I have added tests to cover my changes (if not applicable, please state why)

vercel[bot] commented 3 months ago

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
nuxt-security	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	May 10, 2024 4:55pm

vejja commented 3 months ago

Hey @GalacticHypernova @Baroshem

Here is a description of how it works. The 3 main routines are 03-subresourceIntegrity, 04-cspSsgHashes and 99-cspSsrNonces. Each one is executed consecutively in this order

03-subresourceIntegrity needs to find all external script tags and detect the name of the file. Integrity hashes were already pre-computed at build time so once the name is found, the integrity tag can be inserted immediately. The regex needs to make the difference between external scripts (<script ... src="filename.js" .../>) and inline scripts (<script ...>//some code</script>) because only external resources can carry integrity. Basically the endgame is to transform <script ... src="filename.js" .../> into <script ... src="filename.js" ... integrity="xxx"/>, so this modifies the HTML source. Same for external links, except that not all links are eligible so the regex should also parse the rel attribute of the link.
04-cspSsgHashes is a bit of the contrary. Only inline scripts are hashed so the regex needs to do the opposite: only find inline elements and ignore external ones. The problem is that integrity hashes for external resources must also be added to the CSP header in the SSG case, so we still need to parse the external scripts for that purpose; However this is not extremely difficult because at the previous step all integrity tags were already inserted so it is easy to parse and find the value. Same for inline styles and links. In short, 04-cspSssgHashes does not modify the HTML, it only collects the hashes that need to be inserted in the CSP directives.
99-cspSsrNonces does not need to make the difference between external resources and inline elements, because the nonce is added to all elements. The integrity hashes were already added in the first step, and they don't need to be repeated in the CSP header so we don't care about integrity tags at all here. 99-cspSsrNonces modifies the HTML source by inserting the nonce attribute to the relevant tags.

Now at a higher level, these three routines are called at the time the html is rendered, with the render:html hook. This hook has an html variable which is an object with keys such as head, body, etc. Normally the scripts should be in head but I found out that sometimes they can be in other sections, so for the sake of completeness we inspect all sections.

Cheerio is used to parse the content of each section and find out the relevant parts of the HTML that need to be modified. Cheerio takes an HTML string as input, and transforms it into an AST tree of nodes. It then provides very convenient utilities to find elements by tag, make the difference between external and inline, inspect the attributes, modify the nodes, etc. But it is too slow.

This is why I added 02a-preprocessHtml and 99b-recombineHtml : it avoids parsing the same sections each time. So 02a-preprocessHtml starts by loading each of the sections in the cheerio parser, via cheerio.load. Each parsed section is saved under context.cheerios, which is just a temporary object to hold these results. Then 03-subresourceIntegrity, 04-cspSsgHashes and 99-cspSsrNonce can access the context.cheerios[section] temporary results, instead of having to cheerio.load the section over and over again. At the end we just recombine everything back into HTML with 99b-recombineHTML.

As you can see 02a and 99b were added afterwards, for performance reasons only. Hope this helps

GalacticHypernova commented 3 months ago

@vejja thanks for the explanation! This really does help a lot, and will be very useful for the PR. Does that mean no further modification needs to be made for sections in 02a (before the 3 routines)? No need to trim or anything like that? I was just wondering if it actually needs to be modified at all. P.S. I already have an idea in mind to handle caching these so that there won't be a need to iterate over everything again.

vejja commented 3 months ago

@vejja thanks for the explanation! This really does help a lot, and will be very useful for the PR. Does that mean no further modification needs to be made for sections in 02a (before the 3 routines)? No need to trim or anything like that? I was just wondering if it actually needs to be modified at all. P.S. I already have an idea in mind to handle caching these so that there won't be a need to iterate over everything again.

I think if you can get rid of 02a and 99b, it would be fantastic. Honestly they are a legacy of trying to quick-fix the cheerio issues

GalacticHypernova commented 3 months ago

I think if you can get rid of 02a and 99b, it would be fantastic. Honestly they are a legacy of trying to quick-fix the cheerio issues

👍 I mean, I do have a caching strategy in mind which might require 02a, I'm not sure yet, but 99b I do believe will be unnecessary.

GalacticHypernova commented 3 months ago

If I am correct (I haven't worked much with nitro's HTML hooks myself), the bodyAppend/prepend/etc return strings, right? They just return the stringified HTML? Because I see a $('script') inside a forEach so I want to make sure on what I actually need to do. In case it's an array within an array (like an array in each section) I would need to adjust the regex accordingly.

vejja commented 3 months ago

I can’t remember but I think Typescript gives you the type

Edit: it’s actually an array of strings:

https://github.com/nuxt/nuxt/blob/71ef8bd3ff207fd51c2ca18d5a8c7140476780c7/packages/nuxt/src/core/runtime/nitro/renderer.ts#L15

referenced in https://nuxt.com/docs/api/advanced/hooks#nitro-app-hooks-runtime-server-side

GalacticHypernova commented 3 months ago

Thanks! I remembered that vaguely but I also didn't get the chance to check it myself.

GalacticHypernova commented 3 months ago

@vejja in 03, can external scripts have <script src=...></script> syntax, or is it guaranteed to be <script src=.../>

vejja commented 3 months ago

@vejja in 03, can external scripts have <script src=...></script> syntax, or is it guaranteed to be <script src=.../>

I think in theory it should always be the first syntax as ')

GalacticHypernova commented 3 months ago

I mean, this would be x10 easier if the html's sections are actually one element each. I'll try it out real quick

GalacticHypernova commented 3 months ago

@vejja I have discovered some things:

It appears as though each section returns a single string in an array (it appears as though they are concatenated with \n and +).
Script elements cannot be nested, which is great, as that takes away most of the complexities with checking script within script.
Unfortunately the script contents don't have their own dedicated spot.
It appears as though my comment about the inside console.log and the likes is not even possible, which is also great as that takes away some more complexities (tested in the template), as it errors stating "invalid end tag", and also translates to dev app startup failure. This should get tested a bit more (like production env), but this probably means that we can safely assume means the actual end of the script.

vejja commented 3 months ago

@vejja I have discovered some things:

It appears as though each section returns a single string in an array (it appears as though they are concatenated with \n and +).

I think at least in head there can be several items in the array. Probably also depends on Nuxt settings such as nuxt.config.ts and useHead parameters ?

Script elements cannot be nested, which is great, as that takes away most of the complexities with checking script within script.

Unfortunately the script contents don't have their own dedicated spot.

It appears as though my comment about the inside console.log and the likes is not even possible, which is also great as that takes away some more complexities (tested in the template), as it errors stating "invalid end tag", and also translates to dev app startup failure. This should get tested a bit more (like production env), but this probably means that we can safely assume means the actual end of the script.

Very interesting, seems that your regex is good then ?

GalacticHypernova commented 3 months ago

Very interesting, seems that your regex is good then ?

I mean, I'm not entirely sure yet, becasue as I said, this still needs to get thoroughly tested to not have its own edge cases, but from what I had gathered so far, through layouts/app.vue/pages, both in templates and through useHead for programmatic script insertion, it errors on both build and dev, but I only tested it for like 20 minutes so I would try to test it a bit further before making the final verdict, but I guess for the most part this means the regex is indeed good

GalacticHypernova commented 3 months ago

@Baroshem you're probably way more experienced in nuxt than me, could I get a confirmation that doing anything with "" (like in a console.log) inside a script is disallowed and will fail to compile, meaning there is no special handling to add to such case?

vejja commented 3 months ago

@Baroshem you're probably way more experienced in nuxt than me, could I get a confirmation that doing anything with "" (like in a console.log) inside a script is disallowed and will fail to compile, meaning there is no special handling to add to such case?

Hi @GalacticHypernova I just verified, and it is not possible to include </script> in a javascript string. Trying to do so triggers a unterminated string literal Javascript error.

Baroshem commented 3 months ago

Hey guys, sorry but I was wuite busy for the last few days.

If you need my opinion on something please let me know and I will try to answer the best I can :)

GalacticHypernova commented 3 months ago

I just verified, and it is not possible to include </script> in a javascript string. Trying to do so triggers a unterminated string literal Javascript error.

Thank you! This does very much simplify it.

If you need my opinion on something please let me know and I will try to answer the best I can :)

Thanks! And it is perfectly fine!

GalacticHypernova commented 2 months ago

@vejja @Baroshem this is not yet finished, however I wanted to ask what your opinions on the current implementation is, and if there's anything I should change (the errors in the tests don't seem quite right considering I'm directly modifying them as strings and return the modifies ones so I don't quite know where the origin of the error is).

vejja commented 2 months ago

@vejja @Baroshem this is not yet finished, however I wanted to ask what your opinions on the current implementation is, and if there's anything I should change (the errors in the tests don't seem quite right considering I'm directly modifying them as strings and return the modifies ones so I don't quite know where the origin of the error is).

My opinion is that it looks good, thanks !!

The first test that fails is it('injects integrity on Nuxt root scripts') In that test, the section head has only one array item, and it is a string which is roughly in this format :

<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="modulepreload" as="script" crossorigin href="/_nuxt/entry.DYpycOZ-.js">
<link rel="modulepreload" as="script" crossorigin href="/_nuxt/index.-yXp0hiT.js">
<link rel="modulepreload" as="script" crossorigin href="/_nuxt/nuxt-link.sHJ0UhRl.js">
<link rel="prefetch" as="style" href="/_nuxt/error-404.c7Cl-q-2.css">
<link rel="prefetch" as="script" crossorigin href="/_nuxt/error-404.DFp4BJVl.js">
<link rel="prefetch" as="script" crossorigin href="/_nuxt/vue.f36acd1f.DbAQkOB5.js">
<link rel="prefetch" as="style" href="/_nuxt/error-500.DacwWOBj.css">
<link rel="prefetch" as="script" crossorigin href="/_nuxt/error-500.DKZOBwlI.js">
<script type="module" src="/_nuxt/entry.DYpycOZ-.js" crossorigin></script>

The SCRIPT_RE and LINK_RE regexes in 03-subresourceIntegrity don't find the correct strings to inject integrity attributes, which are in lines 3, 4, 5 and 11 (respectively: link to entry.js, link to index.js, link to nuxt-link.js, and script module entry.js).

Update: this is due to the presence of minus (-) hyphen in the src and href strings, which is excluded by the w regex check. I think you can delete the w check, as anything between the two " quotations mark would fit. This comment is also true for the integrity checks

GalacticHypernova commented 2 months ago

Update: this is due to the presence of minus (-) hyphen in the src and href strings, which is excluded by the w regex check.

Ah I see that, yea, sorry! I missed that one, the isssue was it defaulted to range as I forgot to escape the -. Updating right now!

I think you can delete the w check, as anything between the two " quotations mark would fit.

Well, I could, but I thought it would be better to provide a more focused pattern to prevent as many possible edge cases, in stead of resorting to something generic like .

GalacticHypernova commented 2 months ago

Update: I fixed some of the tests now, there was a slight misconfiguration in the regex. I'm still looking into the ones left

vejja commented 2 months ago

I think you can delete the w check, as anything between the two " quotations mark would fit.

Well, I could, but I thought it would be better to provide a more focused pattern to prevent as many possible edge cases, in stead of resorting to something generic like .

I just checked and file names can include any UTF-8 character (including accents such as éùö, symbols such as €§ and Chinese characters such as 上海+中國). The list would be too long to assemble properly. So let's go for the generic . With probably the benefit of being faster ?

GalacticHypernova commented 2 months ago

Well, in that case I will rework it. We'll just need to add thorough tests to make sure there are no regressions.

Baroshem commented 1 month ago

Hey folks, any progress on this? I would love to see this being added to the upcoming 2.0.0 release :)

GalacticHypernova commented 1 month ago

Hey! Sorry for the slow progress, been very busy with real life matters. I'm currently trying to debug the errors, I believe those are the only blockers. If you have any idea why they may be popping up I could push a fix, maybe even today.

GalacticHypernova commented 1 month ago

@vejja do you have any idea why the tests are failing? Nothing in this PR seems to be the cause of this. Is the issue in the unit tests?

GalacticHypernova commented 1 month ago

@vejja glad I could help! I just noticed I accidentally made as required when making the pattern, sorry about that and thanks for catching it! It's a bit hard to see that on phone :smiling_face_with_tear: If you don't mind I'll run some tests on it to make sure there's no performance impact. I didn't make changes like removing the unused files and working directly with the html object yet because I wanted the PR to work first so there won't be too much work to backtrack when debugging.

GalacticHypernova commented 1 month ago

I have 2 main candidates for the optional as pattern: We can use /<link(?=[^>]+\brel="(stylesheet|preload|modulepreload)")(?=[^>]+\bintegrity="([\w\-+/=]+)")(?=[^>]+\bas="(\w+)"|)[^>]+>/ But we can also use the pattern @vejja suggested, which is /<link(?=[^>]+\brel="(stylesheet|preload|modulepreload)")(?=[^>]+\bintegrity="([\w\-+/=]+)")(?=(?:[^>]+\bas="(\w+)")?)[^>]+>/

Any preferences? The only difference is that in the first pattern the positive lookahead adds an empty alternative with |, whereas the second one makes the inner pattern optional, so it would essentially do the same and check for either the as attribute or an empty alternative.

In terms of performance, from what I tested they perform similarly, but I guess only time will tell with full-fledged websites.

vejja commented 1 month ago

Can you give an example where the 2 patterns would give a different result ?

GalacticHypernova commented 1 month ago

Can you give an example where the 2 patterns would give a different result ?

I can try to find some edge case but from what I have tested thus far its the same, only expressed differently. The first pattern essentially tells the regex engine to make sure there is either the as attribute or an empty alternative (which would always be true since it's a fallback), and the second pattern (the one you suggested) tells the engine to look for a potential match with the as attribute, meaning it will also accept nothing as it's only a potential match.

Baroshem / nuxt-security