head > meta > content escaping issue

oliviertassinari commented 5 years ago

Do you want to request a feature or report a bug?

I'm guessing it's a bug.

What is the current behavior?

The following source code,

<meta property="og:image" content="https://onepixel.imgix.net/60366a63-1ac8-9626-1df8-9d8d5e5e2601_1000.jpg?auto=format&q=80&mark=watermark%2Fcenter-v5.png&markalign=center%2Cmiddle&h=500&w=500&s=60ec785603e5f71fe944f76b4dacef08" />

, is being escaped once server side rendered:

<meta property="og:image" content="https://onepixel.imgix.net/60366a63-1ac8-9626-1df8-9d8d5e5e2601_1000.jpg?auto=format&amp;q=80&amp;mark=watermark%2Fcenter-v5.png&amp;markalign=center%2Cmiddle&amp;h=500&amp;w=500&amp;s=60ec785603e5f71fe944f76b4dacef08"/>

You can reproduce the behavior like this:

const React = require("react");
const ReactDOMServer = require("react-dom/server");
const http = require("http");

const doc = React.createElement("html", {
  children: [
    React.createElement("head", {
      children: React.createElement("meta", {
        property: "og:image",
        content:
          "https://onepixel.imgix.net/60366a63-1ac8-9626-1df8-9d8d5e5e2601_1000.jpg?auto=format&q=80&mark=watermark%2Fcenter-v5.png&markalign=center%2Cmiddle&h=500&w=500&s=60ec785603e5f71fe944f76b4dacef08"
      })
    }),
    React.createElement("body", { children: "og:image" })
  ]
});

//create a server object:
http
  .createServer(function(req, res) {
    res.write("<!DOCTYPE html>" + ReactDOMServer.renderToStaticMarkup(doc)); //write a response to the client
    res.end(); //end the response
  })
  .listen(8080); //the server object listens on port 8080

editor: https://codesandbox.io/s/my299jk7qp output : https://my299jk7qp.sse.codesandbox.io/

What is the expected behavior?

I would expect the content not being escaped. It's related to https://github.com/zeit/next.js/issues/2006#issuecomment-355917446. I'm using the og:image meta element so my pages can have nice previews within Facebook :).

capture d ecran 2018-10-12 a 14 15 26

Which versions of React, and which browser / OS are affected by this issue? Did this work in previous versions of React? 16.5.2

gaearon commented 5 years ago

Has this ever worked as you intend? Can you send a fix?

oliviertassinari commented 5 years ago

We are solving the problem this way:

import Entities from 'html-entities/lib/html5-entities'

const entities = new Entities()
const contentRegExp = /content="([^"]+)"/g
const handleContent = (match, content) => {
  return `content="${entities.decode(content)}"`
}

html = html.replace(contentRegExp, handleContent)

We spend ~1ms per request in the path. It's not too bad. I can give it a look at some point.

oliviertassinari commented 5 years ago

I have found this related issue: #6873. Digging into the implementation, the behavior comes from https://github.com/facebook/react/blob/0005d1e3f54b79fe4707fbccc44b89e0fb4ce565/packages/react-dom/src/server/DOMMarkupOperations.js#L61 ⬇️ https://github.com/facebook/react/blob/b87aabdfe1b7461e7331abb3601d9e6bb27544bc/packages/react-dom/src/server/quoteAttributeValueForBrowser.js#L17 ⬇️ https://github.com/facebook/react/blob/b87aabdfe1b7461e7331abb3601d9e6bb27544bc/packages/react-dom/src/server/escapeTextForBrowser.js#L108

Now, all the escaping tests I could find are covering the children use case: https://github.com/facebook/react/blob/b87aabdfe1b7461e7331abb3601d9e6bb27544bc/packages/react-dom/src/__tests__/escapeTextForBrowser-test.js#L23-L24 I have limited knowledge of web escaping related security issues. I don't see any harm potential with:

 const response = ReactDOMServer.renderToString(<span data-src={'&'}></span>); 
 expect(response).toMatch('<span data-reactroot="" data-src="&"></span>');

andreubotella commented 5 years ago

I have the same problem in the content of <style> elements:

const React = require("react");
const ReactDOMServer = require("react-dom/server");

console.log(ReactDOMServer.renderToStaticMarkup(
  <html>
    <head>
      <link
        href="https://fonts.googleapis.com/css?family=Source+Sans+Pro"
        rel="stylesheet"
      />
      <style>{`
        html {
          font-family: "Source Sans Pro", sans-serif;
        }
      `}</style>
    </head>
    <body>
      <p>Test.</p>
    </body>
  </html>
));

This outputs:

<html><head><link href="https://fonts.googleapis.com/css?family=Source+Sans+Pro" rel="stylesheet"/><style>
        html {
          font-family: &quot;Source Sans Pro&quot;, sans-serif;
        }
      </style></head><body><p>Test.</p></body></html>

By the parsing rules in the HTML spec (I'm consulting WHATWG here), the contents of elements style, xmp and iframe (as well as noscript, noframes and noembed when they're not being rendered) are parsed with the RAWTEXT tokenizer state, which treats everything as plaintext until it finds a matching closing tag.

Escaping the contents of style elements is, however, valid (in fact, mandatory for angled brackets) in the XML syntax of HTML; and indeed, adding an xmlns="http://www.w3.org/1999/xhtml" attribute to the <html> element results in valid XML. But if the intention of ReactDOMServer is indeed to render XML syntax, that should be explicitly noted in the documentation, because there are a number of tools (such as Next.js) which serve the output of these functions with content-type text/html.

oliviertassinari commented 5 years ago

@andreubotella This is a different problem, you should use dangerouslySetInnerHTML. Can an admin mark the comments as "resolved"?

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contribution.

stale[bot] commented 4 years ago

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!

jbraithwaite commented 4 years ago

This is not a bug in React. Using an entity reference for & (e.g. &) is the correct behavior for xhtml documents:

In both SGML and XML, the ampersand character ("&") declares the beginning of an entity reference (e.g., ® for the registered trademark symbol "®"). Unfortunately, many HTML user agents have silently ignored incorrect usage of the ampersand character in HTML documents - treating ampersands that do not look like entity references as literal ampersands. XML-based user agents will not tolerate this incorrect usage, and any document that uses an ampersand incorrectly will not be "valid", and consequently will not conform to this specification. In order to ensure that documents are compatible with historical HTML user agents and XML-based user agents, ampersands used in a document that are to be treated as literal characters must be expressed themselves as an entity reference (e.g. "&"). For example, when the href attribute of the a` element refers to a CGI script that takes parameters, it must be expressed ashttp://my.site.dom/cgi-bin/myscript.pl?class=guest&name=userrather than ashttp://my.site.dom/cgi-bin/myscript.pl?class=guest&name=user`.

In the HTML spec you do not need to use a character reference for & as long as what follows it is not a string that forms a named character reference.

The example they give is:

<a href="?bill&ted">Bill and Ted</a> <!-- &ted is ok, since it's not a named character reference -->
<a href="?art&amp;copy">Art and Copy</a> <!-- the & has to be escaped, since &copy is a named character reference -->

Personally, I feel like React made the right call with escaping & since that works in both XHTML and HTML5.

equinusocio commented 4 years ago

In meta tags escaped paths don't work... otherwise, this bug would not have be opened.

jbraithwaite commented 4 years ago

This is the change needed to get the behavior you expect:

Replace https://github.com/facebook/react/blob/ee409ea3b577f9ff37d36ccbfc642058ad783bb0/packages/react-dom/src/server/ReactPartialRenderer.js#L383

with an escape hatch:

if (tagVerbatim === 'meta' && propKey === 'content') {
  markup = 'content="' + propValue + '"';
} else {
  markup = createMarkupForProperty(propKey, propValue);
}

This would explicitly exempt the meta tag's content attribute from being properly escaped which wouldn't help @oliviertassinari's issue of wanting <span data-src={'&'}></span>.

A more generic solution would involve having something like dangerouslySetAttributes

<span
  dangerouslySetAttributes={{__attributes: [{name: 'data-src', value: '&'}]}}
/>

This could easily lead to parsing errors and unexpected results if any value after the & is a named character reference e.g. &copy (without the ;)

Again, the issue was with the HTML parser FB was using for the Sharing Debugger, not React. It is properly parsing the escaped paths now.

Macil commented 3 years ago

@equinusocio Anything reading the meta tags from the HTML should handle decoding the HTML encoded characters that React produces here. It's a bug with anything consuming the HTML if it doesn't decode the &s correctly. If you're trying to test your work and want to read HTML-decoded value, then in dev tools element inspector, pick the meta tag element, and in the console run $0.content. That will return the proper value because the browser handles HTML decoding properly as anything parsing the HTML should.

YoannBuzenet commented 3 years ago

Hello, I'm having this problem and can't find any fix. I'm on React 17.0.1. Would anyone have a hint ?

notsidney commented 3 years ago

I’ve spent the last little while trying to figure out this issue since the encoded URLs cause LinkedIn to not see Open Graph images.

I’ve come up with a workaround for our Gatsby site that uses ReactDOMServer.renderToStaticMarkup, replacing & to & in the string, then rendering using dangerouslySetInnerHTML:

const headHtml = ReactDOMServer.renderToStaticMarkup(headComponents).replace(
  /&amp;/g,
  '&'
)

return (
  <html>
    <head dangerouslySetInnerHTML={{ __html: headHtml }} />

...

And here’s the full html.js for a Gatsby project: https://github.com/notsidney/gatsby-meta-encoded-url-workaround/blob/main/src/html.js#L6-L25

gaearon commented 3 years ago

Do I understand correctly that the issue is primarily due to the content scrapers that don’t consider encoded characters? So the React output is technically correct but some scrapers are too naïve? Or is there a reason why React behavior is technically incorrect?

ihmpavel commented 3 years ago

Yes and no. For example https://resoc.io/setup will fail showing you an image if you provide encoded characters (that means this service is unusable for React users). Same behavior is for Next.JS.

Inspired by https://github.com/vercel/next.js/issues/2006#issuecomment-900696618 <meta property="og:image" content='/api/share?one=1&two=2' /> will become <meta property="og:image" content='/api/share?one=1&two=2' />.

In the api/share.ts, (automatically) parsed request (req.query), will become an object with keys { one: 1, 'amp;two': 2 }. It is incorrectly parsed and unusable. Libraries should handle this fine, but not all of them do. It is a bigger problem, but quite unique.

Unfortunately there is no easy solution (dangerouslySetInnerHTML, decoding escaped HTML... - it does not suit for every problem).

gaearon commented 3 years ago

Yes and no

"No" to which part?

For example https://resoc.io/setup will fail showing you an image if you provide encoded characters (that means this service is unusable for React users).

I understand that particular services may not work. What I'm asking is — technically would you agree that the problem is on their end, or is the problem unambiguously on our end? I don't mean to turn this into "not our problem, bye" kind of argument, but I think we need to get clarity on the ideal situation before we can move forward with any plan here. So is React producing technically correct output (and resoc's parser is wrong) or is React producing technically incorrect output? I understand this question may not be important to your use case (it doesn't work either way) but it is important to me.

Same behavior is for Next.JS.

I don't know what you mean by this. Next.JS is using React so it's expected that it would have the same behavior. You mention "same behavior" after resoc.io which makes it sound like you're saying Next.JS is also incorrectly parsing something, but Next.JS is producing, not consuming, and it's producing via React, so I'm not sure if there's any extra meaning I'm missing here. I'd expect Next.JS to not be relevant in this issue because the issue is about React behavior itself, and discussing React alone should be enough to figure out the path forward here. But maybe I missed something!

Inspired by vercel/next.js#2006 (comment) <meta property="og:image" content='/api/share?one=1&two=2' /> will become <meta property="og:image" content='/api/share?one=1&two=2' />.

OK so maybe this is what you mean by "for Next.JS". I think you're saying: "when the URL is encoded, the request received by my API built with Next.JS contains incorrect parameter". However, this assertion is missing an important detail: what exactly is sending a request to your Next.JS API? Is this Facebook? Twitter? LinkedIn? resoc.io? This is important — I need to understand which case exactly you are trying to solve, if there is a concrete one.

In the api/share.ts, (automatically) parsed request (req.query), will become an object with keys { one: 1, 'amp;two': 2 }. It is incorrectly parsed and unusable. Libraries should handle this fine, but not all of them do.

It is unclear which libraries you're referring to here. Next.JS? I believe Next.JS's behavior is correct here — if the request itself has an encoded string, Next.JS isn't supposed to unencode it. I believe unencoding should've been done by whatever thing that parsed your meta tag. This is why I'm asking what is the exact service you're trying to make this work with. And whether the same problem exists for other services now. E.g. if FB and Twitter treat this correctly, I think this could be a strong enough argument that you should report the problem to the faulty service instead, and have them fix it.

depoulo commented 3 years ago

In my view it's just naïve scrapers, some if which are run by Facebook, see https://github.com/facebook/react/issues/6873#issuecomment-403089412

devanshj commented 2 years ago

fwiw afaict this makes you unable to write inline scripts in NextJS.

phawk commented 2 years ago

Whilst React's approach might be technically correct, it doesn't work in the real world, because we have naive scrapers, from massive organisations like LinkedIn, Slack and WhatsApp. So you can choose to be technically correct and pedantic or to have your code work, in the real world, right now. I for one choose the latter.

I'm trying to get open graph images with query params to work, now I need to look for a hacky workaround because I can't go and force these massive tech organisations to "fix" their code.

Macil commented 2 years ago

I'm wondering if most of the people who believe this is an issue for themselves are mistaken. I first found this issue because I was trying to figure out why Twitter wasn't showing an image from one of my <meta> tags. I viewed the HTML my server was rendering with React, copied the URL from the content="..." part of my meta tag, and found that it didn't work in my browser, so I assumed the reason that didn't work was the same reason Twitter wasn't loading my image. However, the reason that copied URL didn't work in my browser was just because I copied the raw HTML instead of the HTML-decoded string, and it was not the reason that Twitter wasn't loading my image. The actual issue was that I wasn't using the proper set of <meta> tags that Twitter respected. (There were no entries in my request logs from Twitter. If the issue had been that they weren't properly doing HTML decoding, then I would have seen entries in the request logs for URLs containing & where there should have been &.)

I also wonder if some people are improperly HTML-encoding some strings before giving them to React and then thinking the double-encoding is caused by React doing HTML-encoding. People of both of these situations are present in the related thread https://github.com/facebook/react/issues/6873, so it seems likely that at least a decent portion of people in this thread have the same confusions.

Before suggesting putting workarounds in React for other services' parsers, we should try to be sure the problem is not one of these other dev-side confusions, we should figure out a list of services we're sure have the issue, we should try to report the issue to them first, and only then move onto figuring out workarounds if the list of services with problems is still large.

Several people have suggested that LinkedIn is unable to handle properly HTML-encoded meta tags, but from my tests they seem to handle things fine. I created this test HTML page:

<!DOCTYPE html>
<html>
  <head>
    <title>test page</title>
    <meta property="og:title" content="Title of the article" />
    <meta property="og:image" content="/1234567.png?a=b&amp;c=e" />
    <meta
      property="og:description"
      content="Description that will show in the preview"
    />
  </head>
  <body>
    this is a test page
  </body>
</html>

The HTML for the meta og:image tag has "/1234567.png?a=b&c=e", which should decode to "/1234567.png?a=b&c=e".

I made a directory, put that file as "index.html" in it, put a random image named "1234567.png" in the directory too, I ran npx http-server . to host that directory through a local http server, and then I ran ngrok http 8080 to get a public URL that leads to that HTTP server. I then put the URL into https://www.linkedin.com/post-inspector/.

My local HTTP server's logs immediately showed the following, confirming that it correctly decoded:

[2022-02-23T23:33:31.661Z]  "GET /" "LinkedInBot/1.0 (compatible; Mozilla/5.0; Apache-HttpClient +http://www.linkedin.com)"
[2022-02-23T23:33:32.181Z]  "GET /1234567.png?a=b&c=e" "LinkedInBot/1.0 (compatible; Mozilla/5.0; Apache-HttpClient +http://www.linkedin.com)"

Additionally, the LinkedIn Post Inspector page correctly showed the detected image URL as https://a126-omitted-3fb7.ngrok.io/1234567.png?a=b&c=e.

It seems like LinkedIn is handling things correctly and working with the standard HTML-encoding behavior React is following, and that no change or workaround in React is necessary to work here with them. If there are other parts of LinkedIn that do have an issue, then the issue could be demonstrated with a test like this.

apecollector commented 2 years ago

Same issue, using og:image where the url uses query parameters, twitter sees these and sends them to my server which doesn't recognize &query=value instead of &query=value.

talentedandrew commented 2 years ago

@apecollector I had the similar experience with renderToStaticMarkup method, which escapes html tags, new line etc. I used unescape from lodash to fix it.

_.unescape('&amp;query=value');

outputs

&query=value

sc0ttdav3y commented 1 year ago

Hi all, I know this is a 4 year old issue and probably most of you have probably moved on, but I'm new to it and feel I must be missing something obvious.

I suspect we're all adding content security policies to our React apps by now, and those policies have a content value often with single quotes in them. I'm running Next.JS with static site generation, and their advice is to add it into /pages/_document.tsx. But this happens:

<meta httpEquiv="Content-Security-Policy" content="default-src 'self'" />

becomes

<meta httpEquiv="Content-Security-Policy" content="default-src &amp;self&amp;" />

What's the recommended way to do this in 2022?

devilportez commented 1 year ago

Hi, I have the same issue as @sc0ttdav3y has, but in my case single quotes are escaping with ' (Next.JS 12.3.0, ssg).

Macil commented 1 year ago

(I just tried the same thing with NextJS and the exact output I got is <meta http-equiv="Content-Security-Policy" content="default-src 'self'"/>, consistent with https://github.com/facebook/react/issues/13838#issuecomment-1246305603, so I'm going to assume the different value in https://github.com/facebook/react/issues/13838#issuecomment-1241593329 was a copy-paste error of some kind.)

@sc0ttdav3y @devilportez It should be fine that the value is html-encoded like that. The browser will decode html entities when reading the html and interpret the value of the meta tag the same way as "default-src 'self'". You should see the CSP policy be in effect: if you make the page use something denied by the pollicy (like inline styles and scripts, which NextJS uses by default) then an error will appear in the console.

BrodaNoel commented 1 year ago

Still happening in next@13.1.6 and react@18.2.0.

Kind of killing my SEO because I can't use:

<meta property="og:image" content="url-with-&" />

aaronbarker commented 1 year ago

I ran into this escaping issue and tried a hack from another issue I had also ran into. It seemed to have fixed the escaping for me. You get a few unnecessary <style></style> elements injected, but may be a better option for some than parsing the final content. YMMV


<style
  dangerouslySetInnerHTML={{
    __html: `</style>
      ${yourCode}
    <style>`,
  }}
></style>

jimrandomh commented 9 months ago

Wound up here investigating why a link-preview was missing its preview image on Twitter. The escaping turned out to be a red herring; Twitter's link-preview bot handles escaped URLs just fine. The problem was that it wasn't using our own mirror of the image, it was using a user-submitted image URL on pbs.twimg.com, which apparently blocks all bot traffic (including from Twitter's own link-preview bot).

erickreutz commented 4 months ago

Running into this issue as well using Next.js and generateMetadata <meta description="we're" /> becomes <meta description="we're" />

facebook / react

head > meta > content escaping issue #13838