Avoiding duplicate content data for performance

jlevy commented 8 years ago

Forgive me if this has been discussed — I'm new to phenomic (which looks pretty cool!) and to using React for static content.

Do folks have thoughts or suggestions on how to avoid sending the document content twice, i.e. once for the statically rendered HTML and once for the React .json? For small pages this doesn't matter, but for larger documents, it has a significant effect on download sizes. Yes, compression helps, but overall this consideration still seems to put frameworks like this at a disadvantage compared to simpler non-React static documentation sites, if you want to scale to larger docs, say 100KB+.

Shouldn't it be possible to rebuild the store from the DOM, if it's all there? (Incrementally filling it in might be another option, but as with most "static" sites, you also want the page fully browsable without JS and indexed by search engines.) All I've seen on this is a bit of older discussion here: https://news.ycombinator.com/item?id=10009570

thangngoc89 commented 8 years ago

Hi there,

I don't see the extra HTML as a problem. Only one HTML file is downloaded on first load. After that, Phenomic only downloads the .json files needed to render other pages. On first load, the HTML also includes a minimal store containing the information needed to rebuild the current page, so no .json is fetched for the first page.

jlevy commented 8 years ago

Ah interesting, thanks. Looking closer, I see the .json isn't loaded until later. So this is different from other approaches like Gatsby's then, where it all gets bundled together. Might be worth it to specifically elaborate how each resource works in docs or FAQ, as this is a distinct advantage!

thangngoc89 commented 8 years ago

Closing for now. Feel free to reopen this if my answer isn't enough for you.

MoOx commented 8 years ago

Reopening to remind us to add something clearer to the docs and FAQ.

jlevy commented 8 years ago

Actually, there is still the issue that the content in the index.html is duplicated. If I build a single 100k Markdown file, I see the generated index.html is 466k, since it contains copies of the content in three different formats: one in the raw HTML, a second in the initial JS state as the "body" value, and a third as Markdown in the "raw" value. (The .json file also contains both "body" and "raw" copies.) Am I missing something?

MoOx commented 8 years ago

No, you are not. There is an open issue about that (removing raw & rawBody by default). I will handle that in the next release (hopefully in the next few days).

MoOx commented 8 years ago

Please track #189 about this.

jlevy commented 8 years ago

Cool, thanks. To recap, seems like there are potentially 3 kinds of duplication with React for static content when compared to a traditional static approach:

1. Downloading multiple pages when you only need one (already handled well in Phenomic)
2. Including raw markdown (in initial html or subsequent js/json loads) (#189)
3. Duplication of HTML content in JS on first load

Sounds like 1 and 2 are addressed, but 3 currently not.

I think it'd be worth adding to the FAQ: "How big is the page load with Phenomic compared to a traditional static website generator?" (covering duplication and other React/framework weight) or "Will Phenomic work with large documents?"

MoOx commented 8 years ago

1. This has nothing to do with React. If you are thinking about stuff like Gatsby, this is just a matter of "do I want all pages or not". This was a "nope" for me; that's why I worked on Phenomic ;)
2. Like you said, #189.
3. This is a false problem imo. Let me explain.

HTML is in this case a static render of the DOM, and React handles the DOM for us. If you only send the HTML with no data, React will need the page data to regenerate the current state of the page in memory and check the current DOM is ok (otherwise it will throw a warning and update the DOM).

If we create HTML pages with just HTML and no data, Phenomic will have to make another HTTP request to get the data, which is imo not a good idea.
Keep in mind that repetition is handled well by GZIP, so I'm not sure the final size will be very different.
Keep in mind that only one HTML page will be used per user session.

jlevy commented 8 years ago

Thanks for the detailed response. Re 1, yes, and this is why I'm looking at Phenomic too. :)

Re 3: Yes, understood. Regarding compression, yes of course it helps, but I don't think it's accurate to assume gzip eliminates the cost (since for large pages the duplicate strings are far apart and the symbol buffers are pretty small by default). Just for fun, here's an example manually removing the duplicated portions on Phenomic-generated output:

  968K  index.html.full           323K  index.html.full.gz
  518K  index.html.noraw          167K  index.html.noraw.gz
  260K  index.html.noraw_nojs      85K  index.html.noraw_nojs.gz

(Admittedly server or CDN compression settings may vary.)
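The distance effect can be reproduced with a minimal Node.js sketch. It assumes the default zlib settings, where DEFLATE can only refer back to the previous 32 KB of input, and it uses random base64 text as a hypothetical stand-in for page content (real HTML compresses better, so the absolute numbers will differ). The helper name `duplicationRatio` is made up for this illustration.

```javascript
const zlib = require("node:zlib");
const crypto = require("node:crypto");

// Hypothetical helper: gzip a block of text once and twice in a row, and
// return how much larger the doubled version is after compression.
function duplicationRatio(sizeKb) {
  // 768 random bytes encode to exactly 1024 base64 characters.
  const content = crypto.randomBytes(sizeKb * 768).toString("base64");
  const single = zlib.gzipSync(Buffer.from(content)).length;
  const doubled = zlib.gzipSync(Buffer.from(content + content)).length;
  return doubled / single;
}

// A 10 KB duplicate sits inside the 32 KB window, so it is nearly free.
const smallRatio = duplicationRatio(10);
// A 100 KB duplicate is out of reach of the window and is paid for again.
const largeRatio = duplicationRatio(100);

console.log("10 KB page, duplicated:", smallRatio.toFixed(2));  // close to 1
console.log("100 KB page, duplicated:", largeRatio.toFixed(2)); // close to 2
```

This matches the pattern in the measurements above: for large pages, each extra copy of the content costs about as much compressed as the first one.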

Anyway, the part that's unclear to me (since I'm not that experienced with React) is, surely there is a way to hack the loading process and reconstruct the state directly from the HTML without any fetch? I realize this is not standard practice but the data is right there. It'd be cool to see Phenomic have similar weight to non-React static website generators.

MoOx commented 8 years ago

We can't safely reconstruct the data. Well, maybe we can by parsing the page (but this will have a cost) to read the title, "page content", etc., but that will be very error-prone and will lead to bugs. Maybe we can base64 the content or something to "minimize" it, but this will have a rendering cost (maybe not a big deal, but still).

> I realize this is not standard practice but the data is right there. It'd be cool to see Phenomic have similar weight to non-React static website generators.

In order to provide HTML only we have only one solid solution: remove json from the page, and lazy load it, like for other pages (but this will create another HTTP request). We can offer an option so people can choose to have data for first page in the html or to lazy load it. But we definitely need the proper JSON data.

Also we plan to support more than just HTML render for markdown (eg: react too, see #434) and with this solution, we won't be able to "read" the data by parsing (or this will be very complicated to implement...).

I would argue that HTML with JSON is probably smaller than most assets like images, so it does not look like a big deal to me, since it's for the first page only.

I am open for ideas tho.

thangngoc89 commented 8 years ago

This is a long and very interesting discussion. Perhaps I can push #189 forward, to allow removing raw and rawBody if this concerns you a lot.

> It'd be cool to see Phenomic have similar weight to non-React static website generators.

I think it'd be cool too. Phenomic's HTML output is almost like a universal React app. So IMHO, it's impossible unless there are some changes in React's core.

jlevy commented 8 years ago

Agree, thanks for the discussion. I do think removing raw content (or at least providing the option to suppress it) is a very good idea. Triplicating content just seems unfortunate for big pages (esp since as you see above, you can't assume compression magically removes duplication).

Perhaps I'm still missing something, but I still don't see why it's "impossible" or all that error prone to read initial state out of the HTML. Almost all the duplication is the "body" field of the single page state written into window.__INITIAL_STATE__, but instead of serializing that whole "body" string, wouldn't it work to just initialize it with something like document.getElementsByClassName("phenomic-BodyContainer")[0].innerHTML, which is by design the same thing? I know it's a hack, but it's pretty simple, and looks like it'd cut the page size a lot, all together 323K -> 85K in the example above. :)
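A minimal sketch of that idea, for illustration only. The class name `phenomic-BodyContainer` comes from the generated output described above; the helper `fillBodyFromDom`, the shape of the page object, and the stand-in `fakeDocument` (used so the sketch runs under Node without a browser) are all hypothetical and not Phenomic's actual API.

```javascript
// Hypothetical helper: if the serialized page state omits "body",
// recover it from the server-rendered markup instead.
function fillBodyFromDom(page, doc) {
  // If the body was serialized after all, keep it untouched.
  if (typeof page.body === "string") return page;

  const containers = doc.getElementsByClassName("phenomic-BodyContainer");
  if (containers.length === 0) {
    throw new Error("no phenomic-BodyContainer element found");
  }
  return { ...page, body: containers[0].innerHTML };
}

// Minimal stand-in for the browser's `document`, so this runs under Node.
const fakeDocument = {
  getElementsByClassName: (name) =>
    name === "phenomic-BodyContainer"
      ? [{ innerHTML: "<h1>Hello</h1><p>Some content</p>" }]
      : [],
};

// Page state as it might be serialized without the "body" field.
const pageWithoutBody = { title: "Hello" };
const hydratedPage = fillBodyFromDom(pageWithoutBody, fakeDocument);

console.log(hydratedPage.body);
```

In a browser, the real `document` would be passed in before React mounts. This only works if the rendered container holds exactly the string that would have been serialized as "body".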

MoOx commented 8 years ago

What if your component replaces some stuff in the HTML string it outputs? The innerHTML can then be wrong, and React will make a replacement again. You might say "that's ok in my case", but we cannot just add this kind of behavior. What about other fields, like a date that can be formatted? A title transformed into uppercase, etc.?

We might add a way to do what you ask but it's not a good idea to put that as the default value imo. Especially since we want to add support for non HTML rendering (this won't work at all).

Maybe we can add a way to read data from the page, but that will involve a "decoding" process (one that does the opposite of what your React components are doing, so it can be simple, but not always).
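A small sketch of why the "decoding" step can lose information, using plain functions rather than real React components. Everything here (`renderTitle`, `readTitleFromHtml`, the page data) is hypothetical and only illustrates the uppercase-title case mentioned above.

```javascript
// Hypothetical page data, as it would be serialized in the JSON state.
const originalData = { title: "Hello world" };

// A render step that transforms the data on output, like a component
// that displays the title in uppercase.
function renderTitle(page) {
  return `<h1>${page.title.toUpperCase()}</h1>`;
}

// Naive "decode": strip the tags to read the title back from the markup.
function readTitleFromHtml(html) {
  return html.replace(/<[^>]+>/g, "");
}

const renderedHtml = renderTitle(originalData);
const recoveredData = { title: readTitleFromHtml(renderedHtml) };

// The recovered value is the transformed one; the original casing is gone.
console.log(recoveredData.title);
console.log(recoveredData.title === originalData.title);
```

Any code that later uses the title elsewhere (a document title, a link label) would see the transformed value, which is why a generic decoder would need to invert whatever each component does.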

thangngoc89 commented 8 years ago

> Almost all the duplication is the "body" field of the single page state written into window.__INITIAL_STATE__, but instead of serializing that whole "body" string, wouldn't it work to just initialize it with something like document.getElementsByClassName("phenomic-BodyContainer")[0].innerHTML, which is by design the same thing?

This is the simplest case of using Phenomic. I usually use JavaScript to modify the content of the page, so the final HTML markup isn't the same as what the Markdown processor produces.

jlevy commented 8 years ago

Cool! Glad to see the raw markdown is now removed. Pardon the slowness on the discussion; I'm looking forward to trying this out again. :)

MoOx / phenomic

Avoiding duplicate content data for performance #547