Linbreux / wikmd

A file based wiki that uses markdown
https://linbreux.github.io/wikmd/
MIT License
340 stars, 37 forks

Docker image is huge #68

Closed: Oliver-Hanikel closed this issue 2 years ago

Oliver-Hanikel commented 2 years ago

The image is way too big for a wiki that tries to be lean.

Architecture  Size¹
amd64         448MB
armhf         563MB
aarch64       703MB

¹Size of #67 images

I'll try to improve this by using Alpine as the base image. There are other options too, such as reducing the number of dependencies or swapping some of them out. For example, the installed size of pandoc is 100MB on arm64, while markdown would only take up 57KB.

Linbreux commented 2 years ago

@Oliver-Hanikel Yes, I know. The image size can be reduced drastically, but I haven't looked into that yet... For now I used pandoc because all my notes had math and LaTeX embedded in them. Pandoc was the one tool I used at the time that fully supported all my needs. If there are other options that support LaTeX etc., I'll be happy to hear about them!

Oliver-Hanikel commented 2 years ago

Is LaTeX support even needed in the markdown converter? Isn't MathJax used in the frontend for the conversion? If LaTeX is not needed we could switch to pymd4c. According to its authors, md4c is also the fastest markdown converter there is.

Oliver-Hanikel commented 2 years ago

After switching to python:alpine as base image:

Architecture  Size¹
amd64         338MB
aarch64       395MB

But this does not work for armhf, because the pandoc package does not exist for armhf.

Linbreux commented 2 years ago

> Is LaTeX support even needed in the markdown converter? Isn't MathJax used in the frontend for the conversion? If LaTeX is not needed we could switch to pymd4c. According to its authors, md4c is also the fastest markdown converter there is.

I wrote the math in all my notes with LaTeX syntax, and I use references (https://github.com/Linbreux/wikmd/blob/main/wiki/How%20to%20use%20the%20wiki.md); image sizing is also easy. Once we have developed a macro system I wouldn't mind changing from pandoc to another converter. But for now I personally use too many features of pandoc. I'll take a look at pymd4c though, thanks for the suggestion!

Linbreux commented 2 years ago

That's quite an improvement!

> But this does not work for armhf, because the pandoc package does not exist for armhf.

Hmm, could we build one ourselves from source?

Oliver-Hanikel commented 2 years ago

I managed to remove BeautifulSoup, Markdown and Pandoc from the dependencies and added PyMD4C as a replacement. They didn't promise too much: it really is blazingly fast. The example documents took 400-800ms to render with pandoc on my laptop; MD4C renders them in 2-8ms. But there are still a few things that aren't working:

Most of these are probably fixable with the DOM parser and a bit of work. Currently I am using the HTMLRenderer, which is implemented almost entirely in C, so switching to the DOM parser will probably make rendering a bit slower. If someone wants to test it, here is the branch.
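For anyone who wants to reproduce render timings like the ones quoted above, here is a minimal timing sketch. It is not the benchmark used in this thread: `time_render_ms` and `stand_in_render` are hypothetical names, and the stand-in renderer only escapes HTML so that the block runs without pandoc or PyMD4C installed. To compare real converters, pass their render function in place of the stand-in.

```python
import html
import time


def time_render_ms(render, text, repeats=5):
    """Return the best wall-clock time in milliseconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        render(text)
        best = min(best, (time.perf_counter() - start) * 1000.0)
    return best


def stand_in_render(markdown_text):
    # Placeholder only (assumption): a real test would call the converter
    # under comparison here instead of escaping the text.
    return "<pre>" + html.escape(markdown_text) + "</pre>"


if __name__ == "__main__":
    sample = "# Title\n\nSome *markdown* text.\n" * 200
    print(f"best of 5: {time_render_ms(stand_in_render, sample):.3f} ms")
```

Taking the best of several runs reduces noise from other processes, which matters when the numbers being compared are only a few milliseconds.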

Architecture  Size
amd64         124MB
armhf         105MB
aarch64       123MB

Linbreux commented 2 years ago

@Oliver-Hanikel Interesting! Like I said, this would be an interesting implementation. @kura implemented a cache system which should speed up loading times drastically. Once all the features in features.md work with another HTML renderer, we could switch. I don't think it would be a smart move to remove features yet.

Pandoc is not the best option, but it supports tons of features: https://pandoc.org/MANUAL.html#pandocs-markdown
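To illustrate why a cache takes most of the sting out of a slow converter, here is a minimal sketch of one possible approach. It is not @kura's implementation, which is not shown in this thread: `RenderCache` is a hypothetical name, and the sketch assumes pages are files on disk and that `render` is any callable turning markdown text into HTML. A page is re-rendered only when its modification time changes.

```python
import os


class RenderCache:
    """Cache rendered HTML per file, invalidated when the file's mtime changes."""

    def __init__(self, render):
        self._render = render   # callable: markdown text -> HTML (assumption)
        self._entries = {}      # path -> (mtime_ns, html)

    def get(self, path):
        mtime = os.stat(path).st_mtime_ns
        cached = self._entries.get(path)
        if cached is not None and cached[0] == mtime:
            # File unchanged since the last render: skip the converter.
            return cached[1]
        with open(path, encoding="utf-8") as handle:
            html = self._render(handle.read())
        self._entries[path] = (mtime, html)
        return html
```

With something like this in front of the converter, the 400-800ms pandoc cost would only be paid on the first view after an edit, not on every page load.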

kura commented 2 years ago

I personally don't see a problem with an image that is 400MB+ in size, given that the documents and uploads in my wiki are already 200MB+ in size.

As for replacing pandoc, I think Markdown would be a good alternative, since it has support for the ToC feature in development and it's already in use in the Whoosh search feature. Any removal of BeautifulSoup would mean needing a tool that is capable of converting Markdown to plaintext directly to replace the Markdown -> HTML -> Plaintext step done in the search module to make the content indexable in a way that is searchable. I should also add that the Markdown library makes it very easy to write your own extensions which would be a simple way to implement any macros you want.

Markdown does have an extension that could be used to handle the LaTeX which may mean everything in features.md is supportable with a library like Markdown rather than pandoc.

kura commented 2 years ago

So, I just checked and even the smallest LaTeX library that can be used by the Markdown-LaTeX extension is 160MB alone so it's not that much of an image size reduction.

Oliver-Hanikel commented 2 years ago

> When we have developed a macro system I wouldn't mind changing from pandoc to another one. But for now I personally use too many features from pandoc.

Yeah my branch definitely isn't ready for usage, there are too many features missing. It is more of an experiment.

> I personally don't see a problem with an image that is 400MB+ in size, given that the documents and uploads in my wiki are already 200MB+ in size.

Well, new images download much faster, and the image generally builds faster now too. I am running wikmd on a Raspberry Pi 3B with pretty small markdown files, so I prefer a leaner Docker image. A smaller image also means fewer writes to the SD card, so the card lasts longer.

> Any removal of BeautifulSoup would mean needing a tool that is capable of converting Markdown to plaintext directly to replace the Markdown -> HTML -> Plaintext step done in the search module to make the content indexable in a way that is searchable.

You can either do this with PyMD4C as shown here or with the HTMLParser from the standard library, which is also the parser BeautifulSoup uses in the current version of wikmd. Here is a working version of that.
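Since the linked code is not reproduced in this thread, here is a rough sketch of the standard-library route for the HTML -> plaintext step. It is not the linked version: `TextExtractor` and `html_to_plaintext` are hypothetical names, and skipping `script`/`style` content and collapsing whitespace are my own assumptions about what a search index would want.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, skipping script/style."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace so the result is one indexable line.
        return " ".join(" ".join(self._chunks).split())


def html_to_plaintext(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```

One caveat of this sketch: text chunks are joined with spaces, so a word split by inline markup (for example `<b>bo</b>ld`) would be indexed as two tokens.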

I am now looking into using TinyTeX in the Docker image to make it smaller while still using pandoc.

kura commented 2 years ago

Just FYI, I made a very small set of changes that replaces 90% of the pandoc functionality using the python-markdown library. The only thing that isn't working properly is the LaTeX functionality. I tried using a ~400MB install of texlive to handle the LaTeX, but it isn't properly detecting and handling things like |- **wiki** $\leftarrow$ This folder

As a note, it also hooks into some of the markdown extensions to add things like table of contents support using the built-in toc extension.

Maybe something like TinyTeX, as mentioned, would be a better solution and would maybe fix some of the LaTeX issues? I may give it a try later.

I had not thought about using the built-in HTMLParser for search... I'll give that a whirl now.