Can't push to GitHub because of large search_index.json (>80MB)

giswqs commented 1 year ago

My geemap Python package has over 140 Jupyter notebooks. The website built using mkdocs and mkdocs-jupyter has a large search_index.json file (> 80 MB). As a result, the website can no longer be pushed to GitHub. See the error.

I looked into the search_index.json file and found that most of its content is derived from the html files that mkdocs-jupyter generates from each notebook. See below an example. Text like this is probably useless for the search functionality and it increases the file size unnecessarily.

Is there a solution to this issue?
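
To see which pages dominate the index, a short script along these lines may help. It is only a sketch: it assumes the usual MkDocs index layout (a top-level "docs" list whose entries carry "location", "text" and "title", as in the entries quoted in this thread) and the default output path site/search/search_index.json. The function name largest_entries is made up for this illustration.

```python
import json
from pathlib import Path


def largest_entries(index: dict, top: int = 10) -> list[tuple[int, str]]:
    """Return (text length in characters, location) for the biggest entries."""
    sizes = [
        (len(doc.get("text", "")), doc.get("location", ""))
        for doc in index.get("docs", [])
    ]
    return sorted(sizes, reverse=True)[:top]


if __name__ == "__main__":
    # Assumed path: the index sits below the build directory, which is
    # "site" unless site_dir is configured otherwise.
    path = Path("site/search/search_index.json")
    if path.exists():
        index = json.loads(path.read_text(encoding="utf-8"))
        for size, location in largest_entries(index):
            print(f"{size:>12,d}  {location}")
```

Entries whose text runs to hundreds of thousands of characters are the ones worth opening to check for inlined scripts or styles.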

giswqs commented 1 year ago

@squidfunk Do you have any advice on this issue?

squidfunk commented 1 year ago

80mb? That's probably a little too big to be useful. Are you using Material for MkDocs 9? The size of the search index was reduced significantly, so it might help to upgrade if you didn't already. Otherwise, you could divide your project into smaller projects or use a hosted search solution like Algolia. Note that the 80mb need to be downloaded by the user's browser.

If you suspect that there's nonsense data in the search index, please provide a minimal reproduction and create an issue over at Material for MkDocs. Here's a guide how to do that. We can then look into it.

squidfunk commented 1 year ago

I've looked at your index – there are nonsense entries like:

{
      "location": "notebooks/59_whitebox/",
      "text": "(function (global, factory) { typeof exports === 'object' && typeof module !== 'undefined' ? module.exports = factory() : typeof define === 'function' && define.amd ? define(factory) : (global = global || self, global.ClipboardCopyElement = factory()); }(this, function () { 'use strict'; function createNode(text) { const node = document.createElement('pre'); node.style.width = '1px'; ... 500k more characters ...",
      "title": "59 whitebox"
    },

I'm not sure why. This is not generated by Material for MkDocs, so it must come from some plugin or from author-provided content.

giswqs commented 1 year ago

I think all projects using mkdocs-jupyter have the same issue. Each index.html generated by each notebook is about 650K. See this search_index.json of the TiTiler project @vincentsarago. It also contains a lot of nonsense code in the search_index.json.

The reason other projects have not run into this issue is that their websites contain only a small number of notebooks. My project has over 140 notebooks; multiplying 650K by 140 exceeds 80M, which makes search_index.json a huge file with a lot of nonsense code in it.
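
As a rough check of that estimate, using only the approximate figures quoted in this comment (the variable names are illustrative):

```python
notebooks = 140
html_kb_per_notebook = 650  # approximate size of each generated index.html

total_kb = notebooks * html_kb_per_notebook
total_mb = total_kb / 1000  # decimal megabytes

print(f"{total_kb:,} KB is about {total_mb:.0f} MB")  # 91,000 KB is about 91 MB
```

This is an upper-bound style estimate, since the index stores text extracted from the pages rather than the raw HTML, but it is in the same range as the reported file size.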

giswqs commented 1 year ago

You can see the hundreds of index.html generated by the notebooks. The huge search_index.json file essentially combines the content of all index.html files. https://github.com/giswqs/geemap/tree/11a5fc94619e2eb3f4269b9eb2234ba8a7416acb/notebooks

squidfunk commented 1 year ago

Then this must be fixed here. The built-in search plugin filters script and style tags, which seem to be the root cause.
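
To illustrate what filtering script and style tags means in practice, here is a minimal sketch using only Python's standard html.parser module. It is not the code of the Material for MkDocs or MkDocs search plugins; the names TextExtractor and extract_text are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text while ignoring the contents of script and style tags."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0  # greater than 0 while inside a skipped element
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skipped element.
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


sample = (
    "<h1>Title</h1><style>p { color: red; }</style>"
    "<p>Hello</p><script>console.log('hidden');</script><p>World</p>"
)
print(extract_text(sample))  # Title Hello World
```

If the indexer works like this, inlined JavaScript and CSS never reach the "text" field, no matter how much of it a page contains.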

giswqs commented 1 year ago

@squidfunk Do you mean this is an issue with mkdocs-jupyter or mkdocs-material?

squidfunk commented 1 year ago

I think this is an issue with mkdocs-jupyter. If you can produce a reproduction without mkdocs-jupyter that exhibits the same behavior, I'm happy to look into it. Other than that, I have no knowledge of mkdocs-jupyter, so I'm unable to help.

giswqs commented 1 year ago

@squidfunk Thanks for your insights.

@danielfrg Can you look into it? Ideally, the search_index.json should exclude all the nonsense CSS style code. Otherwise, we can't use the search functionality when the number of notebooks grows.

danielfrg commented 1 year ago

If I understand correctly mkdocs-material should filter the style tags when creating the index but it's not filtering the ones coming from this plugin?

giswqs commented 1 year ago

@danielfrg Correct. I think that's what @squidfunk meant. If you look at any search_index.json generated by mkdocs-jupyter, it always contains a lot of CSS style code, which is useless for search and increases the file size unnecessarily.

danielfrg commented 1 year ago

Got it, that's pretty weird if those are supposed to be filtered.

@squidfunk do you have any idea why this could be happening, or any place where I can start taking a look at it?

veghdev commented 1 year ago

I think it's also a problem that the notebook outputs are also included in the search index.

I display HTML/JS charts in the notebook output, which increases the search index by 20 MB for 7 notebooks.

squidfunk commented 1 year ago

@danielfrg no idea why. As said, I can check if the search plugin needs to be adjusted when you can create a minimal reproduction, but other than that I can only guess. If you manage to create one, and the issue lies within Material for MkDocs (= only happens for this theme, and not the mkdocs and readthedocs themes), please create a new issue.

veghdev commented 1 year ago

@squidfunk I made a simple example repo and the search-index json file will be the same for all themes.

squidfunk commented 1 year ago

Please create an issue over at Material for MkDocs then (please stick to our bug reporting process).

veghdev commented 1 year ago

> Please create an issue over at Material for MkDocs then (please stick to our bug reporting process).

@squidfunk Is this a Material problem? I get the "wrong" index file for the other themes (mkdocs, readthedocs) too.

squidfunk commented 1 year ago

Sorry, misread. It's very likely not a Material for MkDocs issue then – I'm not sure how this plugin alters the search index. Maybe the maintainer of this project can comment?

danielfrg commented 1 year ago

It doesn't, that's why I am surprised.

squidfunk commented 1 year ago

Here's the dump from page.content – this is what the search plugin uses for indexing. The assets should definitely be moved into separate files. I'll look into why our search plugin is breaking, but I can't investigate it for the other themes.

dump.html.zip

squidfunk commented 1 year ago

Okay, I think I've isolated the problem for the built-in search plugin of Material for MkDocs. The cause seems to be that subsequent script/style/object tags are not skipped correctly due to mutations on the list of elements to be skipped. I'll issue a bugfix release when I get some time to work on it. Regardless, it's bad practice to include so many inline styles and scripts repeatedly in the page content – those inline scripts and styles should be moved to external resources.
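
The general class of bug described here, changing a list while iterating over it so that the following element is skipped, can be shown with a generic sketch. This is not the code from the Material for MkDocs search plugin, and the actual defect there may differ in detail; both function names are hypothetical.

```python
SKIPPED = {"script", "style", "object"}


def drop_skipped_buggy(tags: list[str]) -> list[str]:
    """Buggy: removes items from the same list that is being iterated."""
    result = list(tags)
    for tag in result:
        if tag in SKIPPED:
            # Removing shifts later items left, so the element that
            # directly follows the removed one is never visited.
            result.remove(tag)
    return result


def drop_skipped_fixed(tags: list[str]) -> list[str]:
    """Fixed: build a new list instead of mutating during iteration."""
    return [tag for tag in tags if tag not in SKIPPED]


tags = ["p", "script", "style", "p"]
print(drop_skipped_buggy(tags))  # ['p', 'style', 'p']
print(drop_skipped_fixed(tags))  # ['p', 'p']
```

The buggy variant handles a single unwanted element correctly and only fails when two of them are adjacent, which would fit the observation that subsequent tags were not skipped.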

This will fix the issue reported here for Material for MkDocs, but not for all other themes, as other themes are most likely all using MkDocs' own search plugin, including mkdocs and readthedocs. It should definitely be fixable in this plugin as well, but I'm unable to set time aside for it. Also, if somebody could create an issue over at Material for MkDocs, that would be great, so we can track it there and reference it subsequently.

squidfunk commented 1 year ago

Fixed in https://github.com/squidfunk/mkdocs-material/commit/2c7b0a3fc1b665f0f2771b99ecc02d61e0df8ec3.

squidfunk commented 1 year ago

9.1.3 is out. The issue reported in the OP should be resolved and the size of the build output should be way, way down from the reported 80MB. As mentioned in https://github.com/danielfrg/mkdocs-jupyter/issues/123#issuecomment-1467625859, this only applies to the built-in search plugin of Material for MkDocs.
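
To check whether an environment already has that release, something like the following sketch could be used. It assumes the package is installed under the distribution name mkdocs-material and compares only the leading numeric part of the version string; the helper names are hypothetical.

```python
import re
from importlib import metadata

FIXED_IN = (9, 1, 3)  # release named in the comment above


def release_tuple(version: str) -> tuple[int, ...]:
    """Parse the leading dotted numeric part of a version string, e.g. '9.1.3'."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def has_search_fix(version: str) -> bool:
    """True if the version is 9.1.3 or later."""
    return release_tuple(version) >= FIXED_IN


if __name__ == "__main__":
    try:
        installed = metadata.version("mkdocs-material")
    except metadata.PackageNotFoundError:
        print("mkdocs-material is not installed")
    else:
        state = "includes" if has_search_fix(installed) else "predates"
        print(f"mkdocs-material {installed} {state} the fix")
```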

I recommend reporting this issue upstream to MkDocs as well.

danielfrg commented 1 year ago

Thanks for your help!

danielfrg / mkdocs-jupyter

Can't push to GitHub because of large search_index.json (>80MB) #123