MarkBind / markbind

MarkBind is a tool for generating content-heavy websites from source files in Markdown format
https://markbind.org/
MIT License

Provide an easy way to exclude hidden content from algolia search index #773

Closed damithc closed 5 years ago

damithc commented 5 years ago

I have managed to try algolia integration with a non-trivial site, https://nus-te3201.github.io/2019/index.html

After some tweaks to the config file, with the help of the Algolia team, the search results are nicely categorized into Admin Info, SE Textbook, Programming Textbook.

The next challenge is to prevent hidden content (e.g., modals, popovers, unselected tabs, unexpanded panels) from being indexed, as it cannot be reached by clicking on a search result. Algolia provides a selectors_exclude mechanism for that, but we need a way to identify such hidden content.

Is there anything we can do to make it easier? e.g., add a unique class to all such content so that they can be specified in the algolia config file easily?
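As a rough illustration of the selectors_exclude idea, the sketch below builds a DocSearch-style config object in which a single class marks all hidden content. The class name `hidden-content`, the index name, the URL, and the selector values are placeholders for illustration, not values from the actual site's config.

```typescript
// A minimal sketch of a DocSearch-style crawler config.
// All concrete values here are placeholders (assumptions), not the real site's settings.
const docsearchConfig = {
  index_name: "example-site",              // placeholder index name
  start_urls: ["https://example.com/"],    // placeholder start URL
  selectors: {
    lvl0: "h1",
    lvl1: "h2",
    text: "p, li",
  },
  // Elements matching these CSS selectors are skipped by the crawler.
  // ".hidden-content" is a hypothetical class that the site generator would add.
  selectors_exclude: [".hidden-content"],
};

// Serialized form, as it would appear in a JSON config file.
const configJson: string = JSON.stringify(docsearchConfig, null, 2);
```

With one shared class, the config stays the same no matter how many kinds of hidden components the site uses.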

damithc commented 5 years ago

Ideally, it should be possible to reach such content via search results (in which case there is no need to exclude them from the search index), but probably that's too hard to do?

damithc commented 5 years ago

Ideally, it should be possible to reach such content via search results (in which case there is no need to exclude them from the search index), but probably that's too hard to do?

i.e., the page detects the target anchor is inside a hidden element and automatically triggers that element to become visible.

yamgent commented 5 years ago

i.e., the page detects the target anchor is inside a hidden element and automatically triggers that element to become visible.

Probably would need a bit of digging around the codebase, but the most-likely algorithm seems to be do-able given time:

damithc commented 5 years ago

Probably would need a bit of digging around the codebase, but the most-likely algorithm seems to be do-able given time:

Good to hear that.

  • In the page's script, if the heading id matches an entry in the map, call the open() method of each associated panel (from top to bottom if nested), then jump to the heading.

Would this work for tabs too?

We also need to consider the scrolling to the target position. If not done right, scrolling could happen before opening, ending up in the wrong position. This is a problem in some existing pages already, where scrolling happens before the page has assumed its final height.
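The ordering concern above (open first, scroll last) could be sketched roughly as follows. This is only an illustration under assumptions: the `Revealable` interface, the `RevealMap` shape, and the `revealThenScroll` name are hypothetical and not part of MarkBind, and the sketch is synchronous, whereas a real page may need to wait for the components to finish rendering before scrolling.

```typescript
// Hypothetical interface: anything that can be made visible (panel, tab, modal).
interface Revealable {
  open(): void;
}

// Hypothetical map: heading id -> hidden containers enclosing it, outermost first.
type RevealMap = { [headingId: string]: Revealable[] };

// Opens every enclosing container from outermost to innermost, then scrolls.
// Scrolling is deliberately the last step so the target is visible by then.
function revealThenScroll(
  headingId: string,
  map: RevealMap,
  scrollTo: (id: string) => void
): void {
  const containers: Revealable[] = map[headingId] || [];
  for (let i = 0; i < containers.length; i++) {
    containers[i].open();
  }
  scrollTo(headingId);
}
```

Because tabs and modals could implement the same hypothetical `open()` method, the same routine would cover them as well as panels. In a browser, the `scrollTo` callback might look up the element by id and call `scrollIntoView()` on it.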

damithc commented 5 years ago

Also, we can do this in two steps.

  1. Provide an easy way to specify hidden content to Algolia. This may be needed as some users might not want to index such content even if it is possible. We'll probably need one unique identifier for each type of element. e.g. hidden-tab, hidden-panel etc.
  2. Gradually provide the ability to reach hidden content, starting with more content-heavy elements such as modals, tabs etc.

damithc commented 5 years ago

@marvinchin see if we can do at least item 1 for V2. Without it, Algolia search is pretty much unusable for our main use case, the CS2103 website, as some of the search results cannot be reached by clicking on them.

marvinchin commented 5 years ago

Sure, I'll take a look at this soon!

marvinchin commented 5 years ago

Some preliminary thoughts:

What the user sees should be identical to what the scraper "sees". However, in our case it seems that some content is hidden from the user but visible to the scraper, and is hence indexed.

The problem seems to be that MarkBind sites are client-rendered. The DOM in the original HTML contains all the content (including hidden content) before the client-side JavaScript runs and hides it.

The DocSearch crawler, by default, assumes sites are server-rendered and thus indexes everything in the original HTML. We can update the configuration to indicate that the website is client-rendered, so that the crawler executes the client-side JavaScript before indexing the content. Perhaps we should include this in the documentation for the Algolia plugin after verifying that this works.

I believe this might be a way to solve the issue of hidden content being indexed without the tedium and brittleness of tagging all hidden content with a unique class.
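For illustration, switching a crawler config to client rendering might look like the sketch below. The option names `js_render` and `js_wait` are what I recall from the legacy DocSearch scraper configuration and should be checked against its documentation; the `ScraperConfig` type, the `withClientRendering` helper, and the example values are hypothetical.

```typescript
// Hypothetical, trimmed-down shape of a crawler config.
interface ScraperConfig {
  index_name: string;
  start_urls: string[];
  js_render?: boolean; // assumed option: render pages with a browser before indexing
  js_wait?: number;    // assumed option: seconds to wait for client-side scripts
}

// Returns a copy of the config with client rendering enabled.
// The original config object is left untouched.
function withClientRendering(config: ScraperConfig, waitSeconds: number): ScraperConfig {
  return { ...config, js_render: true, js_wait: waitSeconds };
}
```

The wait value is a trade-off: too short and hidden content may still be indexed, too long and each crawl takes more time.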

damithc commented 5 years ago

The DocSearch crawler, by default, assumes sites are server-rendered and thus indexes everything in the original HTML. We can update the configuration to indicate that the website is client-rendered, so that the crawler executes the client-side JavaScript before indexing the content. Perhaps we should include this in the documentation for the Algolia plugin after verifying that this works.

Thanks for investigating, @marvinchin. I'll try that option to see if it gives us the intended outcome. Yes, we should include it in our documentation if the option indeed works.

damithc commented 5 years ago

Further thoughts: Eventually, we want contents of hidden tabs (and possibly some collapsed panels) to be searchable as their content may not be repeated anywhere else in the site. But this requires the support of step 2 above, and possibly step 1 too.

damithc commented 5 years ago

Looks like Algolia doesn't like the client-rendering option: https://github.com/algolia/docsearch-configs/pull/780

I also assume indexing based on client-side rendering is not exactly reliable, as it is hard to predict how long a page would take to load completely?

marvinchin commented 5 years ago

Yes, there is some variability involved with client side rendering, unfortunately 🙁. However, that can be mitigated by setting a long enough delay.

I suppose we will need to resort to adding identifying classes to avoid this. I will investigate how this can be done automatically for vue-strap elements.

marvinchin commented 5 years ago

@damithc I've updated the Algolia plugin to add the algolia-no-index class to content that will be hidden by VueStrap components. Unfortunately, I do not have access to any Algolia-enabled sites, so I am not able to test this independently.
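To make the tagging idea concrete, here is a simplified sketch that adds the algolia-no-index class to the opening tags of elements assumed to start hidden. It is not the plugin's actual code: the `tagHiddenContent` function and the choice of `modal` as an example tag are hypothetical, and it rewrites the HTML with a regular expression for brevity, where a real plugin would more likely use an HTML parser. Self-closing tags and attribute values containing `>` are not handled.

```typescript
const NO_INDEX_CLASS = "algolia-no-index";

// Adds NO_INDEX_CLASS to every opening tag whose name is in hiddenTags.
// Assumes tag names contain only letters, digits, and hyphens.
function tagHiddenContent(html: string, hiddenTags: string[]): string {
  // Matches e.g. <modal> or <modal id="x">, but not <modals> or </modal>.
  const openingTag = new RegExp("<(" + hiddenTags.join("|") + ")((?:\\s[^>]*)?)>", "gi");
  const classAttr = /(^|\s)class="([^"]*)"/;

  return html.replace(openingTag, (_match: string, tag: string, attrs: string): string => {
    const found = classAttr.exec(attrs);
    if (found === null) {
      // No class attribute yet: append one.
      return "<" + tag + attrs + ' class="' + NO_INDEX_CLASS + '">';
    }
    // Merge into the existing class list without duplicating the class.
    const classes = found[2].split(/\s+/).filter((c: string) => c.length > 0);
    if (classes.indexOf(NO_INDEX_CLASS) === -1) {
      classes.push(NO_INDEX_CLASS);
    }
    const newAttrs = attrs.replace(classAttr, () => found[1] + 'class="' + classes.join(" ") + '"');
    return "<" + tag + newAttrs + ">";
  });
}
```

The crawler config would then list `.algolia-no-index` under selectors_exclude so that the tagged elements are skipped.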

Would it be possible to test if this works by:

Thanks! 🙂

damithc commented 5 years ago

Should we make the class name more general? e.g., hidden-content

Testing this is going to be tricky though. I don't have the dev environment set up, as so far I have only used the production version, and at the moment I'm stuck with an older version because of the href bug in the latest version.

marvinchin commented 5 years ago

I prefixed the class name with algolia since Algolia should be the only use case for it (the functionality is also implemented in the Algolia plugin), and I was thinking of avoiding unnecessary coupling with the rest of the MarkBind behaviour. Is there any other use case where we might need to use these classes outside of Algolia?

Perhaps I could catch you for a short while tomorrow to see how we might be able to test this?