approach0 / search-engine

A math-aware search engine.
http://approach0.xyz
MIT License

Unexpected Mathjax error in SERP #22

Open GaurangTandon opened 6 years ago

GaurangTandon commented 6 years ago

Search results page


Broken Mathjax of entry 8 copy-pasted for reference:

...Put it this way \int {\frac{x}{{\sqrt { ... t {\frac{{2ax + b}}{{\sqrt {a{x^2} + bx + c} }}dx}  - \frac{b}{{2a}}\int {\frac{{dx}}{{\sqrt {a{x^2} + bx + c} }}} 
\displaystyle \frac{c}{a} - \frac{{{b^2}}}{{4{a^2}}} < 0 =  -  ... 2}. 
w32zhong commented 6 years ago

@GaurangTandon Thank you for reporting, will investigate later when I get some time.

w32zhong commented 6 years ago

@GaurangTandon Hi, the reason is actually quite simple. The search result snippet tries to summarize a document in a short paragraph, so it has to skip some content while showing as many highlighted keywords as possible. This leads to the problem you have seen: in every case where it occurs, the skipped content (which is replaced by a ... string) falls in the middle of a LaTeX expression, and that very likely invalidates the expression.

The current content-skipping strategy is simple: given a number of keyword occurrences in the document (up to a threshold, MAX_HIGHLIGHT_OCCURS), pad the left and right sides of each keyword. The keywords are displayed along with their "padding", and any content that is not padded is skipped. The related logic is here: https://github.com/approach0/search-engine/blob/4780e499519677433543cb92ba8baa04b56f959a/search/snippet.c#L124-L125
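To make the padding idea concrete, here is a minimal Python sketch of it. This is not the code in snippet.c: the names `pad_windows`, `make_snippet`, `pad` and `max_occurs` are made up for illustration, and keyword positions are assumed to be given as character offsets.

```python
def pad_windows(text_len, keywords, pad):
    """Return merged (start, end) windows around each keyword occurrence.

    keywords: list of (start, end) character offsets of highlighted keywords.
    pad: number of characters kept on each side (hypothetical parameter).
    """
    windows = []
    for start, end in sorted(keywords):
        lo = max(0, start - pad)
        hi = min(text_len, end + pad)
        if windows and lo <= windows[-1][1]:
            # Overlaps or touches the previous window: merge the two.
            windows[-1] = (windows[-1][0], max(windows[-1][1], hi))
        else:
            windows.append((lo, hi))
    return windows


def make_snippet(text, keywords, pad, max_occurs=3):
    """Join the padded windows; skipped content becomes ' ... '.

    max_occurs plays the role of a limit such as MAX_HIGHLIGHT_OCCURS
    (assumed semantics: only the first few occurrences are used).
    """
    used = sorted(keywords)[:max_occurs]
    windows = pad_windows(len(text), used, pad)
    return " ... ".join(text[lo:hi] for lo, hi in windows)
```

Nothing in this sketch knows where a LaTeX expression starts or ends, so a window edge can land in the middle of one, and the snippet then contains only part of the expression.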

One way to fix this issue is to never skip any LaTeX content, but some LaTeX content is very long, and this strategy would make some snippets unacceptably long. So a smarter algorithm is needed, one that either includes a complete LaTeX clip or, if the clip is too long, does not include any part of it.
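One possible shape for such an algorithm, as a rough Python sketch rather than a finished design: take the windows that are about to be displayed, and for every LaTeX expression a window cuts through, either grow the window to cover the whole expression or shrink it to leave the expression out. The function name `adjust_windows` and the parameter `max_math_len` are hypothetical, and the offsets of the LaTeX expressions are assumed to be known already.

```python
def adjust_windows(windows, math_spans, max_math_len):
    """Make each window contain a math span whole or not at all.

    windows: sorted, non-overlapping (start, end) offsets to display.
    math_spans: sorted, non-overlapping (start, end) offsets of LaTeX
        expressions, delimiters included (locating them is not shown).
    max_math_len: longest expression still shown in full (hypothetical).
    """
    adjusted = []
    for lo, hi in windows:
        for m_lo, m_hi in math_spans:
            if m_hi <= lo or m_lo >= hi:
                continue  # no overlap with this window
            if lo <= m_lo and m_hi <= hi:
                continue  # expression is already complete
            if m_hi - m_lo <= max_math_len:
                # Short enough: grow the window to cover all of it.
                lo, hi = min(lo, m_lo), max(hi, m_hi)
            elif m_lo < lo and m_hi > hi:
                # Window lies inside a long expression: drop it.
                lo = hi
                break
            elif m_lo < lo:
                lo = m_hi  # long expression cuts the left edge
            else:
                hi = m_lo  # long expression cuts the right edge
        if lo >= hi:
            continue  # nothing left to show for this window
        if adjusted and lo <= adjusted[-1][1]:
            # Growing may have made neighbouring windows overlap.
            prev_lo, prev_hi = adjusted[-1]
            adjusted[-1] = (min(prev_lo, lo), max(prev_hi, hi))
        else:
            adjusted.append((lo, hi))
    return adjusted
```

One consequence of this sketch is that a keyword sitting inside a very long expression loses its window entirely, so a real implementation would probably need a fallback for that case.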

We can leave this issue open until a better skipping strategy is implemented.