CMClay / metalsmith-lunr

Metalsmith plugin to integrate Lunr.js search engine.

How to strip html tags before generating searchIndex.json? #12

Open abhijeetvramgir opened 8 years ago

abhijeetvramgir commented 8 years ago

This is my lunr snippet from the build file:

.use(lunr({
    preprocess: function(content) {
        // Replace all occurrences of __title__ with the current file's title metadata.
        return content.replace(/__title__/g, this.title);
    }
}))

How do I strip HTML tags?

janthonyeconomist commented 7 years ago

I'm using this to a) strip HTML, b) transliterate, and c) strip punctuation:

preprocess: function(content) {
    // Replace each character found in the map with its Latin substitute.
    const tr = (str) => {
        const map = {"а":"a" /* truncated for diff */ };
        let new_str = "";
        for (let i = 0; i < str.length; i++) {
            const char = str[i];
            new_str += map[char] || char;
        }
        return new_str;
    };
    return tr(content.replace(/<[^>]+>/g, ' ')) // Strip HTML, then transliterate foreign characters
        .replace(/[^\w]/g, ' ');                // Strip punctuation
}

That seems to remove the HTML and punctuation from the contents; however, I think some punctuation is still getting through to the index in other fields. Is that right?
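
One way to check what the cleaning steps do outside of a build is to pull them into a standalone function and run it on sample strings. This is only a minimal sketch: `cleanText` is a hypothetical helper name, not part of metalsmith-lunr, and the transliteration step is left out because the map above is truncated.

```javascript
// Hypothetical helper for illustration only; not part of metalsmith-lunr.
function cleanText(text) {
    return String(text)
        .replace(/<[^>]+>/g, ' ') // Replace HTML tags with a space
        .replace(/[^\w]/g, ' ')   // Replace everything except ASCII letters, digits and underscore
        .replace(/\s+/g, ' ')     // Collapse runs of whitespace
        .trim();
}

console.log(cleanText('<p>Hello, <b>world</b>!</p>')); // prints "Hello world"
```

Note that `\w` in JavaScript only matches ASCII letters, digits and underscore, so any non-ASCII characters that were not transliterated first are replaced with spaces too. If `preprocess` only receives the file contents (an assumption; nothing in this thread confirms it either way), other indexed fields such as the title would need the same function applied to them separately.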