Index the full text of each article

The-Encryption-Compendium / TECv2

Hugo-based version of The Encryption Compendium.

https://encryptioncompendium.org

GNU General Public License v3.0


Index the full text of each article #35

Open dkg opened 3 years ago

dkg commented 3 years ago

The compendium would be really useful if the content of each article were also indexed, in addition to its abstract and bibtex entry.

For example, the xapers project indexes PDFs and lets the user run full-text searches over whatever corpus they're interested in (i recognize that xapers isn't integrated into a static site generator, but perhaps something comparable could be done here). This would let a user of the compendium find any article that talks about (for example) "forward secrecy" or "backdoors" even if that term isn't in the abstract.

This kind of change would probably increase the size of the javascript for searching significantly, so there might be some engineering work to be done to make this feasible/efficient.

Just wanted to flag this as something that seems like it would be a useful feature.

kernelmethod commented 3 years ago

This would definitely be a good feature to have! I'm not sure about the legal implications of storing the full text of a resource and then serving it statically, though, even if I only provided it in a limited or condensed form. Some of the sources are closed-access, so that question may matter more for those resources.

In the long term the goal will likely be to stop serving the site completely statically, since eventually the entries.json file is going to be impractically large. Once that change is made it should hopefully be easier to have some more advanced features like this.

dkg commented 3 years ago

i suspect there's still a way to serve the site completely statically, even with a large entries.json -- there could be a sharding-based technique that lets people fetch the fractions of an index that they need to use for any particular search without grabbing the whole thing. entries.json wouldn't be the only thing served, of course -- you'd want to serve a compiled index, not the untransformed full text.
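To make the sharding idea concrete, here is a minimal sketch in Python. It is not how the compendium's search works today: the shard count, the CRC32-based shard assignment, and every function name below are assumptions chosen for illustration. The idea is that a build step writes each shard as its own static file, and a search only fetches the shards that hold its query terms.

```python
import re
import zlib
from collections import defaultdict

NUM_SHARDS = 16  # hypothetical value; tune to the size of the index


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def shard_for(term, num_shards=NUM_SHARDS):
    """Map a term to a shard using a hash that is stable across runs."""
    return zlib.crc32(term.encode("utf-8")) % num_shards


def build_shards(docs, num_shards=NUM_SHARDS):
    """Build an inverted index (term -> sorted doc ids), split into shards."""
    shards = defaultdict(lambda: defaultdict(set))
    for doc_id, text in docs.items():
        for term in tokenize(text):
            shards[shard_for(term, num_shards)][term].add(doc_id)
    return {
        shard_id: {term: sorted(ids) for term, ids in terms.items()}
        for shard_id, terms in shards.items()
    }


def search(query, fetch_shard, num_shards=NUM_SHARDS):
    """Return ids of docs containing every query term.

    fetch_shard(shard_id) stands in for an HTTP GET of one static shard
    file; each needed shard is fetched at most once per query.
    """
    result = None
    cache = {}
    for term in tokenize(query):
        shard_id = shard_for(term, num_shards)
        if shard_id not in cache:
            cache[shard_id] = fetch_shard(shard_id)
        ids = set(cache[shard_id].get(term, []))
        result = ids if result is None else result & ids
    return sorted(result or [])


# Toy corpus standing in for extracted article text (made-up strings).
docs = {
    "a": "Forward secrecy in messaging protocols",
    "b": "Backdoors and key escrow",
    "c": "Secrecy without forward guarantees",
}
shards = build_shards(docs)
fetched = []


def fetch(shard_id):
    """Record which shards a query touches, then return that shard."""
    fetched.append(shard_id)
    return shards.get(shard_id, {})


hits = search("forward secrecy", fetch)
```

Note that this matches documents containing all query terms in any order, not the exact phrase; phrase search would also need term positions in the index. Only the original full text is needed at build time, so the served shards contain terms and ids rather than the articles themselves.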

But even without anything fancy like a sharded json index, #27 just cut ~800KiB from the overall size of the pageload, and entries.json itself is a mere 122KiB. So without changing any of your indexing tech, the site can still afford to increase entries.json by a factor of 6 and not make the pageload any more expensive than it was before. If you were to serve entries.json with standard gzip compression you'd save even more: it compresses to 39KiB right now, so the headroom in that case is about a factor of 20.
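The arithmetic behind those two factors can be checked with a short sketch. It takes the figures quoted in the comment (800KiB saved, 122KiB raw, 39KiB gzipped) as given and reads each factor as "bytes saved divided by current file size", which is one interpretation of the estimate. The helper names are made up for illustration, and `gzip_size` shows one way to measure the compressed size of a JSON payload locally.

```python
import gzip
import json


def growth_factor(saved_kib, current_kib):
    """How many multiples of the current file fit in the bytes already saved."""
    return saved_kib / current_kib


def gzip_size(obj):
    """Size in bytes of obj serialized as JSON and then gzip-compressed."""
    raw = json.dumps(obj).encode("utf-8")
    return len(gzip.compress(raw))


# Figures quoted in the comment above, in KiB.
uncompressed = growth_factor(800, 122)  # between 6 and 7
compressed = growth_factor(800, 39)     # between 20 and 21
```

A real measurement would pass the parsed contents of entries.json to `gzip_size`; the result depends on the compression level and on how repetitive the entries are.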