artefactual / atom-docs

Access to Memory (AtoM) documentation
https://www.accesstomemory.org/docs
Creative Commons Attribution Share Alike 4.0 International

Problem: PDF indexing limit not updated in AtoM documentation #264

Closed fiver-watson closed 5 months ago

fiver-watson commented 6 months ago

First reported in the user forum, 2024-02-25: https://groups.google.com/g/ica-atom-users/c/bksYJLY1AIs/m/oOJ-XcLzDgAJ

Affected version

Affected page(s)

Error encountered

Per PR-1569 in AtoM, the PDF index limit was raised by switching from a MySQL TEXT field (64 KB limit) to a MEDIUMTEXT field (16 MB limit). However, the documentation still says:

Currently, AtoM 2.x truncates PDF text after the first 65,535 bytes.

Recommended fix

  1. Update the number. With a 16 MB MEDIUMTEXT field, it should read "approximately 16,777,215 characters" (NOTE: characters, not bytes as it said previously).
  2. Given the confusion I saw, it would help to clarify that the limit applies to the text added to AtoM's search index from the PDF's text layer, and that exceeding the limit does not truncate the uploaded PDF itself.
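
For reference, the two numbers discussed above are the MySQL maximums for TEXT (2^16 - 1) and MEDIUMTEXT (2^24 - 1). MySQL notes that the effective maximum is lower when the value contains multibyte characters, which is why "approximately" is a sensible qualifier for the character count. The Python sketch below only illustrates that arithmetic. `truncate_to_bytes` is a hypothetical helper, not AtoM code, and it assumes UTF-8 text and a byte-based cutoff. How AtoM itself applies the limit is not described in this issue.

```python
# MySQL column maximums, expressed in bytes.
TEXT_MAX_BYTES = 2**16 - 1        # 65,535
MEDIUMTEXT_MAX_BYTES = 2**24 - 1  # 16,777,215


def truncate_to_bytes(text: str, max_bytes: int) -> str:
    """Hypothetical helper: cut text so its UTF-8 encoding fits in max_bytes."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops a multibyte character that the cut split in half.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


# Assumed sample data: each "é" takes 2 bytes in UTF-8, so 80,000 bytes total.
sample = "é" * 40_000
kept = truncate_to_bytes(sample, TEXT_MAX_BYTES)

# Compare the character count with the byte count of what survived the cut.
print(len(kept), len(kept.encode("utf-8")))
```

With two-byte characters, the number of characters that fit is roughly half the byte limit, so a figure given in characters is only exact for single-byte text.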