allenai / mmda

multimodal document analysis
Apache License 2.0

constrain vram utilization in citation mentions #267

Closed cmwilhelm closed 1 year ago

cmwilhelm commented 1 year ago

Unbounded input sizes (within a page) were being sent to the GPU for inference. This changeset caps the batch size at 512 tokens.
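For illustration, here is a minimal sketch of the batching approach the changeset describes. The names (`chunk_tokens`, `predict_page`, `MAX_BATCH_TOKENS`, `model.predict`) are assumptions for the sake of the example, not the actual mmda implementation:

```python
from typing import Iterable, List

# Hypothetical constant; the PR caps each batch at 512 tokens.
MAX_BATCH_TOKENS = 512

def chunk_tokens(tokens: List[str], max_tokens: int = MAX_BATCH_TOKENS) -> Iterable[List[str]]:
    """Yield fixed-size slices of a page's tokens so no single
    batch sent to the GPU exceeds max_tokens."""
    for start in range(0, len(tokens), max_tokens):
        yield tokens[start:start + max_tokens]

def predict_page(model, tokens: List[str]) -> List[str]:
    """Run inference chunk-by-chunk instead of feeding the entire
    page at once, bounding peak VRAM usage.

    `model.predict` is a stand-in for whatever inference call the
    endpoint actually uses."""
    predictions: List[str] = []
    for batch in chunk_tokens(tokens):
        predictions.extend(model.predict(batch))
    return predictions
```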

Context: the citation mentions TIMO endpoint has been suffering since SPP prod was pointed at it. Broadly, it shows very high VRAM utilization that is not solved by reducing the worker count per machine, along with very high inference times. Both issues were also seen in the bibentry detector and were resolved with a similar patch.

Unlike that model, however, I was not able to reproduce the memory spikes in a test setting with real papers, likely because papers with excessive token counts on individual pages are relatively rare. The issue was reproducible, however, by fabricating pathological data and sending it to test servers. In the graphs below, the first spikes in activity are real papers sampled from SPP's recent work queue, and the later spikes are the same papers rewritten as a single page containing all of their tokens.
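A rough sketch of the pathological rewrite described above, assuming a simple dict representation of a paper's pages; this is illustrative only, not the actual test harness:

```python
def collapse_to_one_page(paper: dict) -> dict:
    """Rewrite a multi-page paper so that every token lands on a single
    page, producing the worst-case per-page token count for the model.

    Assumes a paper shaped like {"pages": [{"page_number": int,
    "tokens": [...]}, ...]}; the real SPP/mmda schema may differ.
    """
    all_tokens = [tok for page in paper["pages"] for tok in page["tokens"]]
    return {"pages": [{"page_number": 0, "tokens": all_tokens}]}
```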

Unbounded, existing solution, 1 worker: [screenshot: GPU utilization graph, 2023-07-13]

With this PR's changes, 1 worker: [screenshot: GPU utilization graph, 2023-07-13]