Unbounded input sizes (within a page) were
being sent to the GPU for inference. This changeset
caps the batch size at 512 tokens.
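As a rough sketch of the idea (names and the helper below are illustrative, not the actual service code), capping the batch size amounts to slicing a page's token list into fixed-size chunks and running inference per chunk instead of on the whole page at once:

```python
# Illustrative sketch only: MAX_BATCH_TOKENS, batched, and predict_page
# are hypothetical names, not the real TIMO/SPP implementation.
MAX_BATCH_TOKENS = 512

def batched(tokens, batch_size=MAX_BATCH_TOKENS):
    """Yield successive fixed-size slices of a page's token list."""
    for start in range(0, len(tokens), batch_size):
        yield tokens[start:start + batch_size]

def predict_page(tokens, run_inference):
    """Run inference chunk-by-chunk and concatenate per-token outputs,
    so GPU memory use scales with the cap rather than the page size."""
    outputs = []
    for batch in batched(tokens):
        outputs.extend(run_inference(batch))
    return outputs
```

With this shape, a 10,000-token page costs roughly twenty bounded inference calls rather than one unbounded one.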
Context: the citation-mentions TIMO endpoint
has been suffering since SPP prod was pointed at
it. Broadly, it shows very high VRAM utilization
that is not resolved by reducing the worker count
per machine, along with very high inference times. Both
symptoms were also seen in the bibentry detector and
were resolved with a similar patch.
Unlike that model, however, I was not able to
reproduce the memory spikes in a test setting with
real papers, likely due to the relative rarity of papers
with excessive numbers of tokens on individual pages.
The issue was reproducible, however, by fabricating
pathological data and sending it to test
servers. Seen below, the first spikes in activity are
real papers sampled from SPP's recent work queue,
and the later spikes in activity are the same papers
rewritten to be "one" page long, containing all of their tokens.
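The pathological rewrite described above can be sketched as follows (a hypothetical helper, assuming papers are represented as per-page token lists; not the actual test harness):

```python
# Illustrative sketch: collapse a paper's per-page token lists into a
# single "page" carrying every token, so the per-page input size is
# as large as the whole document. Representation is an assumption.
def collapse_to_one_page(pages):
    """pages: list of per-page token lists -> one-page equivalent."""
    all_tokens = [tok for page in pages for tok in page]
    return [all_tokens]
```

Sending such inputs makes the unbounded per-page batch the common case rather than a rarity, which is what reproduced the spikes.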
Unbounded, existing solution, 1 worker:
With this PR's changes, 1 worker: