allenai / mmda

multimodal document analysis
Apache License 2.0

Memory fixes for bibentry predictor. #265

Closed · cmwilhelm closed this 12 months ago

cmwilhelm commented 12 months ago

Limits VRAM utilization by passing bib entries to the model in fixed-size batches (previously the entire, unbounded list was passed at once).
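
For illustration, here's a minimal sketch of the batching idea, assuming a Hugging Face-style classifier; the function and argument names (`predict_batched`, `bib_entries`, `batch_size`) are hypothetical, not the actual mmda predictor API:

```python
from typing import List

import torch


def predict_batched(model, tokenizer, bib_entries: List[str], batch_size: int = 5) -> List[int]:
    """Run inference over bib entries in fixed-size batches to bound VRAM use."""
    predictions: List[int] = []
    for start in range(0, len(bib_entries), batch_size):
        # Only tokenize and move a small chunk to the GPU at a time.
        chunk = bib_entries[start:start + batch_size]
        inputs = tokenizer(chunk, padding=True, truncation=True, return_tensors="pt").to(model.device)
        with torch.no_grad():
            logits = model(**inputs).logits
        predictions.extend(logits.argmax(dim=-1).tolist())
    return predictions
```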

Context: our TIMO endpoint is running out of GPU memory very quickly with production traffic pointed at it. It is normal for these models to grow VRAM usage up to some plateau while they warm up, but the degree to which this was happening was excessive.

I pulled a paper with >100 references and ran it against a couple of ad hoc endpoints: one without bibentry batching, and one with the default batching implemented here (n=5). Each had a single worker. The latter showed far lower VRAM and GPU core utilization, and its inference times were twice as fast to boot. Moreover, the unbatched instance's VRAM plateaued at ~60% with just a single gunicorn worker, which explains why prod is having major issues with 2 workers.

No batching: (screenshot: Screen Shot 2023-07-12 at 3 37 52 PM)

With batching: (screenshot: Screen Shot 2023-07-12 at 3 38 02 PM)

geli-gel commented 12 months ago

Cool! Did you set up the ad hoc endpoints with timo-tools? I haven't tried that yet. Do the invocation stats go to Datadog? I'd guess you'll set the batch limit higher in the TIMO config; do you think that will be a guessing game, or is the ad hoc deployment similar enough to the real deployment that (I'm guessing) 10 or 15 per batch would be better?