allenai / mmda

multimodal document analysis
Apache License 2.0

Memory fixes for bibentry predictor. #265

Closed · cmwilhelm closed this 12 months ago

cmwilhelm commented 12 months ago

Limits VRAM utilization by passing bib entries to the model in fixed-size batches (previously the entire, unbounded list was passed at once).
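
For illustration, here's a minimal sketch of the batching idea, assuming a Hugging Face-style classifier; the function and argument names (`predict_batched`, `bib_entries`, `batch_size`) are hypothetical, not the actual mmda predictor API:

```python
from typing import List

import torch


def predict_batched(model, tokenizer, bib_entries: List[str], batch_size: int = 5) -> List[int]:
    """Run inference over bib entries in fixed-size batches to bound VRAM use."""
    predictions: List[int] = []
    for start in range(0, len(bib_entries), batch_size):
        # Only tokenize and move a small chunk to the GPU at a time.
        chunk = bib_entries[start:start + batch_size]
        inputs = tokenizer(chunk, padding=True, truncation=True, return_tensors="pt").to(model.device)
        with torch.no_grad():
            logits = model(**inputs).logits
        predictions.extend(logits.argmax(dim=-1).tolist())
    return predictions
```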

Context: our TIMO endpoint is running out of GPU memory very quickly with production traffic pointed at it. It is normal for these models to grow VRAM usage up to some plateau while they warm up, but the degree to which this was happening was excessive.

I pulled a paper with >100 references and ran it against a couple of ad hoc endpoints: one without bibentry batching, and one with the default batching implemented here (n=5). Each had a single worker. The latter showed far lower VRAM and GPU core utilization, and its inference times were twice as fast to boot. Moreover, the unbatched instance's VRAM plateaued at ~60% with just a single gunicorn worker, which explains why prod is having major issues with 2 workers.

No batching: (screenshot: Screen Shot 2023-07-12 at 3 37 52 PM)

With batching: (screenshot: Screen Shot 2023-07-12 at 3 38 02 PM)

geli-gel commented 12 months ago

Cool! Did you set up the ad hoc endpoints with timo-tools? I haven't tried that yet. Do the invocation stats go to Datadog? I'd guess you'll set the batch limit higher in the TIMO config; do you think that will be a guessing game, or is the ad hoc deployment similar enough to the real deployment that (I'm guessing) 10 or 15 per batch would be better?