Workers should keep a running tally of the token size of the jobs they have submitted (estimated at 3-4 characters per token). When a newly popped job would push the tally over the configured limit, the worker should queue that job and stop popping until enough room has freed up to submit the queued job to the backend.
This lets the user specify the approximate total kv-cache size of the aphrodite instance, so that as few jobs as possible are popped from the horde and left queued in aphrodite.
- 3 or 4 characters per token?
- Special character handling?
- UI display(s) / log messages?
- Does the requested generation size need to count toward the total window?
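
The tally described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the worker's actual code: `Job`, `TokenBudget`, and every method and parameter name are hypothetical, the characters-per-token ratio is left configurable because the 3 vs 4 question is still open, and counting the requested generation length is a toggle for the same reason. One extra assumption: a job larger than the whole limit is still submitted when the backend is idle, so the worker cannot stall forever.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    # Hypothetical job shape: prompt text plus requested generation length.
    prompt: str
    max_new_tokens: int = 0


class TokenBudget:
    """Running tally of estimated tokens currently submitted to the backend."""

    def __init__(self, kv_cache_tokens: int, chars_per_token: float = 3.5,
                 count_generation: bool = True):
        self.limit = kv_cache_tokens          # user-configured approximate kv-cache size
        self.chars_per_token = chars_per_token
        self.count_generation = count_generation
        self.in_flight = 0                    # estimated tokens on the backend
        self.held: Optional[Job] = None       # the one job waiting for room

    def estimate(self, job: Job) -> int:
        # Rough estimate: prompt characters divided by the assumed ratio.
        tokens = math.ceil(len(job.prompt) / self.chars_per_token)
        if self.count_generation:
            tokens += job.max_new_tokens
        return tokens

    def can_pop(self) -> bool:
        # Stop popping from the horde while a job is being held back.
        return self.held is None

    def offer(self, job: Job) -> bool:
        """Return True if the job may be submitted now; otherwise hold it."""
        cost = self.estimate(job)
        # Assumption: an oversized job is allowed through when nothing is running.
        if self.in_flight > 0 and self.in_flight + cost > self.limit:
            self.held = job
            return False
        self.in_flight += cost
        return True

    def finish(self, job: Job) -> Optional[Job]:
        """Release a finished job's tokens; return the held job if it now fits."""
        self.in_flight -= self.estimate(job)
        if self.held is not None:
            cost = self.estimate(self.held)
            if self.in_flight == 0 or self.in_flight + cost <= self.limit:
                released, self.held = self.held, None
                self.in_flight += cost
                return released
        return None
```

In a worker loop, `can_pop()` would gate each request to the horde, `offer()` would be called on every popped job, and `finish()` would be called when the backend returns a result, submitting whatever job it hands back.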