bentoml / OpenLLM

Run any open-source LLM, such as Llama or Gemma, as an OpenAI-compatible API endpoint in the cloud.
https://bentoml.com
Apache License 2.0

bug: Tensors not on same device with `--no-binary` install #303

Closed · QLutz closed this issue 1 year ago

QLutz commented 1 year ago

Describe the bug

As recommended as a stopgap measure in issue #299, I installed OpenLLM with the `--no-binary` flag and tried to launch and query a LLaMA 13B model.

This resulted in a Torch error.

I understand this is secondary compared to issue #299. Please discard if the patch for that issue also fixes this.
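
For context, this class of error can be reproduced standalone in a few lines (a minimal sketch, not OpenLLM code; it assumes a CUDA device is available): an embedding layer whose weights live on cuda:0 is fed index tensors that were left on the CPU.

```
import torch
import torch.nn as nn

embed = nn.Embedding(num_embeddings=32000, embedding_dim=64).to("cuda")
input_ids = torch.as_tensor([[1, 2, 3]])  # torch.as_tensor defaults to the CPU

try:
    embed(input_ids)  # CPU indices vs. cuda:0 weights
except RuntimeError as e:
    print(e)  # Expected all tensors to be on the same device, ...

out = embed(input_ids.to(embed.weight.device))  # fix: move indices to the weights' device
print(out.shape)  # torch.Size([1, 3, 64])
```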

To reproduce

Installing OpenLLM

pip install -U --no-binary openllm-core "openllm[llama, vllm, fine-tune]"
pip install scipy
pip install protobuf==3.20.3 # needed to work around a common protobuf compilation-incompatibility error

Launching the service

openllm start llama --model huggyllama/llama-13b --debug

Querying the model (from another terminal)

openllm query "What is deep learning ?"
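
For completeness, the same query can also be sent over plain HTTP (a sketch: the /v1/generate path and the payload fields are inferred from the logs and tracebacks below, and port 3000 is an assumed default, so adjust to your setup):

```
import requests

# Path and JSON shape inferred from the service traceback
# (GenerationInput(prompt=..., llm_config=...)); port 3000 is an assumption.
resp = requests.post(
    "http://localhost:3000/v1/generate",
    json={"prompt": "What is deep learning ?", "llm_config": {}},
)
print(resp.status_code, resp.text)
```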

Logs

From the first terminal (where the model was launched):

2023-09-06T11:38:16+0000 [INFO] [runner:llm-llama-runner:1]  - "GET /readyz HTTP/1.1" 200 (trace=9d00197e5e9b5c00ce7a730f7cd9083e,span=e56f8391327279cb,sampled=1,service.name=llm-llama-runner)
2023-09-06T11:38:16+0000 [INFO] [runner:llm-llama-runner:1] _ (scheme=http,method=GET,path=/readyz,type=,length=) (status=200,type=text/plain; charset=utf-8,length=1) 0.638ms (trace=9d00197e5e9b5c00ce7a730f7cd9083e,span=1c975528a02988f7,sampled=1,service.name=llm-llama-runner)
2023-09-06T11:38:16+0000 [INFO] [api_server:29] 127.0.0.1:46644 - "GET /readyz HTTP/1.1" 200 (trace=9d00197e5e9b5c00ce7a730f7cd9083e,span=1e20d81925215734,sampled=1,service.name=llm-llama-service)
2023-09-06T11:38:16+0000 [INFO] [api_server:29] 127.0.0.1:46644 (scheme=http,method=GET,path=/readyz,type=,length=) (status=200,type=text/plain; charset=utf-8,length=1) 79.659ms (trace=9d00197e5e9b5c00ce7a730f7cd9083e,span=6767a7ec578eef9a,sampled=1,service.name=llm-llama-service)
2023-09-06T11:38:16+0000 [INFO] [runner:llm-llama-runner:1]  - "GET /readyz HTTP/1.1" 200 (trace=76f5c09186ad1bf26becd83971437d3c,span=1d3f82fb9d3c9aa0,sampled=1,service.name=llm-llama-runner)
2023-09-06T11:38:16+0000 [INFO] [runner:llm-llama-runner:1] _ (scheme=http,method=GET,path=/readyz,type=,length=) (status=200,type=text/plain; charset=utf-8,length=1) 0.422ms (trace=76f5c09186ad1bf26becd83971437d3c,span=870b58580b00b9f8,sampled=1,service.name=llm-llama-runner)
2023-09-06T11:38:16+0000 [INFO] [api_server:29] 127.0.0.1:46648 - "GET /readyz HTTP/1.1" 200 (trace=76f5c09186ad1bf26becd83971437d3c,span=f77657db2f5062f5,sampled=1,service.name=llm-llama-service)
2023-09-06T11:38:16+0000 [INFO] [api_server:29] 127.0.0.1:46648 (scheme=http,method=GET,path=/readyz,type=,length=) (status=200,type=text/plain; charset=utf-8,length=1) 2.945ms (trace=76f5c09186ad1bf26becd83971437d3c,span=959446bff37e8b99,sampled=1,service.name=llm-llama-service)
2023-09-06T11:38:16+0000 [INFO] [api_server:30] 127.0.0.1:46654 - "GET /docs.json HTTP/1.1" 200 (trace=57df2df3885999108733004f8e8d9f02,span=456ac29ce91bed6f,sampled=1,service.name=llm-llama-service)
2023-09-06T11:38:16+0000 [INFO] [api_server:30] 127.0.0.1:46654 (scheme=http,method=GET,path=/docs.json,type=,length=) (status=200,type=application/json,length=11034) 14.689ms (trace=57df2df3885999108733004f8e8d9f02,span=141a51bdcab81cbf,sampled=1,service.name=llm-llama-service)
2023-09-06T11:38:16+0000 [INFO] [api_server:30] 127.0.0.1:46664 - "POST /v1/metadata HTTP/1.1" 200 (trace=b05deaf686c0a32f1b08241cfd2dbcc7,span=d92c0a230b8fae10,sampled=1,service.name=llm-llama-service)
2023-09-06T11:38:16+0000 [INFO] [api_server:30] 127.0.0.1:46664 (scheme=http,method=POST,path=/v1/metadata,type=text/plain; charset=utf-8,length=0) (status=200,type=application/json,length=907) 2.542ms (trace=b05deaf686c0a32f1b08241cfd2dbcc7,span=23a1663e7f42ea11,sampled=1,service.name=llm-llama-service)
2023-09-06T11:38:16+0000 [INFO] [api_server:30] 127.0.0.1:46664 - "POST /v1/metadata HTTP/1.1" 200 (trace=d2d69c5af53bba442e26ac5ad983a07f,span=a088f3943117b389,sampled=1,service.name=llm-llama-service)
2023-09-06T11:38:16+0000 [INFO] [api_server:30] 127.0.0.1:46664 (scheme=http,method=POST,path=/v1/metadata,type=text/plain; charset=utf-8,length=0) (status=200,type=application/json,length=907) 0.927ms (trace=d2d69c5af53bba442e26ac5ad983a07f,span=5c6b6c748803f63b,sampled=1,service.name=llm-llama-service)
2023-09-06T11:38:16+0000 [INFO] [api_server:30] 127.0.0.1:46664 - "POST /v1/metadata HTTP/1.1" 200 (trace=3a7362ecad91005829ef300bf392c97c,span=037e3e619c588aa0,sampled=1,service.name=llm-llama-service)
2023-09-06T11:38:16+0000 [INFO] [api_server:30] 127.0.0.1:46664 (scheme=http,method=POST,path=/v1/metadata,type=text/plain; charset=utf-8,length=0) (status=200,type=application/json,length=907) 0.908ms (trace=3a7362ecad91005829ef300bf392c97c,span=6f6476058710b337,sampled=1,service.name=llm-llama-service)
2023-09-06T11:38:16+0000 [INFO] [api_server:30] 127.0.0.1:46664 - "POST /v1/metadata HTTP/1.1" 200 (trace=efc1c27684ee8f6dfcb4046c3c3fe2da,span=ccc6f6b1ac9e78e5,sampled=1,service.name=llm-llama-service)
2023-09-06T11:38:16+0000 [INFO] [api_server:30] 127.0.0.1:46664 (scheme=http,method=POST,path=/v1/metadata,type=text/plain; charset=utf-8,length=0) (status=200,type=application/json,length=907) 1.076ms (trace=efc1c27684ee8f6dfcb4046c3c3fe2da,span=a38eb0361f6979e0,sampled=1,service.name=llm-llama-service)
2023-09-06T11:38:16+0000 [INFO] [api_server:30] 127.0.0.1:46664 - "POST /v1/metadata HTTP/1.1" 200 (trace=b07fe0780a98ab7402526950745d5b1d,span=3ba5a16f6eacc14d,sampled=1,service.name=llm-llama-service)
2023-09-06T11:38:16+0000 [INFO] [api_server:30] 127.0.0.1:46664 (scheme=http,method=POST,path=/v1/metadata,type=text/plain; charset=utf-8,length=0) (status=200,type=application/json,length=907) 0.898ms (trace=b07fe0780a98ab7402526950745d5b1d,span=1bb9f49c56688f0a,sampled=1,service.name=llm-llama-service)
2023-09-06T11:38:16+0000 [DEBUG] [runner:llm-llama-runner:1] Starting dispatcher optimizer training... (trace=e7f29e8091c87e8c4ce71c7dc49946fe,span=8ec7bd686cbd6de6,sampled=1,service.name=llm-llama-runner)
2023-09-06T11:38:16+0000 [DEBUG] [runner:llm-llama-runner:1] Dynamic batching cork released, batch size: 1 (trace=e7f29e8091c87e8c4ce71c7dc49946fe,span=8ec7bd686cbd6de6,sampled=1,service.name=llm-llama-runner)
2023-09-06T11:38:16+0000 [INFO] [runner:llm-llama-runner:1]  - "POST /generate HTTP/1.1" 500
2023-09-06T11:38:16+0000 [ERROR] [runner:llm-llama-runner:1] Exception in ASGI application
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/ubuntu/.local/lib/python3.8/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/home/ubuntu/.local/lib/python3.8/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/bentoml/_internal/server/http/traffic.py", line 26, in __call__
    await self.app(scope, receive, send)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/opentelemetry/instrumentation/asgi/__init__.py", line 580, in __call__
    await self.app(scope, otel_receive, otel_send)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/bentoml/_internal/server/http/instruments.py", line 252, in __call__
    await self.app(scope, receive, wrapped_send)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/bentoml/_internal/server/http/access.py", line 126, in __call__
    await self.app(scope, receive, wrapped_send)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/home/ubuntu/.local/lib/python3.8/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/bentoml/_internal/server/runner_app.py", line 291, in _request_handler
    payload = await infer(params)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/bentoml/_internal/marshal/dispatcher.py", line 182, in _func
    raise r
  File "/home/ubuntu/.local/lib/python3.8/site-packages/bentoml/_internal/marshal/dispatcher.py", line 377, in outbound_call
    outputs = await self.callback(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/bentoml/_internal/server/runner_app.py", line 271, in infer_single
    ret = await runner_method.async_run(*params.args, **params.kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/bentoml/_internal/runner/runner.py", line 55, in async_run
    return await self.runner._runner_handle.async_run_method(self, *args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/bentoml/_internal/runner/runner_handle/local.py", line 62, in async_run_method
    return await anyio.to_thread.run_sync(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/home/ubuntu/.local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/bentoml/_internal/runner/runnable.py", line 143, in method
    return self.func(obj, *args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/openllm/_llm.py", line 1185, in generate
    return self.generate(prompt, **attrs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/openllm/_llm.py", line 943, in generate
    for it in self.generate_iterator(prompt, **attrs):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/openllm/_llm.py", line 981, in generate_iterator
    out = self.model(torch.as_tensor([input_ids]), use_cache=True)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 820, in forward
    outputs = self.model(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 662, in forward
    inputs_embeds = self.embed_tokens(input_ids)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)
2023-09-06T11:38:16+0000 [ERROR] [api_server:30] Exception on /v1/generate [POST] (trace=e7f29e8091c87e8c4ce71c7dc49946fe,span=533d6a8d3e4c547d,sampled=1,service.name=llm-llama-service)
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/bentoml/_internal/server/http_app.py", line 341, in api_func
    output = await api.func(*args)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/openllm/_service.py", line 50, in generate_v1
    responses = await runner.generate.async_run(qa_inputs.prompt, **{'adapter_name': qa_inputs.adapter_name, **config})
  File "/home/ubuntu/.local/lib/python3.8/site-packages/bentoml/_internal/runner/runner.py", line 55, in async_run
    return await self.runner._runner_handle.async_run_method(self, *args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/bentoml/_internal/runner/runner_handle/remote.py", line 242, in async_run_method
    raise RemoteException(
bentoml.exceptions.RemoteException: An unexpected exception occurred in remote runner llm-llama-runner: [500] Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/bentoml/_internal/server/http/traffic.py", line 26, in __call__
    await self.app(scope, receive, send)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/opentelemetry/instrumentation/asgi/__init__.py", line 580, in __call__
    await self.app(scope, otel_receive, otel_send)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/bentoml/_internal/server/http/instruments.py", line 252, in __call__
    await self.app(scope, receive, wrapped_send)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/bentoml/_internal/server/http/access.py", line 126, in __call__
    await self.app(scope, receive, wrapped_send)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/home/ubuntu/.local/lib/python3.8/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/bentoml/_internal/server/runner_app.py", line 291, in _request_handler
    payload = await infer(params)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/bentoml/_internal/marshal/dispatcher.py", line 182, in _func
    raise r
  File "/home/ubuntu/.local/lib/python3.8/site-packages/bentoml/_internal/marshal/dispatcher.py", line 377, in outbound_call
    outputs = await self.callback(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/bentoml/_internal/server/runner_app.py", line 271, in infer_single
    ret = await runner_method.async_run(*params.args, **params.kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/bentoml/_internal/runner/runner.py", line 55, in async_run
    return await self.runner._runner_handle.async_run_method(self, *args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/bentoml/_internal/runner/runner_handle/local.py", line 62, in async_run_method
    return await anyio.to_thread.run_sync(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/home/ubuntu/.local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/bentoml/_internal/runner/runnable.py", line 143, in method
    return self.func(obj, *args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/openllm/_llm.py", line 1185, in generate
    return self.generate(prompt, **attrs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/openllm/_llm.py", line 943, in generate
    for it in self.generate_iterator(prompt, **attrs):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/openllm/_llm.py", line 981, in generate_iterator
    out = self.model(torch.as_tensor([input_ids]), use_cache=True)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 820, in forward
    outputs = self.model(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 662, in forward
    inputs_embeds = self.embed_tokens(input_ids)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)

2023-09-06T11:38:16+0000 [INFO] [api_server:30] 127.0.0.1:46664 - "POST /v1/generate HTTP/1.1" 500 (trace=e7f29e8091c87e8c4ce71c7dc49946fe,span=50be2c0a17a3dd19,sampled=1,service.name=llm-llama-service)
2023-09-06T11:38:16+0000 [INFO] [api_server:30] 127.0.0.1:46664 (scheme=http,method=POST,path=/v1/generate,type=application/json,length=719) (status=500,type=application/json,length=2) 146.100ms (trace=e7f29e8091c87e8c4ce71c7dc49946fe,span=533d6a8d3e4c547d,sampled=1,service.name=llm-llama-service)

From the second terminal (where the model was queried):

==Input==

What is deep learning ?
Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/openllm", line 8, in <module>
    sys.exit(cli())
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/openllm/cli/entrypoint.py", line 189, in wrapper
    return_value = func(*args, **attrs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/openllm/cli/entrypoint.py", line 171, in wrapper
    return f(*args, **attrs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/openllm/cli/entrypoint.py", line 868, in query_command
    res = client.query(prompt, return_response='raw', **{**client.configuration, **_memoized})
  File "/home/ubuntu/.local/lib/python3.8/site-packages/openllm_client/_base.py", line 269, in query
    r = openllm_core.GenerationOutput(**self.call('generate', openllm_core.GenerationInput(prompt=prompt, llm_config=self.config.model_construct_env(**generate_kwargs)).model_dump()))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/openllm_client/_base.py", line 165, in call
    return self.inner.call(f'{api_name}_{self._api_version}', *args, **attrs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/openllm_client/benmin/__init__.py", line 43, in call
    return self._call(data, _inference_api=self.svc.apis[bentoml_api_name], **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/openllm_client/benmin/_http.py", line 104, in _call
    if resp.status_code != 200: raise ValueError(f'Error while making request: {resp.status_code}: {resp.content!s}')
ValueError: Error while making request: 500: b'""'
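
Both tracebacks bottom out at openllm/_llm.py line 981, where the input tensor is built by torch.as_tensor without a device argument, so it stays on the CPU while the model weights sit on cuda:0. A plausible one-line fix would be to pin the tensor to the model's device (a sketch only; the actual patch may differ):

```
# openllm/_llm.py:981, per the traceback (sketch of a possible fix):
# before
out = self.model(torch.as_tensor([input_ids]), use_cache=True)
# after: build the tensor directly on the model's device
out = self.model(torch.as_tensor([input_ids], device=self.model.device), use_cache=True)
```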

Environment

Environment variable

BENTOML_DEBUG=''
BENTOML_QUIET=''
BENTOML_BUNDLE_LOCAL_BUILD=''
BENTOML_DO_NOT_TRACK=''
BENTOML_CONFIG=''
BENTOML_CONFIG_OPTIONS=''
BENTOML_PORT=''
BENTOML_HOST=''
BENTOML_API_WORKERS=''

System information

bentoml: 1.1.5
python: 3.8.10
platform: Linux-5.15.0-67-generic-x86_64-with-glibc2.29
uid_gid: 1000:1000

pip_packages
``` absl-py==0.15.0 accelerate==0.22.0 aiofiles==22.1.0 aiohttp==3.8.5 aiosignal==1.3.1 aiosqlite==0.18.0 anyio==3.7.1 appdirs==1.4.3 argon2-cffi==21.3.0 argon2-cffi-bindings==21.2.0 arrow==1.2.3 asgiref==3.7.2 astunparse==1.6.2 async-timeout==4.0.3 atomicwrites==1.1.5 attrs==23.1.0 Automat==0.8.0 Babel==2.12.1 backcall==0.1.0 beautifulsoup4==4.8.2 bentoml==1.1.5 bitsandbytes==0.41.1 bleach==3.1.1 blinker==1.4 blosc==1.7.0 bottle==0.12.15 build==1.0.0 cachetools==4.0.0 caffe==1.0.0 cattrs==23.1.2 certifi==2019.11.28 cffi==1.14.0 chardet==3.0.4 charset-normalizer==3.1.0 circus==0.18.0 click==8.1.7 click-option-group==0.5.6 cloud-init==22.4.2 cloudpickle==2.2.1 cmake==3.27.4.1 colorama==0.4.3 coloredlogs==15.0.1 command-not-found==0.3 configobj==5.0.6 constantly==15.1.0 contextlib2==21.6.0 contourpy==1.0.7 cryptography==2.8 ctop==1.0.0 cuda-python==12.2.0 cycler==0.10.0 Cython==0.29.14 dask==2.8.1+dfsg datasets==2.14.4 dbus-python==1.2.16 decorator==4.4.2 deepmerge==1.1.0 defusedxml==0.6.0 Deprecated==1.2.14 dill==0.3.7 distlib==0.3.0 distro==1.4.0 distro-info===0.23ubuntu1 docker==4.1.0 entrypoints==0.3 et-xmlfile==1.0.1 exceptiongroup==1.1.3 fairscale==0.4.13 fastapi==0.103.1 fastcore==1.5.29 fastjsonschema==2.16.3 filelock==3.0.12 filetype==1.2.0 flake8==3.7.9 flatbuffers==1.12 fonttools==4.39.0 fqdn==1.5.1 frozenlist==1.4.0 fs==2.4.16 fsspec==2023.9.0 future==0.18.2 gast==0.4.0 ghapi==1.0.4 Glances==3.1.3 google-auth==1.5.1 google-auth-oauthlib==0.4.1 google-pasta==0.2.0 grpcio==1.57.0 h11==0.14.0 h5py==2.10.0 html5lib==1.0.1 htmlmin==0.1.12 httpcore==0.17.3 httplib2==0.14.0 httpx==0.24.1 huggingface-hub==0.16.4 humanfriendly==10.0 hyperlink==19.0.0 icdiff==1.9.5 idna==2.8 ImageHash==4.3.1 imageio==2.4.1 importlib-metadata==6.0.0 importlib-resources==5.12.0 incremental==16.10.1 inflection==0.5.1 influxdb==5.2.0 iotop==0.6 ipykernel==5.2.0 ipython==7.13.0 ipython_genutils==0.2.0 ipywidgets==8.0.4 isoduration==20.11.0 jdcal==1.0 jedi==0.15.2 Jinja2==3.1.2 joblib==1.2.0 json5==0.9.11 jsonpatch==1.22 jsonpointer==2.0 jsonschema==4.17.3 jupyter-console==6.0.0 jupyter-events==0.6.3 jupyter-ydoc==0.2.3 jupyter_client==8.0.3 jupyter_core==5.2.0 jupyter_server==2.4.0 jupyter_server_fileid==0.8.0 jupyter_server_terminals==0.4.4 jupyter_server_ydoc==0.6.1 jupyterlab==3.6.1 jupyterlab-pygments==0.2.2 jupyterlab-widgets==3.0.5 jupyterlab_server==2.20.0 kaptan==0.5.10 keras==2.11.0 keyring==18.0.1 kiwisolver==1.0.1 language-selector==0.1 launchpadlib==1.10.13 lazr.restfulclient==0.14.2 lazr.uri==1.0.3 libtmux==0.8.2 lit==16.0.6 locket==0.2.0 lxml==4.5.0 Mako==1.1.0 Markdown==3.1.1 markdown-it-py==3.0.0 MarkupSafe==2.1.2 matplotlib==3.6.3 mccabe==0.6.1 mdurl==0.1.2 mistune==2.0.5 more-itertools==4.2.0 mpi4py==3.0.3 mpmath==1.3.0 msgpack==1.0.5 multidict==6.0.4 multimethod==1.9.1 multiprocess==0.70.15 mypy-extensions==1.0.0 nbclassic==0.5.3 nbclient==0.7.2 nbconvert==7.2.9 nbformat==5.7.3 nest-asyncio==1.5.6 netifaces==0.10.4 networkx==2.4 ninja==1.11.1 nose==1.3.7 notebook==6.0.3 notebook_shim==0.2.2 numexpr==2.7.1 numpy==1.23.5 nvidia-cublas-cu11==11.10.3.66 nvidia-cuda-cupti-cu11==11.7.101 nvidia-cuda-nvrtc-cu11==11.7.99 nvidia-cuda-runtime-cu11==11.7.99 nvidia-cudnn-cu11==8.5.0.96 nvidia-cufft-cu11==10.9.0.58 nvidia-curand-cu11==10.2.10.91 nvidia-cusolver-cu11==11.4.0.1 nvidia-cusparse-cu11==11.7.4.91 nvidia-ml-py==7.352.0 nvidia-nccl-cu11==2.14.3 nvidia-nvtx-cu11==11.7.91 oauthlib==3.1.0 olefile==0.46 openllm==0.3.0 openllm-client==0.3.0 openllm-core==0.3.0 openpyxl==3.0.3 opentelemetry-api==1.18.0 
opentelemetry-instrumentation==0.39b0 opentelemetry-instrumentation-aiohttp-client==0.39b0 opentelemetry-instrumentation-asgi==0.39b0 opentelemetry-sdk==1.18.0 opentelemetry-semantic-conventions==0.39b0 opentelemetry-util-http==0.39b0 opt-einsum==3.3.0 optimum==1.12.0 orjson==3.9.5 packaging==23.0 pandas==1.5.3 pandas-profiling==3.6.6 pandocfilters==1.4.2 parameterized==0.7.0 parso==0.5.2 partd==1.0.0 pathspec==0.11.2 patsy==0.5.3 peft==0.5.0 pexpect==4.6.0 phik==0.12.3 pickleshare==0.7.5 Pillow==7.0.0 pip-requirements-parser==32.0.1 pip-tools==7.3.0 pkgutil_resolve_name==1.3.10 platformdirs==3.1.1 pluggy==0.13.0 ply==3.11 prometheus-client==0.17.1 prompt-toolkit==2.0.10 protobuf==3.20.3 psutil==5.5.1 ptyprocess==0.7.0 py==1.8.1 pyarrow==13.0.0 pyasn1==0.4.2 pyasn1-modules==0.2.1 pycodestyle==2.5.0 pycparser==2.19 pycryptodomex==3.6.1 pycuda==2019.1.2 pydantic==1.10.6 pydot==1.4.1 pyflakes==2.1.1 Pygments==2.14.0 PyGObject==3.36.0 pygpu==0.7.6 PyHamcrest==1.9.0 pyinotify==0.9.6 PyJWT==1.7.1 pymacaroons==0.13.0 PyNaCl==1.3.0 pynvml==11.5.0 pyOpenSSL==19.0.0 pyparsing==2.4.6 pyproject_hooks==1.0.0 pyrsistent==0.15.5 pyserial==3.4 pysmi==0.3.2 pysnmp==4.4.6 pystache==0.5.4 pytest==4.6.9 python-apt==2.0.1 python-dateutil==2.8.2 python-debian===0.1.36ubuntu1 python-json-logger==2.0.7 python-multipart==0.0.6 pytools==2019.1.1 pytz==2022.7.1 PyWavelets==0.5.1 PyYAML==5.3.1 pyzmq==25.0.1 ray==2.6.3 regex==2023.8.8 requests==2.28.2 requests-oauthlib==1.0.0 requests-unixsocket==0.2.0 rfc3339-validator==0.1.4 rfc3986-validator==0.1.1 rich==13.5.2 rsa==4.0 safetensors==0.3.3 schema==0.7.5 scikit-cuda==0.5.3 scikit-image==0.16.2 scikit-learn==0.22.2.post1 scipy==1.9.3 seaborn==0.12.2 SecretStorage==2.3.1 Send2Trash==1.8.0 sentencepiece==0.1.99 service-identity==18.1.0 simple-di==0.1.5 simplejson==3.16.0 six==1.14.0 sniffio==1.3.0 sos==4.4 soupsieve==1.9.5 ssh-import-id==5.10 starlette==0.27.0 statsmodels==0.13.5 sympy==1.12 systemd-python==234 tables==3.6.1 tabulate==0.9.0 tangled-up-in-unicode==0.2.0 tensorboard==2.11.0 tensorflow-estimator==2.11.0 tensorflow-gpu==2.11.0 termcolor==1.1.0 terminado==0.17.1 testpath==0.4.4 Theano==1.0.4 tinycss2==1.2.1 tmuxp==1.5.4 tokenizers==0.13.3 tomli==2.0.1 toolz==0.9.0 torch==2.0.1 torchvision==0.14.1 tornado==6.2 tqdm==4.64.1 traitlets==5.9.0 transformers==4.33.0 triton==2.0.0 trl==0.7.1 Twisted==18.9.0 typeguard==2.13.3 typing_extensions==4.5.0 ubuntu-advantage-tools==8001 ufw==0.36 unattended-upgrades==0.1 uri-template==1.2.0 urllib3==1.25.8 uvicorn==0.23.2 virtualenv==20.0.17 visions==0.7.5 vllm==0.1.4 wadllib==1.3.3 watchfiles==0.20.0 wcwidth==0.1.8 webcolors==1.12 webencodings==0.5.1 websocket-client==0.53.0 Werkzeug==0.16.1 widgetsnbextension==4.0.5 wrapt==1.11.2 xformers==0.0.21 xlrd==1.1.0 xlwt==1.3.0 xxhash==3.3.0 y-py==0.5.9 yarl==1.9.2 ydata-profiling==4.1.0 ypy-websocket==0.8.2 zipp==3.15.0 zope.interface==4.7.1 ```

System information (Optional)

A100 40GB SXM4 instance from LambdaLabs

aarnphm commented 1 year ago

I'm releasing 0.3.2 to fix this issue.

aarnphm commented 1 year ago

0.3.2 is out; please try again.
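
To confirm the upgrade took effect, the installed version can be checked from Python (a quick sketch using the standard library):

```
from importlib.metadata import version

print(version("openllm"))  # expect 0.3.2 or newer
```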