Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development
https://llama2-accessory.readthedocs.io/

quick fix of demo memory consumption #13

Closed. linziyi96 closed this 1 year ago.

linziyi96 commented 1 year ago

We probably need this quick fix because the demo memory issue is more serious than I initially thought: peak memory consumption turns out to be 3x the half-precision model size (instead of 2x as mentioned in #4), presumably the fp32 weights allocated at model creation (2x) plus the converted fp16/bf16 copy (1x). As a result, running the 70B demo would require an 8×A100/H100 80GB machine, which is a rather severe limitation.

This PR applies an imperfect yet quick fix: it now creates the model on the CPU first, converts it to fp16/bf16, and then moves it to the GPU. The fix primarily aims for minimal side effects so that it can land ASAP: only the demo itself is changed. The downside is higher host memory consumption and slower startup, but the demo now runs on machines with much less GPU memory (e.g., 8×3090 24GB). The load path is sketched below.
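A minimal sketch of the pattern, assuming a standard PyTorch module; names like `load_demo_model`, `build_model`, and `ckpt_path` are illustrative placeholders, not the actual demo code:

```python
import torch

def load_demo_model(build_model, ckpt_path, dtype=torch.float16, device="cuda"):
    # 1. Build the model on the CPU, so the transient fp32 weights
    #    consume host memory instead of GPU memory.
    model = build_model()
    # 2. Cast the weights to fp16/bf16 while still in host memory.
    model = model.to(dtype)
    # 3. Load the checkpoint onto the CPU as well.
    state = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(state, strict=False)
    # 4. Only the half-precision copy is ever moved to the GPU, so peak
    #    GPU memory stays around 1x the fp16 model size.
    return model.to(device)
```

The trade-off is visible in step 1: the fp32 allocation now lands in host RAM, which is why startup is slower and host memory usage is higher.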

I'm planning a refactor of the tensor dtype/device management that will resolve this issue in a more elegant way, but it will probably take more time and need thorough testing to avoid breaking existing functionality.