Prompting with longer sequences requires sharding the model, which is currently not supported. However, you can generate much longer sequences, up to 500k tokens and beyond, on a single 80 GB GPU.
If you'd like to test the model with longer prompts, I recommend Together's API.
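For reference, a minimal sketch of prompting through Together's Python client (the package interface and the hosted model id here are assumptions; check Together's current docs):

import os
from together import Together  # pip install together

client = Together(api_key=os.environ["TOGETHER_API_KEY"])

# Hypothetical hosted model id; Together's catalog may list Evo under a different name.
response = client.completions.create(
    model="togethercomputer/evo-1-131k-base",
    prompt="ACGTACGTACGT",  # long DNA prompt goes here
    max_tokens=1024,
)
print(response.choices[0].text)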
Could you elaborate on how to generate 500k tokens on a single 80 GB GPU? I got OOM on an A100 with a 3 kb sequence. Thank you.
@pan-genome we were able to just use the standard HuggingFace sampling API (e.g., loading with AutoModelForCausalLM.from_pretrained(), sampling with model.generate()) to generate 500k+ tokens on an 80 GB GPU.
Could you provide a working code example? Thank you.
Something like:

import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_config = AutoConfig.from_pretrained(
    'togethercomputer/evo-1-131k-base',
    trust_remote_code=True,
    revision="1.1_fix",
)
# Raise the maximum sequence length so generation can run past 131k tokens.
model_config.max_seqlen = 500_000

model = AutoModelForCausalLM.from_pretrained(
    'togethercomputer/evo-1-131k-base',
    config=model_config,
    trust_remote_code=True,
    revision="1.1_fix",
    torch_dtype="auto",  # added: load in the checkpoint's native precision; fp32 will not fit
).to('cuda')

# Assumes the repo's tokenizer loads via AutoTokenizer; otherwise use the tokenizer from the Evo repo.
tokenizer = AutoTokenizer.from_pretrained(
    'togethercomputer/evo-1-131k-base',
    trust_remote_code=True,
    revision="1.1_fix",
)
input_ids = tokenizer('ACGT', return_tensors='pt').input_ids.to('cuda')  # short DNA prompt

outputs = model.generate(
    input_ids,
    max_new_tokens=500_000,
    do_sample=True,  # added: temperature/top_k only take effect when sampling
    temperature=1.,
    top_k=4,
)
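After generation, the sampled ids can be decoded back into a nucleotide string (a usage sketch, assuming the tokenizer loaded above):

sequence = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(len(sequence))
print(sequence[:100])  # first 100 generated bases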
May I ask what the proper range of input sequence lengths is for inference with the evo-1-131k-base model? I tried a single A100 and got CUDA Out of Memory when inputting a single sequence longer than 1,000 tokens. Thank you!