aws-samples / amazon-bedrock-workshop

This is a workshop designed for Amazon Bedrock, a foundation model service.
https://catalog.us-east-1.prod.workshops.aws/workshops/a4bdb007-5600-4368-81c5-ff5b4154f518/en-US/20-intro
MIT No Attribution

Maximum input token count 4919 exceeds limit of 4096 for train data in 03_Model_customization/03_continued_pretraining_titan_text.ipynb #224

Open FireballDWF opened 3 months ago

FireballDWF commented 3 months ago

"Maximum input token count 4919 exceeds limit of 4096 for train data" in model-customization-job/amazon.titan-text-lite-v1:0:4k/nhjsh25oes0i in notebook 03_Model_customization/03_continued_pretraining_titan_text.ipynb

HiDhineshRaja commented 3 months ago

I am facing the same issue. I tried reducing the chunk size to 10000, but I still get the same error after about 2 hours of training.

jicowan commented 2 months ago

I am also getting this error.

nmudkey000 commented 1 month ago

I was able to fix the issue by reducing the chunk size and chunk overlap to 5000 and 1000, respectively. The 5000 value was an assumption; I am sure the model would still get created with anything below 10000 (someone above observed that the model would not get created with a chunk size of 10000):

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size so each chunk stays well under the token limit.
    chunk_size=5000,      # assumption
    chunk_overlap=1000,   # overlap for continuity across chunks
)

docs = text_splitter.split_documents(document)
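
For what it's worth, the arithmetic also supports 5000/1000: at roughly 4 to 6 characters per token, a 5000-character chunk works out to only about 850 to 1250 tokens, comfortably under the 4096-token limit, so there is headroom even when the text tokenizes less efficiently than the 6-characters-per-token approximation.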

jimmus69 commented 3 weeks ago

+1 to this issue. Now retrying with @nmudkey000's fix (5000/1000).

jgtavarez commented 1 week ago

Same problem here. Even using a chunking strategy that stays below the max, I get the same error.

Amazon says "...Use 6 characters per token as an approximation for the number of tokens"

4096 tokens * 6 chars per token = 24,576 characters max chunk size

That means every chunk below 24,576 characters should work, but that is not the case.
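
A likely explanation is that the 6-characters-per-token figure is only an approximation, and an optimistic one for typical English prose, which often tokenizes closer to 3 to 4 characters per token, so a chunk under 24,576 characters can still exceed 4096 actual tokens. A rough illustration of that gap, where the 24,000-character chunk size is purely hypothetical:

MAX_TOKENS = 4096
chunk_chars = 24_000  # hypothetical chunk just under the 24,576-character estimate

# Estimate the token count under different characters-per-token averages.
for chars_per_token in (6, 4, 3):
    est_tokens = chunk_chars / chars_per_token
    status = "OK" if est_tokens <= MAX_TOKENS else "exceeds 4096"
    print(f"{chars_per_token} chars/token -> ~{est_tokens:.0f} tokens ({status})")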