Open FireballDWF opened 3 months ago
I am facing the same issue. I tried reducing the chunk_size to 10000, but I still get the same error after about 2 hours of training.
I am also getting this error.
I was able to fix the issue by reducing the chunk size and chunk overlap to 5000 and 1000, respectively. The 5000 is an assumption; I expect the model would still be created with anything below 10000 (someone above observed that the model would not get created with a chunk size of exactly 10000):
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=5000,     # assumption; anything well below 10000 seems to work
    chunk_overlap=1000,  # overlap for continuity across chunks
)
docs = text_splitter.split_documents(document)
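Since the failure only surfaces hours into training, it may help to screen chunks before submitting the job. This is a hypothetical pre-check, not part of the notebook: it estimates tokens from character length using Amazon's rough 6-chars-per-token figure (names like `estimate_tokens` and `safe_chunks` are my own):

```python
import math

# Titan's training limit reported in the error message
MAX_TOKENS = 4096
CHARS_PER_TOKEN = 6  # Amazon's stated approximation; real tokenizers vary

def estimate_tokens(text, chars_per_token=CHARS_PER_TOKEN):
    """Approximate the token count of a chunk from its character length."""
    return math.ceil(len(text) / chars_per_token)

def safe_chunks(chunks, max_tokens=MAX_TOKENS):
    """Keep only chunks whose estimated token count stays within the limit."""
    return [c for c in chunks if estimate_tokens(c) <= max_tokens]
```

You could then run `safe_chunks([d.page_content for d in docs])` and inspect anything that gets dropped before uploading the training data. Note the estimate is optimistic: a chunk can pass this check and still exceed 4096 real tokens if the content tokenizes densely.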
+1 to this issue. Now re-trying with @nmudkey000 fix (5000/1000).
Same problem here; even with a chunking strategy below the max, I get the same error.
Amazon says "...Use 6 characters per token as an approximation for the number of tokens"
4096 tokens * 6 chars per token = max chunk size of 24,576 characters
That means every chunk below 24,576 characters should work, but that is not the case.
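One possible explanation: 6 chars/token is only an average, and dense content (code, non-English text, unusual whitespace) can tokenize closer to 4 chars/token. Working backwards from the reported 4919-token failure, a chunk under 24,576 characters implies a lower ratio than 6. As a hedge, one could budget the character limit conservatively; the assumed ratio and margin below are guesses, not anything Amazon documents:

```python
# Token limit reported by the failing customization job
TOKEN_LIMIT = 4096

def safe_chunk_size(chars_per_token=4.0, margin=0.9):
    """Character budget that stays under the token limit with headroom.

    chars_per_token: assumed worst-case tokenization density (a guess).
    margin: extra headroom so borderline chunks still pass validation.
    """
    return int(TOKEN_LIMIT * chars_per_token * margin)
```

With these assumptions, `safe_chunk_size()` lands well under the 24,576-character figure derived from the 6-chars-per-token approximation, which is consistent with the observation that 5000/1000 works while larger settings fail.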
"Maximum input token count 4919 exceeds limit of 4096 for train data" in model-customization-job/amazon.titan-text-lite-v1:0:4k/nhjsh25oes0i in notebook 03_Model_customization/03_continued_pretraining_titan_text.ipynb