Shivanandroy / KeyPhraseTransformer

KeyPhraseTransformer lets you quickly extract key phrases, topics, and themes from your text data with a T5 transformer | Keyphrase extraction | Keyword extraction
MIT License

Does it work for very long documents? #9

Open VioletRaven opened 1 year ago

VioletRaven commented 1 year ago

Hello there, I am trying to make it work with the "mt5" model type because I want to use it on an Italian dataset. Unfortunately, all the documents are longer than the maximum length supported by the model, so I tried passing truncation=True and max_length=512 inside split_into_paragraphs(), at the line wc_temp = len(self.tokenizer.tokenize(temp, max_length=512, truncation=True)). This does not work; I still get the following warning: Token indices sequence length is longer than the specified maximum sequence length for this model (6508 > 512). Running this sequence through the model will result in indexing errors.
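One workaround that might apply here is to split each document into chunks that fit the token budget before it reaches the model, instead of relying on truncation inside the counting call. The sketch below is not the library's own code. The names split_by_token_budget, count_tokens, and whitespace_count are hypothetical, and the whitespace counter is only a stand-in so the example runs without downloading a model.

```python
import re
from typing import Callable, List


def split_by_token_budget(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = 512,
) -> List[str]:
    """Greedily pack sentences into chunks of at most max_tokens tokens.

    count_tokens is a hypothetical hook: any callable that returns the
    token count of a string under the tokenizer you actually use.
    """
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if count_tokens(candidate) <= max_tokens:
            # Still within budget: keep growing the current chunk.
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single sentence that exceeds the budget becomes its own
            # chunk; it would still need truncation downstream.
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def whitespace_count(s: str) -> int:
    """Stand-in token counter (assumption): counts whitespace-separated words."""
    return len(s.split())


if __name__ == "__main__":
    sample = "One two three. Four five six. Seven eight nine."
    for chunk in split_by_token_budget(sample, whitespace_count, max_tokens=6):
        print(chunk)
```

With a real Hugging Face tokenizer, the counter could presumably be something like lambda s: len(tokenizer.tokenize(s)), and each chunk would then be passed to the extractor separately. Whether this fits cleanly into split_into_paragraphs() is untested.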

Have you already found the solution to this problem?

Thank you in advance!