PretrainedTokenizer::truncateHelper: prevent array_slice() error for flawed text input (summarization)

k00ni commented 2 months ago

What:

[x] Bug Fix

Description:

When running a summarization task, such as

$summarizer = pipeline('summarization', 'Xenova/distilbart-cnn-6-6');
$summary = $summarizer($text, maxNewTokens: 512, temperature: 0.7);

certain input may result in the following error in PretrainedTokenizer::truncateHelper:

array_slice(): Argument #1 ($array) must be of type array, null given

Bildschirmfoto vom 2024-05-16 17-10-12

My tests suggest that texts with lines, containing (almost?) no text, cause this problem (Example). Its not model-dependent.

Two tests were added to prove that the fix is working:

SummarizationPipelineTest: Integration test which checks behavior using a real model and some extracted text from a PDF (original source: https://arxiv.org/abs/2309.06888). I think there is a better way to accomplish the same test result, because the test runs 10+ seconds
PretrainedTokenizerTest: Unit test to check PretrainedTokenizer::truncateHelper itself. The input is flawed by design, which would trigger the error without the fix.

I marked the PR as draft because I would like to hear your opinion first.

CodeWithKyrian commented 2 months ago

Thank you so much for your contribution, @k00ni. I'm happy with the changes for the truncateHelper. However, I'm a bit undecided about the tests. While I appreciate the initiative to include tests with changes like this, I'm reluctant to accept them for now.

If you notice, there aren't many tests in the library, which I'm not proud of. This is because I want to take my time to decide on the best structure for the tests. Classes like Tensor and Image can be easily tested, and I included tests for the basic tokenizer because the config file sizes are relatively small. However, the overall testing structure of the library is still largely undecided.

For now, let's leave out the tests. Since you seem to have a keen interest in testing, I would really appreciate your suggestions on how best to structure tests for the project.

k00ni commented 2 months ago

I will reduce the PR to the fix and will write my feedback about the tests in a separate issue.

CodeWithKyrian commented 2 months ago

That's great. Looking forward to your suggestions

CodeWithKyrian / transformers-php

PretrainedTokenizer::truncateHelper: prevent array_slice() error for flawed text input (summarization) #36

What:

Description: