CodeWithKyrian / transformers-php

Transformers PHP is a toolkit for PHP developers to add machine learning magic to their projects easily.
https://codewithkyrian.github.io/transformers-php/
Apache License 2.0
291 stars 16 forks source link

PretrainedTokenizer::truncateHelper: prevent array_slice() error for flawed text input (summarization) #36

Closed k00ni closed 2 months ago

k00ni commented 2 months ago

What:

Description:

When running a summarization task, such as

$summarizer = pipeline('summarization', 'Xenova/distilbart-cnn-6-6');
$summary = $summarizer($text, maxNewTokens: 512, temperature: 0.7);

certain input may result in the following error in PretrainedTokenizer::truncateHelper:

array_slice(): Argument #1 ($array) must be of type array, null given

Bildschirmfoto vom 2024-05-16 17-10-12

My tests suggest that texts with lines, containing (almost?) no text, cause this problem (Example). Its not model-dependent.

Two tests were added to prove that the fix is working:

  1. SummarizationPipelineTest: Integration test which checks behavior using a real model and some extracted text from a PDF (original source: https://arxiv.org/abs/2309.06888). I think there is a better way to accomplish the same test result, because the test runs 10+ seconds

  2. PretrainedTokenizerTest: Unit test to check PretrainedTokenizer::truncateHelper itself. The input is flawed by design, which would trigger the error without the fix.


I marked the PR as draft because I would like to hear your opinion first.

CodeWithKyrian commented 2 months ago

Thank you so much for your contribution, @k00ni. I'm happy with the changes for the truncateHelper. However, I'm a bit undecided about the tests. While I appreciate the initiative to include tests with changes like this, I'm reluctant to accept them for now.

If you notice, there aren't many tests in the library, which I'm not proud of. This is because I want to take my time to decide on the best structure for the tests. Classes like Tensor and Image can be easily tested, and I included tests for the basic tokenizer because the config file sizes are relatively small. However, the overall testing structure of the library is still largely undecided.

For now, let's leave out the tests. Since you seem to have a keen interest in testing, I would really appreciate your suggestions on how best to structure tests for the project.

k00ni commented 2 months ago

I will reduce the PR to the fix and will write my feedback about the tests in a separate issue.

CodeWithKyrian commented 2 months ago

That's great. Looking forward to your suggestions