UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)
https://github.com/UglyToad/PdfPig/wiki
Apache License 2.0

Performance issue with some pdf files using streams API - workaround? #465

Closed unwork-ag closed 1 year ago

unwork-ag commented 2 years ago

I have started looking for alternatives to iText7 for text extraction, since iText has issues with some PDF documents that I need to handle which seem to use an uncommon (but still valid) encoding.

PdfPig can handle these files and overall produces pretty pleasing extraction results. However, when I run a benchmark using some of my example files, I see some significant outliers. The file I put [here](https://drive.google.com/file/d/1-NZfDcUJvbpVUzAb9buCtUs3MWsYUGVT/view?usp=sharing) takes about 4 ms in iText7 but >1 second with PdfPig. I'm using the NearestNeighbourWordExtractor - but the results are pretty similar with the DefaultWordExtractor. For other files I get pretty reasonable results (28 ms for 5 pages).

Any idea if I could configure the word extractor somehow to speed up the processing of this and similar files (using a Filter or FilterPivot delegate)?
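As an aside, filtering can also be applied to the letters before word extraction, independently of any extractor options. A minimal sketch of that idea (wordExtractor is a placeholder name; it assumes Letter.Value holds the letter's text and System.Linq is in scope):

// Sketch: drop letters that are not useful for indexing (here, whitespace-only glyphs)
// before handing them to the word extractor.
var letters = page.Letters
    .Where(l => !string.IsNullOrWhiteSpace(l.Value))
    .ToList();

var words = wordExtractor.GetWords(letters);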

BobLd commented 2 years ago

Hi @unwork-ag, thanks for the issue. Can you give us a rough idea of your pipeline? Any reason why you're using nearest neighbours instead of the default?

The 1 sec is indeed very odd, since your document does not contain many letters.

What kind of extraction method do you use in iText?

I'll have a deeper look tomorrow.

unwork-ag commented 2 years ago

Hi @BobLd - thanks for having a look at this. The pipeline for the actual usage is basically an ASP.NET Core service listening to messages about new files and then running various kinds of extraction, PDF extraction being one of them. The extracted words are added to a search index. So we have no intention of creating a readable text representation of the document; we just need the words to make the document searchable.

For benchmarking I used BenchmarkDotNet. The extraction code is essentially this:

using var pdfDocument = UglyToad.PdfPig.PdfDocument.Open(fileStream);
foreach (var page in pdfDocument.GetPages())
{
    var letters = page.Letters;
    var words = _wordExtractor.GetWords(letters);

    foreach (var word in words)
    {
        // could add word separator handling here
        FilterAndCaptureText(word.Text);
    }
}

FilterAndCaptureText removes stop words and ensures that no duplicates are added (we want to keep the search index lean). _wordExtractor is an instance of the DefaultWordExtractor or NearestNeighbourWordExtractor (both need >1 sec to extract the words from this document).
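FilterAndCaptureText is not shown in this thread; as a rough illustration of the behaviour described above (stop-word removal plus de-duplication), a minimal sketch could look like the following. All names and the stop-word list are placeholders, and it assumes System and System.Collections.Generic are in scope:

// Illustrative sketch only, not the actual implementation used in this service.
private static readonly HashSet<string> StopWords =
    new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "the", "and", "of" };

private readonly HashSet<string> _capturedTerms =
    new HashSet<string>(StringComparer.OrdinalIgnoreCase);

private void FilterAndCaptureText(string text)
{
    if (string.IsNullOrWhiteSpace(text) || StopWords.Contains(text))
    {
        return; // skip empty tokens and stop words
    }

    _capturedTerms.Add(text); // HashSet.Add ignores duplicates, keeping the index lean
}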

I looked at >20 PDF files that are typical for our domain and compared the output of various extraction libraries. With PdfPig and the default word extractor I sometimes got strange word results (words in reversed order, and subsections of a word in reverse order) that I didn't get with the NearestNeighbour extractor. I also like the extension possibilities of the NearestNeighbour extractor.

With iText the extraction looks like this:

using PdfReader reader = new PdfReader(fileStream);
using var document = new PdfDocument(reader);

for (int pageNum = 1; pageNum <= document.GetNumberOfPages(); pageNum++)
{
    var page = document.GetPage(pageNum);
    var pageText =
        iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(page, new CustomExtractionStrategy());

    FilterAndCaptureText(pageText);
}

The CustomExtractionStrategy is a slight modification of iText's original SimpleTextExtractionStrategy that adjusts the distance limit used to decide whether words are separate or not.

It would be nice to see whether there is a way to use PdfPig more efficiently (while still getting the same kind of extraction results) for the kind of documents we have.

BobLd commented 2 years ago

Hi @unwork-ag, thanks for the answer.

I ran some benchmarks on my side; the results are below.

We might not have the same config, but I get about 10 ms when opening the PDF and processing all the pages (and 5 ms when only processing the pages). So I'm not sure how you reach 1 second... Are you sure the time-consuming part is not the FilterAndCaptureText(word.Text) call?

// * Summary *

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.1766 (21H2)
AMD Ryzen 7 4800H with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.301
  [Host]     : .NET 6.0.6 (6.0.622.26707), X64 RyuJIT
  DefaultJob : .NET 6.0.6 (6.0.622.26707), X64 RyuJIT

|                    Method |     Mean |     Error |    StdDev |
|-------------------------- |---------:|----------:|----------:|
|        GetAndProcessPages | 4.723 ms | 0.0409 ms | 0.0383 ms |
| OpenDocGetAndProcessPages | 9.980 ms | 0.1696 ms | 0.1586 ms |

// * Legends *
  Mean   : Arithmetic mean of all measurements
  Error  : Half of 99.9% confidence interval
  StdDev : Standard deviation of all measurements
  1 ms   : 1 Millisecond (0.001 sec)

// ***** BenchmarkRunner: End *****
// ** Remained 0 benchmark(s) to run **
Run time: 00:00:33 (33.23 sec), executed benchmarks: 2

Global total time: 00:00:36 (36.57 sec), executed benchmarks: 2
// * Artifacts cleanup *

This is the code I used for the benchmark:

using BenchmarkDotNet.Attributes;
using UglyToad.PdfPig;
using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;

[...]

public class PdfPigBenchmark : IDisposable
{
    private readonly NearestNeighbourWordExtractor _wordExtractor;
    private readonly PdfDocument _pdfDocument;

    public PdfPigBenchmark()
    {
        _wordExtractor = new NearestNeighbourWordExtractor();
        _pdfDocument = PdfDocument.Open(@"C:\Users\Bob\Document Layout Analysis\002-0702.pdf");
    }

    [Benchmark]
    public string[] GetAndProcessPages()
    {
        var pages = new List<string>();
        foreach (var page in _pdfDocument.GetPages())
        {
            var letters = page.Letters;
            var words = _wordExtractor.GetWords(letters);
            pages.AddRange(words.Select(x => x.Text.ToLower()).Distinct());
        }
        return pages.Distinct().ToArray();
    }

    [Benchmark]
    public string[] OpenDocGetAndProcessPages()
    {
        using (var pdfDocument = PdfDocument.Open(@"002-0702.pdf"))
        {
            var pages = new List<string>();
            foreach (var page in pdfDocument.GetPages())
            {
                var letters = page.Letters;
                var words = _wordExtractor.GetWords(letters);
                pages.AddRange(words.Select(x => x.Text.ToLower()).Distinct());
            }
            return pages.Distinct().ToArray();
        }
    }

    public void Dispose()
    {
        _pdfDocument.Dispose();
    }
}

unwork-ag commented 2 years ago

Hi @BobLd - thanks for the quick response! That's interesting - and strange. The FilterAndCaptureText method is used for all extraction libraries that I benchmarked, and for iText7 and this file it has a mean of about 4 ms.

But in fact the benchmark as you have shown it also runs on my machine with a mean of 13 ms (for OpenDocGetAndProcessPages). I will have to dig a bit deeper into where this discrepancy comes from and will post an update once I understand it better.

unwork-ag commented 2 years ago

Ok - I found the difference: I was passing in a file stream and you are passing in a path. If you modify the benchmark method to

        [Benchmark]
        public string[] OpenDocGetAndProcessPages()
        {
            var path = @"C:\ExtractData\PDF\Reference\002-0702.pdf";

            using var fileStream = File.OpenRead(path);
            using (var pdfDocument = PdfDocument.Open(fileStream))
            {
                var pages = new List<string>();
                foreach (var page in pdfDocument.GetPages())
                {
                    var letters = page.Letters;
                    var words = _wordExtractor.GetWords(letters);
                    pages.AddRange(words.Select(x => x.Text.ToLower()).Distinct());
                }
                return pages.Distinct().ToArray();
            }
        }

you will also get mean values close to a second. This seems to be something specific to this file. I have three other files in the benchmark where there is no noticeable difference between passing in a path and passing in a stream. But for this one the difference is significant (and I have no idea why).

I can easily change my code to use a file path instead. But it might be interesting for you to look at this performance difference. Maybe you already have an idea ...

BobLd commented 2 years ago

@unwork-ag, thanks for the feedback. I confirm your findings below:

// * Summary *

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.1766 (21H2)
AMD Ryzen 7 4800H with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.301
  [Host]     : .NET 6.0.6 (6.0.622.26707), X64 RyuJIT
  DefaultJob : .NET 6.0.6 (6.0.622.26707), X64 RyuJIT

|                          Method |       Mean |     Error |    StdDev |    Gen 0 |    Gen 1 |    Gen 2 | Allocated |
|-------------------------------- |-----------:|----------:|----------:|---------:|---------:|---------:|----------:|
|       OpenDocGetAndProcessPages |   9.960 ms | 0.0698 ms | 0.0583 ms | 656.2500 | 328.1250 | 109.3750 |      3 MB |
| OpenStreamDocGetAndProcessPages | 437.257 ms | 5.1057 ms | 3.9862 ms |        - |        - |        - |      3 MB |

// * Hints *
Outliers
  PdfPigBenchmark.OpenDocGetAndProcessPages: Default       -> 2 outliers were removed (10.19 ms, 10.26 ms)
  PdfPigBenchmark.OpenStreamDocGetAndProcessPages: Default -> 3 outliers were removed (459.73 ms..479.00 ms)

// * Legends *
  Mean      : Arithmetic mean of all measurements
  Error     : Half of 99.9% confidence interval
  StdDev    : Standard deviation of all measurements
  Gen 0     : GC Generation 0 collects per 1000 operations
  Gen 1     : GC Generation 1 collects per 1000 operations
  Gen 2     : GC Generation 2 collects per 1000 operations
  Allocated : Allocated memory per single operation (managed only, inclusive, 1KB = 1024B)
  1 ms      : 1 Millisecond (0.001 sec)

// * Diagnostic Output - MemoryDiagnoser *

// ***** BenchmarkRunner: End *****
// ** Remained 0 benchmark(s) to run **
Run time: 00:00:30 (30.17 sec), executed benchmarks: 2

Global total time: 00:00:33 (33.5 sec), executed benchmarks: 2
// * Artifacts cleanup *

This is the code I used:

[MemoryDiagnoser] // we need to enable it explicitly
public class PdfPigBenchmark
{
    private readonly NearestNeighbourWordExtractor _wordExtractor;

    private const string _path = @"C:\Users\Bob\Document Layout Analysis\002-0702.pdf";

    public PdfPigBenchmark()
    {
        _wordExtractor = new NearestNeighbourWordExtractor();
    }

    [Benchmark]
    public string[] OpenDocGetAndProcessPages()
    {
        using (var pdfDocument = PdfDocument.Open(_path))
        {
            var pages = new List<string>();
            foreach (var page in pdfDocument.GetPages())
            {
                var letters = page.Letters;
                var words = _wordExtractor.GetWords(letters);
                pages.AddRange(words.Select(x => x.Text.ToLower()).Distinct());
            }
            return pages.Distinct().ToArray();
        }
    }

    [Benchmark]
    public string[] OpenStreamDocGetAndProcessPages()
    {
        using (var fileStream = File.OpenRead(_path))
        using (var pdfDocument = PdfDocument.Open(fileStream))
        {
            var pages = new List<string>();
            foreach (var page in pdfDocument.GetPages())
            {
                var letters = page.Letters;
                var words = _wordExtractor.GetWords(letters);
                pages.AddRange(words.Select(x => x.Text.ToLower()).Distinct());
            }
            return pages.Distinct().ToArray();
        }
    }
}

I think it would be worth changing the issue title too.

@EliotJones do you have any idea why this happens? I'll try to give it a look today

BobLd commented 2 years ago

[image attachment]

BobLd commented 2 years ago

My understanding is that the slowness comes from how StreamInputBytes works, especially the Peek(), Seek() and MoveNext() operations.

I guess this file in particular is much slower because PdfPig will use BruteForceSearcher, which relies heavily on Peek(), Seek() and MoveNext()...

@EliotJones: some optimisation could either be done in StreamInputBytes or in BruteForceSearcher... not sure what's best

There's a discussion related to that here: https://stackoverflow.com/questions/3998044/filestream-readbyte-inefficient-what-is-the-meaning-of-this
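To make the linked discussion concrete, here is a small, self-contained illustration (not PdfPig code; the path parameter is a placeholder) of the difference between touching a file one byte at a time through the Stream API and reading it into memory once:

using System.IO;

static class ByteAccessIllustration
{
    // Reads every byte through the Stream API; each call pays per-call overhead
    // (argument and position bookkeeping, occasional refills of the internal buffer).
    public static long SumViaReadByte(string path)
    {
        long sum = 0;
        using (var fs = File.OpenRead(path))
        {
            int b;
            while ((b = fs.ReadByte()) != -1)
            {
                sum += b;
            }
        }
        return sum;
    }

    // Reads the file into memory once and then works against the array,
    // which is roughly the access pattern ByteArrayInputBytes gives PdfPig.
    public static long SumViaByteArray(string path)
    {
        long sum = 0;
        foreach (var b in File.ReadAllBytes(path))
        {
            sum += b;
        }
        return sum;
    }
}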

One possible short-term solution for you is to convert your stream into a byte array. You'll get the same performance as directly opening the file (it will use ByteArrayInputBytes rather than StreamInputBytes).

One possible implementation is as follows:

        [Benchmark]
        public string[] OpenStreamToBytesDocGetAndProcessPages()
        {
            using (var fileStream = File.OpenRead(_path))
            {
                byte[] buffer = new byte[fileStream.Length];          // <-
                fileStream.Read(buffer, 0, (int)fileStream.Length);   // <- (note: Read may return fewer bytes than requested; a read loop or File.ReadAllBytes is safer)

                using (var pdfDocument = PdfDocument.Open(buffer))
                {
                    var pages = new List<string>();
                    foreach (var page in pdfDocument.GetPages())
                    {
                        [...]
                    }
                    return pages.Distinct().ToArray();
                }
            }
        }
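If the input is not a FileStream with a known Length (for example a message payload stream in a service), the same idea can be applied by copying the stream into memory first. A sketch along those lines, where inputStream is a placeholder for whatever stream the service receives:

using (var memory = new MemoryStream())
{
    inputStream.CopyTo(memory);

    // Opening from a byte array makes PdfPig use ByteArrayInputBytes instead of
    // StreamInputBytes, at the cost of holding the whole file in memory.
    using (var pdfDocument = PdfDocument.Open(memory.ToArray()))
    {
        // ... process pages as in the snippets above ...
    }
}
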
BobLd commented 2 years ago

@EliotJones, forcing BruteForceSearcher to use a ByteArrayInputBytes removes all the slowness...

This is what I did; I know it's very hackish, but you get the idea:

/// <summary>
/// Brute force search for all objects in the document.
/// </summary>
internal static class BruteForceSearcher
{
    private const int MinimumSearchOffset = 6;

    /// <summary>
    /// Find the offset of every object contained in the document by searching the entire document contents.
    /// </summary>
    /// <param name="bytes">The bytes of the document.</param>
    /// <returns>The object keys and offsets for the objects in this document.</returns>
    [NotNull]
    public static IReadOnlyDictionary<IndirectReference, long> GetObjectLocations(IInputBytes bytes)
    {
        if (bytes == null)
        {
            throw new ArgumentNullException(nameof(bytes));
        }

        // Using localBytes instead of bytes, and converting it to ByteArrayInputBytes
        IInputBytes localBytes = bytes;

        if (localBytes is StreamInputBytes)
        {
            byte[] buffer = new byte[localBytes.Length];
            localBytes.Read(buffer);
            localBytes = new ByteArrayInputBytes(buffer);
        }

        [...]

unwork-ag commented 2 years ago

Thanks @BobLd for the investigation. As I mentioned before, I can also use the Open method that takes a file path, which should resolve the performance issue for me. Nevertheless I will leave this issue open so you can track the stream-related performance issue.

EliotJones commented 2 years ago

Thanks for all the investigation here @BobLd!

In general the stream approach will be slower; stream support is provided for situations where the memory/speed tradeoff is worthwhile for the consumer, e.g. they don't want to load the whole file into memory and speed is less important. PDFBox uses random access files to improve stream performance somewhat. Past issues have discussed using a BufferedStream to improve performance, but it looks like FileStream wraps this natively. Basically the tl;dr is that it's a speed/memory tradeoff that I don't have any particular plans to fix permanently, but if someone is able to improve it in general, e.g. through Spans, I'm not opposed.
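For reference, the BufferedStream approach mentioned above would look roughly like the sketch below (illustrative only; the path variable and buffer size are placeholder values). Since FileStream already buffers internally this mainly matters for unbuffered stream sources, and frequent Seek calls can still invalidate the buffer, so it may not help much for the brute-force-search case in this issue.

using var fileStream = File.OpenRead(path);
using var buffered = new BufferedStream(fileStream, 1 << 20); // 1 MB buffer (example value)
using var pdfDocument = PdfDocument.Open(buffered);

foreach (var page in pdfDocument.GetPages())
{
    // ... extract words as in the snippets above ...
}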

unwork-ag commented 2 years ago

@EliotJones: I understand the tradeoff. The odd thing is that for the file in question here the stream-based approach is massively slower. For other PDFs (even with many more pages) the difference is negligible. Here's the benchmark for stream-based access for 4 different files (the PdfFileTag is built as number_pages_sizeinKb).

|          Method | PdfFileTag |        Mean |     Error |    StdDev |      Gen 0 |      Gen 1 |     Gen 2 | Allocated |
|---------------- |----------- |------------:|----------:|----------:|-----------:|-----------:|----------:|----------:|
| ExtractMetadata | 01_05_0011 |    28.15 ms |  1.644 ms |  4.716 ms |  1000.0000 |          - |         - |      8 MB |
| ExtractMetadata | 02_02_0108 |    15.33 ms |  0.299 ms |  0.428 ms |  1062.5000 |   625.0000 |  375.0000 |      6 MB |
| ExtractMetadata | 03_63_1538 |   756.08 ms | 14.706 ms | 22.011 ms | 46000.0000 | 23000.0000 | 7000.0000 |    264 MB |
| ExtractMetadata | 04_03_0211 | 1,108.98 ms | 21.901 ms | 33.446 ms |          - |          - |         - |      3 MB |

And this one for path-based access:

|          Method | PdfFileTag |      Mean |     Error |    StdDev |      Gen 0 |      Gen 1 |     Gen 2 | Allocated |
|---------------- |----------- |----------:|----------:|----------:|-----------:|-----------:|----------:|----------:|
| ExtractMetadata | 01_05_0011 |  30.30 ms |  1.574 ms |  4.640 ms |  1000.0000 |          - |         - |      8 MB |
| ExtractMetadata | 02_02_0108 |  15.90 ms |  0.312 ms |  0.457 ms |  1031.2500 |   562.5000 |  500.0000 |      6 MB |
| ExtractMetadata | 03_63_1538 | 797.49 ms | 15.939 ms | 37.881 ms | 46000.0000 | 25000.0000 | 7000.0000 |    264 MB |
| ExtractMetadata | 04_03_0211 |  13.97 ms |  0.273 ms |  0.559 ms |   531.2500 |   281.2500 |   93.7500 |      3 MB |

For file 3 (63 pages, 1538 kB) the performance is pretty similar. But for file 4, with 3 pages and 211 kB, stream-based access is almost 100 times slower!

Nevertheless I understand that this issue isn't a priority for you, and I can easily switch to the path-based interface.

EliotJones commented 1 year ago

Closing this since I'm trying to purge the backlog so I don't want to scream when I open the repo. The issue remains valid for investigation, but unfortunately I'm never going to have time to look into it.