empira / PDFsharp

PDFsharp and MigraDoc Foundation for .NET 6 and .NET Framework
https://docs.pdfsharp.net/

Fix slowness in decoding hexadecimal string #89

Closed ranvis closed 6 months ago

ranvis commented 7 months ago

Char methods are Unicode compliant, and Char.ToUpper() is really slow, probably because of that. In my tests, English-text PDFs aren't visibly affected, but Japanese ones are.

The PR removes the Char.ToUpper() calls, since NumberStyles.AllowHexSpecifier already accepts lowercase hex digits.
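For reference, a minimal standalone sketch (not PDFsharp code) showing that parsing with NumberStyles.AllowHexSpecifier accepts lowercase digits, so uppercasing the input first buys nothing:

```csharp
using System;
using System.Globalization;

// AllowHexSpecifier accepts both lowercase and uppercase hex digits,
// so a Char.ToUpper() pass over the input is redundant.
int lower = int.Parse("3f", NumberStyles.AllowHexSpecifier);
int upper = int.Parse("3F", NumberStyles.AllowHexSpecifier);
Console.WriteLine($"{lower} {upper}"); // 63 63
```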

ThomasHoevel commented 7 months ago

Char methods are Unicode compliant, and Char.ToUpper() is really slow, probably because of that. In my tests, English-text PDFs aren't visibly affected, but Japanese ones are.

Are there test results that show how Japanese files are affected? The code only executes for hex numbers surrounded by "<" and ">", and if it encounters Japanese characters there, the file is corrupted. This looks like fixing non-existent slowness.

ranvis commented 7 months ago

Any file should work, so here I just grabbed a random PDF: https://www.nhk.or.jp/lesson/ja/pdf/textbook_appendix.pdf (1.1 MB)
SHA1(textbook_appendix.pdf) = 033de2fc706f39119d43dac0adbd2c64e48abb60

using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;
using System.Diagnostics;

Stopwatch stopwatch = new Stopwatch();
stopwatch.Start();
using (PdfDocument doc = PdfReader.Open("textbook_appendix.pdf", PdfDocumentOpenMode.ReadOnly)) {
    stopwatch.Stop();
    Console.WriteLine($"Time: {stopwatch.ElapsedMilliseconds}ms");
}

original: Time: 364ms
patched: Time: 72ms

There could be a runtime-culture dependency. But I don't think my runtime culture, Japanese, has one, as no special rules for it are defined in the UCD SpecialCasing file.
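As a side note, culture sensitivity in Char.ToUpper(char) is real in general, even if Japanese defines no special casing rules; a sketch with the classic Turkish example:

```csharp
using System;
using System.Globalization;

// Char.ToUpper(char) consults the current culture; the result can differ
// between cultures. The classic case is Turkish dotted/dotless I.
char invariant = char.ToUpperInvariant('i');                 // 'I' (U+0049)
char turkish = char.ToUpper('i', new CultureInfo("tr-TR"));  // 'İ' (U+0130)
Console.WriteLine($"{(int)invariant:X4} {(int)turkish:X4}"); // 0049 0130
```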

ranvis commented 7 months ago

if it encounters Japanese characters there

Characters inside <> in a PDF should all be valid hex characters. The point is that Char.ToUpper() is Unicode compliant: the method doesn't know whether its input is ASCII hex, so it must check every character against the Unicode case-conversion rules.
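The kind of ASCII-only conversion that sidesteps Char.ToUpper() entirely can be sketched like this (a hypothetical HexValue helper, not the PR's exact code):

```csharp
using System;

// Hypothetical helper: maps an ASCII hex digit to its numeric value,
// or -1 for anything else. Three range checks, no Unicode tables.
static int HexValue(char ch) => ch switch
{
    >= '0' and <= '9' => ch - '0',
    >= 'a' and <= 'f' => ch - 'a' + 10,
    >= 'A' and <= 'F' => ch - 'A' + 10,
    _ => -1,
};

Console.WriteLine(HexValue('b')); // 11
Console.WriteLine(HexValue('B')); // 11
Console.WriteLine(HexValue('g')); // -1
```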

ThomasHoevel commented 7 months ago

The method doesn't know whether its input is ASCII hex, so it must check every character against the Unicode case-conversion rules.

The code of Char.ToUpper is optimized for characters from the Latin-1 base set. But yes, if it is not inlined, the extra jumps will cost some time. The difference is much bigger than I expected, even though no Japanese characters should be encountered at the locations you edited.

I'll discuss this with Stefan. Char.IsLetterOrDigit could also be a bit slow. Thanks for your feedback.
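On Char.IsLetterOrDigit: the framework also has Uri.IsHexDigit, an ASCII-only check that fits hex scanning better (a sketch; whether PDFsharp ended up using it is another matter):

```csharp
using System;

// Uri.IsHexDigit accepts only '0'-'9', 'a'-'f', 'A'-'F'; no Unicode tables.
Console.WriteLine(Uri.IsHexDigit('f')); // True
Console.WriteLine(Uri.IsHexDigit('G')); // False

// Char.IsLetterOrDigit, in contrast, is true for any Unicode letter or digit:
Console.WriteLine(char.IsLetterOrDigit('あ')); // True
```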

ThomasHoevel commented 7 months ago

I called PdfReader.Open 10 times in a loop and ran the program twice. The results for the original code:

Time: 110,3479ms
Time: 27,3819ms
Time: 18,089399999999998ms
Time: 20,707ms
Time: 19,0002ms
Time: 22,9284ms
Time: 24,0744ms
Time: 16,3004ms
Time: 21,814400000000003ms
Time: 15,713899999999999ms

Time: 90,4033ms
Time: 26,3759ms
Time: 17,6604ms
Time: 20,6886ms
Time: 19,09ms
Time: 23,1262ms
Time: 25,128ms
Time: 16,6671ms
Time: 29,165ms
Time: 15,5417ms

The results for the patched code without Char.ToUpper and without Char.IsLetterOrDigit:

Time: 85,71130000000001ms
Time: 26,4195ms
Time: 18,062600000000003ms
Time: 22,696399999999997ms
Time: 22,0221ms
Time: 23,6846ms
Time: 25,331200000000003ms
Time: 16,683300000000003ms
Time: 28,939300000000003ms
Time: 15,5975ms

Time: 81,1105ms
Time: 26,153200000000002ms
Time: 18,0993ms
Time: 22,8924ms
Time: 19,1608ms
Time: 22,9752ms
Time: 24,875999999999998ms
Time: 15,945ms
Time: 33,0444ms
Time: 15,2045ms

This clearly shows the power of the JIT compiler. The effect of the patch is rather small.

My code:

for (int x = 0; x < 10; ++x)
{
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();
    using (PdfDocument doc = PdfReader.Open(@"textbook_appendix.pdf", PdfDocumentOpenMode.ReadOnly))
    {
        stopwatch.Stop();
        Console.WriteLine($"Time: {stopwatch.Elapsed.TotalSeconds * 1000d}ms");
    }
}
ranvis commented 7 months ago

Thank you for the investigation. Yes, I got a similar result:

Time: 380ms
Time: 22ms
Time: 21ms
Time: 16ms
Time: 24ms
Time: 20ms
Time: 25ms
Time: 16ms
Time: 21ms
Time: 17ms

Time: 80ms
Time: 26ms
Time: 19ms
Time: 16ms
Time: 26ms
Time: 25ms
Time: 25ms
Time: 16ms
Time: 20ms
Time: 16ms

From the second iteration on, I don't really see a difference. I'm still not sure why the first iteration is so slow on my PC...

ThomasHoevel commented 7 months ago

The first call includes the time for demand-loading the DLL files. Maybe fewer DLLs will be loaded without Char.ToUpper?

I added a call to Char.ToUpper before starting the Stopwatch, and I think it makes a difference.

for (int x = 0; x < 10; ++x)
{
    var ch = Char.ToUpper('x');
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();
    using (PdfDocument doc = PdfReader.Open(@"\textbook_appendix.pdf", PdfDocumentOpenMode.ReadOnly))
    {
        stopwatch.Stop();
        Console.WriteLine($"Time: {stopwatch.Elapsed.TotalSeconds * 1000d}ms");
    }
}
ThomasHoevel commented 7 months ago

I tried this loop:

for (int x = 0; x < 10; ++x)
{
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();
    var ch = Char.ToUpper('x');
    stopwatch.Stop();
    Console.WriteLine($"Time: {stopwatch.Elapsed.TotalSeconds * 1000d}ms");
}

And here are the results:

Time: 13,8811ms
Time: 0,0017000000000000001ms
Time: 0,00039999999999999996ms
Time: 0,0003ms
Time: 0,0003ms
Time: 0,0003ms
Time: 0,00019999999999999998ms
Time: 0,0003ms
Time: 0,0003ms
Time: 0,0003ms

Maybe the first execution takes even longer on your computer.

ranvis commented 7 months ago

Awesome. That was it. 🤦‍♂️

Time: 305.635ms
Time: 0.0019ms
Time: 0.0009ms
Time: 0.0011ms
Time: 0.0007999999999999999ms
Time: 0.0007ms
Time: 0.0007ms
Time: 0.0007999999999999999ms
Time: 0.0009ms
Time: 0.0007999999999999999ms

StLange commented 6 months ago

Fixed in 6.1 preview 2.