Closed ranvis closed 6 months ago
Char methods are Unicode compliant, and Char.ToUpper() is really slow, probably because of it. In my test, while English-text PDFs aren't visibly affected, Japanese ones are.
Are there test results that show how Japanese files are affected? The code only executes for hex numbers surrounded by "<" and ">", and if it encounters Japanese characters there, the file is corrupted. This looks like fixing a non-existent slowness.
Any file should work, so here I just grabbed a random PDF: https://www.nhk.or.jp/lesson/ja/pdf/textbook_appendix.pdf (1.1 MB)
SHA1(textbook_appendix.pdf)= 033de2fc706f39119d43dac0adbd2c64e48abb60
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;
using System.Diagnostics;

Stopwatch stopwatch = new Stopwatch();
stopwatch.Start();
using (PdfDocument doc = PdfReader.Open("textbook_appendix.pdf", PdfDocumentOpenMode.ReadOnly))
{
    stopwatch.Stop();
    Console.WriteLine($"Time: {stopwatch.ElapsedMilliseconds}ms");
}
original: Time: 364ms
patched: Time: 72ms
There could be a runtime-culture dependency, but I don't think my runtime culture (Japanese) has one, since no such rules are defined in UCD SpecialCasing.
if it encounters Japanese characters there
Characters inside <> in a PDF should all be valid hex characters.
The point is that Char.ToUpper() is Unicode compliant. The method doesn't know whether its input is an ASCII hex digit, so it must check every character against the Unicode case-conversion rules.
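For background on why only ASCII should ever reach this code path: a PDF hex string such as <48656c6c6f> may contain only hexadecimal digits and whitespace between the angle brackets, and a missing final digit is treated as 0. A minimal decoder sketch under those rules (illustrative only, not the PdfSharp implementation):

```csharp
using System;
using System.Collections.Generic;

// Sketch: decode the body of a PDF hex string (the text between '<' and '>').
// Per the PDF spec, only 0-9, A-F, a-f and whitespace may appear there;
// an odd number of digits is padded with a trailing 0.
static byte[] DecodeHexStringBody(string body)
{
    var bytes = new List<byte>();
    int hi = -1; // pending high nibble, or -1 if none
    foreach (char ch in body)
    {
        if (char.IsWhiteSpace(ch))
            continue;
        int v = ch switch
        {
            >= '0' and <= '9' => ch - '0',
            >= 'A' and <= 'F' => ch - 'A' + 10,
            >= 'a' and <= 'f' => ch - 'a' + 10, // lowercase handled without ToUpper
            _ => throw new FormatException($"Invalid hex digit '{ch}'"),
        };
        if (hi < 0) { hi = v; }
        else { bytes.Add((byte)((hi << 4) | v)); hi = -1; }
    }
    if (hi >= 0)
        bytes.Add((byte)(hi << 4)); // odd digit count: final digit assumed 0
    return bytes.ToArray();
}
```

Anything outside that character set means the file is corrupt, which is why case conversion beyond ASCII never usefully fires here.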
The code of Char.ToUpper is optimized for characters from the Latin-1 base set. But yes, if it is not inlined, the jumps will consume some time. The difference is much bigger than I expected, even though no Japanese characters should be encountered at the locations you edited.
I'll discuss this with Stefan. Char.IsLetterOrDigit could also be a bit slow.
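If these calls do turn out to matter, the Unicode-aware helpers could be swapped for ASCII-only checks, which is essentially what the patch does. A rough sketch of such replacements (illustrative names, not the actual PdfSharp code):

```csharp
// Hypothetical ASCII-only helpers; branch-light and culture-independent.
// Not the actual patch, just the kind of replacement being discussed.
static char ToUpperAscii(char ch) =>
    (uint)(ch - 'a') <= 'z' - 'a' ? (char)(ch & ~0x20) : ch;

static bool IsLetterOrDigitAscii(char ch) =>
    (uint)(ch - '0') <= '9' - '0' ||
    (uint)(ch - 'A') <= 'Z' - 'A' ||
    (uint)(ch - 'a') <= 'z' - 'a';
```

Because lexing PDF syntax only ever deals with ASCII tokens, dropping the Unicode table lookups is safe at these call sites.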
Thanks for your feedback.
I called PdfReader.Open 10 times in a loop and ran the program twice.
The results for the original code:
Time: 110,3479ms
Time: 27,3819ms
Time: 18,089399999999998ms
Time: 20,707ms
Time: 19,0002ms
Time: 22,9284ms
Time: 24,0744ms
Time: 16,3004ms
Time: 21,814400000000003ms
Time: 15,713899999999999ms
Time: 90,4033ms Time: 26,3759ms Time: 17,6604ms Time: 20,6886ms Time: 19,09ms Time: 23,1262ms Time: 25,128ms Time: 16,6671ms Time: 29,165ms Time: 15,5417ms
The results for the patched code without Char.ToUpper and without Char.IsLetterOrDigit:
Time: 85,71130000000001ms
Time: 26,4195ms
Time: 18,062600000000003ms
Time: 22,696399999999997ms
Time: 22,0221ms
Time: 23,6846ms
Time: 25,331200000000003ms
Time: 16,683300000000003ms
Time: 28,939300000000003ms
Time: 15,5975ms
Time: 81,1105ms Time: 26,153200000000002ms Time: 18,0993ms Time: 22,8924ms Time: 19,1608ms Time: 22,9752ms Time: 24,875999999999998ms Time: 15,945ms Time: 33,0444ms Time: 15,2045ms
This clearly shows the power of the JIT compiler. The effect of the patch is rather small.
My code:
for (int x = 0; x < 10; ++x)
{
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();
    using (PdfDocument doc = PdfReader.Open(@"textbook_appendix.pdf", PdfDocumentOpenMode.ReadOnly))
    {
        stopwatch.Stop();
        Console.WriteLine($"Time: {stopwatch.Elapsed.TotalSeconds * 1000d}ms");
    }
}
Thank you for the investigation. Yes, I got a similar result:
Time: 380ms Time: 22ms Time: 21ms Time: 16ms Time: 24ms Time: 20ms Time: 25ms Time: 16ms Time: 21ms Time: 17ms
Time: 80ms Time: 26ms Time: 19ms Time: 16ms Time: 26ms Time: 25ms Time: 25ms Time: 16ms Time: 20ms Time: 16ms
From the second iteration on, I'd say there's no difference. Still not sure why the first iteration is so slow on my PC...
The first call includes the time for demand-loading the DLL files.
Maybe fewer DLLs will be loaded without Char.ToUpper?
I added a call to Char.ToUpper before starting the Stopwatch, and I think it makes a difference.
for (int x = 0; x < 10; ++x)
{
    var ch = Char.ToUpper('x');
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();
    using (PdfDocument doc = PdfReader.Open(@"\textbook_appendix.pdf", PdfDocumentOpenMode.ReadOnly))
    {
        stopwatch.Stop();
        Console.WriteLine($"Time: {stopwatch.Elapsed.TotalSeconds * 1000d}ms");
    }
}
I tried this loop:
for (int x = 0; x < 10; ++x)
{
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();
    var ch = Char.ToUpper('x');
    stopwatch.Stop();
    Console.WriteLine($"Time: {stopwatch.Elapsed.TotalSeconds * 1000d}ms");
}
And here are the results: Time: 13,8811ms Time: 0,0017000000000000001ms Time: 0,00039999999999999996ms Time: 0,0003ms Time: 0,0003ms Time: 0,0003ms Time: 0,00019999999999999998ms Time: 0,0003ms Time: 0,0003ms Time: 0,0003ms
Maybe the first execution takes even longer on your computer.
Awesome. That was it. 🤦‍♂️
Time: 305.635ms Time: 0.0019ms Time: 0.0009ms Time: 0.0011ms Time: 0.0007999999999999999ms Time: 0.0007ms Time: 0.0007ms Time: 0.0007999999999999999ms Time: 0.0009ms Time: 0.0007999999999999999ms
Fixed in 6.1 preview 2.
The PR removes Char.ToUpper() calls since AllowHexSpecifier supports lowercase characters.
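That AllowHexSpecifier accepts lowercase hex digits can be checked directly, which is why pre-uppercasing the input is redundant. A small sketch:

```csharp
using System;
using System.Globalization;

// NumberStyles.AllowHexSpecifier parses hex digits case-insensitively,
// so uppercasing the input first with Char.ToUpper() is unnecessary.
int a = int.Parse("ff", NumberStyles.AllowHexSpecifier);
int b = int.Parse("FF", NumberStyles.AllowHexSpecifier);
Console.WriteLine(a == b && a == 255); // True
```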