empira / PDFsharp

PDFsharp and MigraDoc Foundation for .NET 6 and .NET Framework
https://docs.pdfsharp.net/

Incorrect parsing of REVERSE SOLIDUS in literal string #154

Open Greybird opened 1 month ago

Greybird commented 1 month ago

Reporting an Issue Here

While parsing some files, I noticed that some Info entries show incorrect values. For example, the Producer entry of this file is read incorrectly.

Expected Behavior

When parsing a literal string, if a REVERSE SOLIDUS is immediately followed by a character that is not listed in Table 3 of section 7.3.4.2 of ISO/DIS 32000-2, the REVERSE SOLIDUS should be ignored, but the following character should be kept.
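The rule can be sketched as a small standalone decoder. This is only an illustration of the expected behavior, not PDFsharp's Lexer: the `LiteralStringEscapes` class and its `Unescape` method are hypothetical names, and the input is assumed to be the body of the literal string without its outer parentheses.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; this is not PDFsharp's Lexer.
public static class LiteralStringEscapes
{
    // Decodes the escape sequences in the body of a PDF literal string
    // (assumed to be the text between the outer parentheses).
    public static string Unescape(string body)
    {
        var result = new StringBuilder(body.Length);
        for (int i = 0; i < body.Length; i++)
        {
            char c = body[i];
            if (c != '\\')
            {
                result.Append(c);
                continue;
            }
            if (i + 1 >= body.Length)
                break; // Trailing REVERSE SOLIDUS with nothing after it: drop it.

            char next = body[++i];
            switch (next)
            {
                // Escape sequences listed in Table 3.
                case 'n': result.Append('\n'); break;
                case 'r': result.Append('\r'); break;
                case 't': result.Append('\t'); break;
                case 'b': result.Append('\b'); break;
                case 'f': result.Append('\f'); break;
                case '(': case ')': case '\\': result.Append(next); break;
                // REVERSE SOLIDUS at end of line: the string continues, emit nothing.
                case '\r':
                    if (i + 1 < body.Length && body[i + 1] == '\n') i++;
                    break;
                case '\n': break;
                default:
                    if (next >= '0' && next <= '7')
                    {
                        // \ddd: one to three octal digits.
                        int value = next - '0';
                        int digits = 1;
                        while (digits < 3 && i + 1 < body.Length
                               && body[i + 1] >= '0' && body[i + 1] <= '7')
                        {
                            value = value * 8 + (body[++i] - '0');
                            digits++;
                        }
                        result.Append((char)(value & 0xFF)); // high-order overflow ignored
                    }
                    else
                    {
                        // Not in Table 3: ignore the REVERSE SOLIDUS, keep the character.
                        result.Append(next);
                    }
                    break;
            }
        }
        return result.ToString();
    }
}
```

With this sketch, a made-up input such as `LiteralStringEscapes.Unescape(@"\(\PDF\)")` yields `(PDF)`: the `\P` pair loses only its REVERSE SOLIDUS, and the `P` survives.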

Actual Behavior

When parsing a literal string, if a REVERSE SOLIDUS is immediately followed by a character that is not listed in Table 3 of section 7.3.4.2 of ISO/DIS 32000-2, both the REVERSE SOLIDUS and the following character are dropped.

Steps to Reproduce the Behavior

using FluentAssertions;
using PdfSharp.Pdf.IO;
using Xunit;

[Fact]
public void ReverseSolidus_with_invalid_following_character_should_be_ignored()
{
    using var doc = PdfReader.Open(@"Cover-letter-4098208.pdf");
    var producer = doc.Info.Producer;
    producer.Should().Be("C48x Series (PDF - 300X300 dpi)");
}

Expected producer to be "C48x Series (PDF - 300X300 dpi)" with a length of 31, but "C48x Series (DF - 300X300 dpi)" has a length of 30, differs near "DF " (index 13).

The issue is most likely related to an open question about how to interpret the specification, as explained in this comment in Lexer.cs.