UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)
https://github.com/UglyToad/PdfPig/wiki
Apache License 2.0

Document creation time is not extracted when available #706

Closed aagubanov closed 8 months ago

aagubanov commented 1 year ago

I've identified several real cases where the creation date is stored inside a document but is not extracted by the library. The following causes have been noticed:

  1. The date is stored not as a literal but rather as a reference to another section. Possible fix (UglyToad.PdfPig.Parser.DocumentInformationFactory.cs, method Create):
foreach (KeyValuePair<string, IToken> pair in infoParsed.Data)
{
  IToken value = pair.Value;
  if (!(value is IndirectReferenceToken reference))
  {
    continue;
  }

  NameToken key = NameToken.Create(pair.Key);
  infoParsed = infoParsed.Without(key).With(key, DereferenceEntry(reference, pdfTokenScanner));
}

private static IToken DereferenceEntry(IToken value, IPdfTokenScanner pdfTokenScanner)
{
  return value is IndirectReferenceToken reference ? pdfTokenScanner.Get(reference.Data).Data : value;
}

Unfortunately, I see no less ugly way to iterate over the token collection.

  2. The timestamp contains a space between date and time.

Possible fix (UglyToad.PdfPig.Util.DateFormatHelper, method TryParseDateTimeOffset):

// Supporting formats like "YYYYMMDD HHmmSS"
s = s.Replace(" ", string.Empty);

  3. The year inside the timestamp occupies 5 digits rather than 4, e.g. 19101, which should mean the 101st year counted from 1900, i.e. 2001 (most probably a Y2K issue). Presumably such a corrupted timestamp is typical only for the first few years of the 21st century. Possible fix (UglyToad.PdfPig.Util.DateFormatHelper, method TryParseDateTimeOffset):
// Gets the year, checking for a possible Y2K issue. An affected year has the following format:
//
// 19YYY
//
// where "YYY" is the year counted from 1900 and is greater than 99 (hence the extra digit).
// YYY would hardly be greater than, say, 105.
bool GetYear(ref int pos, out int year)
{
    // Read a standard four-digit year first; it is returned whenever a Y2K-affected value is not identified
    if (!int.TryParse(s.Substring(pos, 4), out year))
    {
        // Invalid value
        pos += 4;
        return false;
    }

    // A standard ISO datetime value has 14 digits (fractions of a second are not expected)
    if (!HasRemainingCharacters(pos, 15))
    {
        pos += 4;
        return true;
    }

    string centuryStr = s.Substring(pos, 2);
    int century;

    if (!int.TryParse(centuryStr, out century) || century != 19)
    {
        pos += 4;
        return true;
    }

    string centuryYearStr = s.Substring(pos + 2, 3);
    int centuryYear;

    if (!int.TryParse(centuryYearStr, out centuryYear) || centuryYear < 100 || centuryYear > 105)
    {
        pos += 4; 
        return true;
    }

    pos += 5;
    year = century * 100 + centuryYear;
    return true;
}
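
For anyone who needs the two timestamp cases handled on the caller side, here is a minimal standalone sketch of the same idea. It is not PdfPig code: `NormalizePdfDate` and `TryParseCreationDate` are hypothetical names, the optional time zone suffix is simply ignored, and the five-digit year is detected by counting exactly 15 leading digits (a slight variation on the length check above, so that a 14-digit value followed by a zone suffix is left alone). It assumes the full date and time are present.

```csharp
using System;
using System.Globalization;

// Hypothetical helper (not part of PdfPig): strips an optional "D:" prefix,
// removes spaces, and repairs a five-digit "19YYY" year where YYY is 100..105.
static string NormalizePdfDate(string raw)
{
    string s = raw.Trim();
    if (s.StartsWith("D:", StringComparison.Ordinal))
    {
        s = s.Substring(2);
    }

    // Case 2: "YYYYMMDD HHmmSS" becomes "YYYYMMDDHHmmSS".
    s = s.Replace(" ", string.Empty);

    // Count the leading ASCII digits; a complete timestamp normally has 14.
    int digits = 0;
    while (digits < s.Length && s[digits] >= '0' && s[digits] <= '9')
    {
        digits++;
    }

    // Case 3: exactly 15 leading digits starting with "19" suggests "19YYY".
    if (digits == 15 && s.StartsWith("19", StringComparison.Ordinal))
    {
        int centuryYear = int.Parse(s.Substring(2, 3), CultureInfo.InvariantCulture);
        if (centuryYear >= 100 && centuryYear <= 105)
        {
            int year = 1900 + centuryYear; // e.g. "19101" becomes 2001
            s = year.ToString(CultureInfo.InvariantCulture) + s.Substring(5);
        }
    }

    return s;
}

// Hypothetical helper: parses the first 14 digits as local-agnostic date and time.
// Any time zone suffix after the seconds is ignored in this sketch.
static bool TryParseCreationDate(string raw, out DateTime result)
{
    string s = NormalizePdfDate(raw);
    if (s.Length < 14)
    {
        result = default;
        return false;
    }

    return DateTime.TryParseExact(
        s.Substring(0, 14),
        "yyyyMMddHHmmss",
        CultureInfo.InvariantCulture,
        DateTimeStyles.None,
        out result);
}

if (TryParseCreationDate("D:20030512 101530", out DateTime created))
{
    Console.WriteLine(created.ToString("s", CultureInfo.InvariantCulture)); // 2003-05-12T10:15:30
}
```

Counting digits rather than total length is a deliberate choice here: a value such as 19101231120000Z has 15 characters but only 14 digits, so it stays in the year 1910 instead of being rewritten.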

Not sure how to share sample documents.

BobLd commented 1 year ago

@aagubanov thanks for raising this issue and the thorough explanation.

You should be able to share sample documents by dragging and dropping them into the comment you write.

aagubanov commented 1 year ago

R3P0000007520_SMART M9_I LZBUCKLA.PDF
EBOOK-DIETETYKA-SPORTOWA_copy_1.pdf
pressrelease.pdf

aagubanov commented 1 year ago

> @aagubanov thanks for raising this issue and the thorough explanation.
>
> You should be able to share sample documents by dragging and dropping them into the comment you write.

Thank you, it works.

BobLd commented 8 months ago

@aagubanov thanks for providing the documents. I've created a fix PR to handle indirect references in doc info factory.

I think the date format issue is out of scope for PdfPig, though.