UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)
https://github.com/UglyToad/PdfPig/wiki
Apache License 2.0
1.73k stars 241 forks

PDF text remover generates PDF with error #538

Open mayurjansari opened 1 year ago

mayurjansari commented 1 year ago

image

PdfTextRemover.RemoveText(filePath).Save("notext.pdf")

input file

output file

fnatzke commented 1 year ago

@mayurjansari

Have fix.

\src\UglyToad.PdfPig\Writer\NoTextTokenWriter.cs

Change

    if (!TryGetStreamWithoutText(streamToken, out var outputStreamToken))
    {
        outputStreamToken = streamToken;
    }

to be

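    StreamToken outputStreamToken;
    if (streamToken.StreamDictionary.TryGet(NameToken.Type, out NameToken dictionaryTypeToken))
    {
        // The stream dictionary declares a Type, so this is not a bare page
        // content stream (for example an image XObject): copy it unchanged.
        outputStreamToken = streamToken;
    }
    else
    {
        // No Type entry: treat it as page content and try to strip text operations.
        if (!TryGetStreamWithoutText(streamToken, out outputStreamToken))
        {
            outputStreamToken = streamToken;
        }
    }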

This checks whether the StreamToken being inspected is a page content stream, whose operations should be inspected for text operations, or a stream for something else (such as an XObject image).

Note that some text still appears, and I have yet to confirm whether these are images of text or parts of a "form" object.

Attached

  1. Code before and after.
  2. Example updated output (output2.pdf).
  3. Screenshot side-by-side of before and after.

Example program to test:

  using UglyToad.PdfPig;
  using UglyToad.PdfPig.Writer;
  Console.WriteLine("PdfPig - Issue 538 - TextRemover");

  var filePathInput = @"C:\pdf\input.pdf";
  var pdfBytes = PdfTextRemover.RemoveText(filePathInput);

  var filePathOutput = @"C:\pdf\output2.pdf";
  File.WriteAllBytes(filePathOutput, pdfBytes );

Will check in the fix as it stands tomorrow (research on the remaining text is still pending).

Code before: CodeBefore

Code after: CodeAfter

Example of updated output pdf: output2.pdf

fnatzke commented 1 year ago

Found the additional text in XObject Form object.

Additional text removed on provided PDF.

Output-2023-01-14-TextRemoved

Testing a large number of additional PDFs found a large number of 'corner' cases. For one, the inline image token writer was not writing the key-value pairs between BI and ID.

InlineImageHighLevel-KeyValuePairsBetweenBIandID

However, the current issue is that not all streams have a Type, and a stream without a Type is not necessarily a page content stream. Byte sequences that look like ShowText and ShowTextWithPosition operators can appear by chance in such unsupported byte streams, so parsing them for text results in corrupt streams. Currently remove text is corrupting ICML03-081.pdf from the test document folder.

Holding publishing 'fix' until more stable release.

fnatzke commented 1 year ago

Found:

  1. Add detection of the text operator MoveToNextLineShowText.Symbol in NoTextTokenWriter, in addition to ShowText and ShowWithPositioning.
  2. Detect whether a stream has a dictionary with Type XObject and Subtype Image, and if so write the stream as is (without parsing it as graphics operations).
  3. More challenging: page content streams have a minimal stream dictionary with only Length and usually a Filter of FlateDecode. The higher-level methods that call down to NoTextTokenWriter need to provide context.
  4. Page content can be provided as an array of stream tokens. When the document is opened, these byte streams are merged together before the entire stream is parsed.
  5. If the stream dictionary has Type XObject and Subtype Form, parse the stream for graphics operations.
  6. If the stream dictionary has BitsPerComponent (but not Type XObject), don't parse the stream for graphics operations.
  7. If the higher-level WriteDictionary finds a dictionary with Type Page that has a Contents entry, set a flag that the next (anonymous) stream write (with a bare stream dictionary) has graphics operations to parse.
  8. In PdfDocumentBuilder's AddPage, when writing page contents, set the flag that the next (anonymous) stream write (with a bare stream dictionary) has graphics operations.
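
The stream rules in items 2, 5, 6, 7 and 8 can be read as one decision function. The following is only a sketch of that reading, not the code in the fork: `StreamClassifier`, `StreamAction` and the `expectPageContent` flag are hypothetical names, and a plain string dictionary stands in for PdfPig's stream dictionary.

```csharp
using System;
using System.Collections.Generic;

public enum StreamAction
{
    CopyAsIs,
    ParseForGraphicsOperations
}

public static class StreamClassifier
{
    // 'dictionary' is a hypothetical stand-in for a PDF stream dictionary
    // (key name -> name value); it is NOT PdfPig's DictionaryToken.
    // 'expectPageContent' models the flag described in items 7 and 8.
    public static StreamAction Classify(
        IReadOnlyDictionary<string, string> dictionary,
        bool expectPageContent)
    {
        dictionary.TryGetValue("Type", out var type);
        dictionary.TryGetValue("Subtype", out var subtype);

        if (type == "XObject")
        {
            // Items 2 and 5: form XObjects carry graphics operations, images do not.
            return subtype == "Form"
                ? StreamAction.ParseForGraphicsOperations
                : StreamAction.CopyAsIs;
        }

        if (dictionary.ContainsKey("BitsPerComponent"))
        {
            // Item 6: image-like data without an XObject Type.
            return StreamAction.CopyAsIs;
        }

        if (type == null && expectPageContent)
        {
            // Items 7 and 8: a bare stream that the caller flagged as page content.
            return StreamAction.ParseForGraphicsOperations;
        }

        // Anything else is copied unchanged rather than risk corrupting it.
        return StreamAction.CopyAsIs;
    }
}
```

The default is to copy, so a stream is only rewritten when something positively identifies it as graphics content.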

All documents in the Integration test folder now complete (except 'SPARC - v9 Architecture Manual.pdf', which seems to cause an infinite loop).

Testing another ~1100 PDFs by hand, I have found one, 'mmc2114-1.pdf', that causes an unhandled exception on page 26 when text is removed.

Holding until 'fix' is stable.

fnatzke commented 1 year ago

Fixed remove text on 'SPARC - v9 Architecture Manual.pdf' (from the Integration test folder) by creating a thread and giving it a 2MB stack; so it was just a large document.
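
For anyone wanting to try the same workaround, here is a minimal sketch of running deeply recursive work on a thread with a larger stack. `LargeStackRunner` is a hypothetical helper, not part of PdfPig; it relies only on the standard `Thread(ThreadStart, int)` constructor, and the 2MB default mirrors the figure above.

```csharp
#nullable enable
using System;
using System.Threading;

public static class LargeStackRunner
{
    // Runs 'work' on a dedicated thread with the requested maximum stack size
    // and returns its result to the caller.
    public static T Run<T>(Func<T> work, int stackSizeBytes = 2 * 1024 * 1024)
    {
        T result = default!;
        Exception? error = null;

        var thread = new Thread(() =>
        {
            try
            {
                result = work();
            }
            catch (Exception ex)
            {
                // Capture the failure so it can be surfaced on the calling thread.
                error = ex;
            }
        }, stackSizeBytes);

        thread.Start();
        thread.Join();

        if (error != null)
        {
            throw new InvalidOperationException("Work failed on the large-stack thread.", error);
        }

        return result;
    }
}
```

With the call shown earlier in this thread, usage would look like `var pdfBytes = LargeStackRunner.Run(() => PdfTextRemover.RemoveText(filePathInput));`. A stack overflow itself cannot be caught, so this only helps when the larger stack is actually big enough.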

The remove text failure on mmc2114-1.pdf is caused by following the 'Beads' entry ('B') on the first page object down each path until it reaches a page whose Contents is an array of streams. The array split has cut a TJ operator in half, causing a parse error when an individual stream item is parsed in isolation.

Current thinking is to defer Bead entries until the end. That lets page visits consider the array as a whole, and the later write of the bead entries will find that the object has already been copied. It might also help with deep traversals causing stack overflows.
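
A rough sketch of the deferral idea, under heavy assumptions: `PdfNode` and `DeferredBeadCopier` are hypothetical types that model indirect objects as a simple graph, and the real writer would copy tokens rather than record ids. The explicit stack is there to show how the same loop avoids deep recursion.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model of an indirect object: an id plus keyed links to other objects.
public sealed class PdfNode
{
    public PdfNode(int id) { Id = id; }
    public int Id { get; }
    public List<(string Key, PdfNode Target)> Links { get; } = new List<(string, PdfNode)>();
}

public static class DeferredBeadCopier
{
    // Returns the order in which objects would be copied when links under
    // the 'B' (beads) key are postponed until everything else is done.
    public static List<int> CopyOrder(PdfNode root)
    {
        var copied = new HashSet<int>();
        var order = new List<int>();
        var deferred = new Queue<PdfNode>();
        var stack = new Stack<PdfNode>();
        stack.Push(root);

        while (true)
        {
            while (stack.Count > 0)
            {
                var node = stack.Pop();
                if (!copied.Add(node.Id))
                {
                    continue; // already copied via another path
                }

                order.Add(node.Id);

                // Push in reverse so links are visited in their original order.
                for (var i = node.Links.Count - 1; i >= 0; i--)
                {
                    var (key, target) = node.Links[i];
                    if (key == "B")
                    {
                        deferred.Enqueue(target);
                    }
                    else
                    {
                        stack.Push(target);
                    }
                }
            }

            if (deferred.Count == 0)
            {
                break;
            }

            stack.Push(deferred.Dequeue());
        }

        return order;
    }
}
```

In this model a page reached through the normal page tree is copied before any bead that points at it, so the bead pass finds it already done.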

PageObjectBeadsEntry

PageArrayContentStreamWithTruncatedTJop

mayurjansari commented 1 year ago

@fnatzke thank you.

Try this code.

    StreamToken outputStreamToken;
    if (streamToken.StreamDictionary.TryGet(NameToken.Type, out NameToken dictionaryTypeToken))
    {
        outputStreamToken = streamToken;
    }
    else
    {
        if (!TryGetStreamWithoutText(streamToken, out outputStreamToken))
        {
            outputStreamToken = streamToken;
        }
    }

It works on this type of file; the image now shows. I have some other files, and some of them contain private data, so please give me an email address so I can share them.

fnatzke commented 1 year ago

Thank you.

Like your thinking.

I do have this test (of Type in the stream dictionary) and found I needed a lot more to cover most PDFs. After about 6 different tests, most PDFs now work with no text. Out of 6000 PDFs I currently have 1 failing. It’s a large PDF using “beading”. After a very large number of recursive calls (maybe 45), it presents the NoText WriteStream with a StreamToken whose stream dictionary has only Filter and Length. The stream is partial and incomplete (see the screenshot from the earlier post) and starts with TJ. The call to parse in NoText fails with an exception.

Still looking for more PDFs to test.

Issue with your PDF is likely solved. Will check in intermediate code to forked repository tomorrow.

Email is fnatzke At hotmail.com

mvantzet commented 1 year ago

I might have been thinking too easily about the removal of text, sorry for the extra work!

mayurjansari commented 1 year ago

Yes. I tried that code and found some PDFs where it is not working or the text is not removed. I just sent some of those files. One question: is there any way to remove all images from a PDF?

fnatzke commented 1 year ago

Still more work to do but as promised published work in progress (see links below).

Road block at the moment is that Page content arrays have meaning in some documents and must be parsed and retained but in other documents the stream is incomplete (until merged). I made a big assumption I could merge the arrays and save to the first but empty out the others. That changes the meaning of (some) documents.

Current thinking is to try parsing the separate streams first. If that throws a parser exception, try merging them. After merging it is too hard to separate the streams again (removing text operations can change them substantially), so fall back to the existing approach: save the merged stream to the first array item and empty the other array entries.
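
As a sketch of that fallback only: `ContentArrayProcessor` and the `stripText` callback are hypothetical, with `stripText` assumed to return a stream without text operations and to throw `FormatException` when it cannot parse its input. The streams are concatenated as they are; whether a separator byte is needed between them is left open here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ContentArrayProcessor
{
    // 'streams' are the decoded bytes of each entry in a page's content array.
    // 'stripText' is a hypothetical callback (see the assumptions above).
    public static IReadOnlyList<byte[]> Process(
        IReadOnlyList<byte[]> streams,
        Func<byte[], byte[]> stripText)
    {
        try
        {
            // First choice: keep the array layout and process each stream alone.
            return streams.Select(stripText).ToList();
        }
        catch (FormatException)
        {
            // At least one stream is incomplete on its own; fall through to merging.
        }

        // Fallback: merge everything, process once, store the result in the
        // first entry and leave the remaining entries empty.
        var merged = streams.SelectMany(bytes => bytes).ToArray();
        var result = new List<byte[]> { stripText(merged) };
        for (var i = 1; i < streams.Count; i++)
        {
            result.Add(Array.Empty<byte>());
        }

        return result;
    }
}
```

The array keeps its original length in both paths, so references to the individual stream objects stay valid.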

https://github.com/fnatzke/PdfPig/commit/c159bddbd09e537d15fbf4dd28ed05997c687914

ViewOnlyExampleNoTextUpdates

fnatzke commented 1 year ago

Latest in branch of fork at

https://github.com/fnatzke/PdfPig/tree/ViewOnlyExampleNoTextUpdates

Able to remove all text and images from example PDFs from @mayurjansari; opens in lib, Chrome and Adobe Reader without issue.

Changes:

  1. Removes images as well as text.
  2. Detects ApplyXObject events in the graphics operation stream.
  3. If the XObject is an Image, removes the operation from the stream.
  4. If the XObject is a Form, parses its graphics operation stream (this picks up a lot more text and images).
  5. For content streams that are in an array, detects whether an individual stream will fail to parse; if so, walks forward from the start of the byte stream looking for a newline and starts parsing from there.
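
The resynchronisation step in item 5 could look roughly like the following. This is an illustrative helper with a hypothetical name (`ContentStreamResync`), not the code in the branch; it only finds where parsing would restart.

```csharp
using System;

public static class ContentStreamResync
{
    // Returns the index of the first byte after the first line break
    // (LF, CR or CR LF), or -1 when the data contains no line break.
    public static int FindNextLineStart(byte[] data)
    {
        for (var i = 0; i < data.Length; i++)
        {
            if (data[i] == (byte)'\n')
            {
                return i + 1;
            }

            if (data[i] == (byte)'\r')
            {
                var next = i + 1;
                if (next < data.Length && data[next] == (byte)'\n')
                {
                    next++; // treat CR LF as a single line break
                }

                return next;
            }
        }

        return -1;
    }
}
```

Note that whatever precedes the line break is skipped by the parser, so the caller still has to decide what to do with that leading fragment and with its other half at the end of the previous stream.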

On a large test run of 1000s of PDFs, however, it is still corrupting several.

fnatzke commented 1 year ago

Latest

https://github.com/fnatzke/PdfPig/tree/ViewOnlyExampleNoTextUpdates

Latest commit: https://github.com/fnatzke/PdfPig/commit/2d76ea28b9aa0d0349ac6e886aadff9d3d452e66

Changes:

  1. Improved the ApplyXObject operation name lookup (to detect whether the XObject type is Image or not) for the case where page contents are written before the top PageInfo loop because a link (e.g. Annotation, Beads, etc.) was followed from an earlier page. It now looks back for the PageDictionary write that is underway, rather than PageInfo, for resources (in addition to any XObject Form resources).

Fixed all of @mayurjansari's latest files and 3 of those found locally.

Still have 16 PDFs that are corrupted by NoText. The cause seems to be that the PageInfo resources are empty and the page may rely on its parent or other pages to list the XObject resource references. More to come.

mayurjansari commented 1 year ago

Thank You @fnatzke

I am checking latest commit.

fnatzke commented 1 year ago

@mayurjansari latest has No Text and No Image working in lib, Chrome and Adobe Reader.

I'm down to 10 PDFs that aren’t going to be solved with the current approach.

Removing images will fail as form objects are copied out of sequence.

These form objects have a graphics operation stream and refer to named XObjects that appear in nested forms. The nesting is not known unless the document is parsed in order.

So “No text” is fine but "No image" will break on a small percentage of PDFs.

Fixing these PDFs would need a large effort: a new preprocessor that runs before PdfDocumentBuilder, pulls all pages into memory, and creates a map of the resource state at the time of each (nested form) apply XObject (“Do”) operator. The (NoText) token writer class would use that map to know whether a named XObject is an image or something else (e.g. a Form).
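
To make the lookup side of such a map concrete, here is a small sketch. It assumes the preprocessor has already produced, for a given operator, the list of resource scopes in effect, ordered from the innermost form outwards to the page and its ancestors; that ordering, `XObjectKind` and `XObjectResolver` are all assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;

public enum XObjectKind
{
    Unknown,
    Image,
    Form
}

public static class XObjectResolver
{
    // 'scopes' is a hypothetical view of the resource state at one operator:
    // each dictionary maps an XObject name to its kind, innermost scope first.
    public static XObjectKind Resolve(
        string name,
        IEnumerable<IReadOnlyDictionary<string, XObjectKind>> scopes)
    {
        foreach (var scope in scopes)
        {
            if (scope.TryGetValue(name, out var kind))
            {
                return kind;
            }
        }

        // Not found anywhere: report Unknown so the writer can leave the
        // operation untouched instead of guessing.
        return XObjectKind.Unknown;
    }
}
```

Returning `Unknown` rather than defaulting to `Image` keeps the writer from deleting an operation it cannot identify.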

fnatzke commented 1 year ago

By mapping the source document before NoText, I was able to find resource names in nested forms in 8 of the 10 outstanding PDFs. Down to 2 PDFs from the original ~1000 PDF files.

In a much wider test of 8386 different PDFs:

  1. Success: 7964 (94%), with no text as tested by the library itself without exceptions; however, some seem to report errors on some pages when viewed in Adobe Reader.
  2. Failed with an exception during NoText: 89 (1%).
  3. Failed when checking the resulting PDF with no text (exception): 132 (1.5%).

The balance are PDFs whose originals had no text to start with, so no attempt was made.

I have another collection of 6000 additional PDFs to test. I will then shortly focus on fixes for those found with errors.

EliotJones commented 1 year ago

@mvantzet I probably won't have time to address these various edge cases, do you mind if I remove the implementation of the interface for now and you can pass it in from the client code?

mayurjansari commented 1 year ago

Several classes and methods are internal, so it won't work on the client side if you remove the interface. So please don't remove it.

mvantzet commented 1 year ago

Hi @EliotJones, I don't have an issue with removing the implementation. @fnatzke, not sure if you are still working on these issues or not; if I can help out in any way, let me know.