Open mayurjansari opened 1 year ago
@mayurjansari
Have fix.
\src\UglyToad.PdfPig\Writer\NoTextTokenWriter.cs
Change
if (!TryGetStreamWithoutText(streamToken, out var outputStreamToken))
{
    outputStreamToken = streamToken;
}
to be
StreamToken outputStreamToken;
if (streamToken.StreamDictionary.TryGet(NameToken.Type, out NameToken dictionaryTypeToken))
{
    outputStreamToken = streamToken;
}
else
{
    if (!TryGetStreamWithoutText(streamToken, out outputStreamToken))
    {
        outputStreamToken = streamToken;
    }
}
This checks whether the StreamToken being inspected is a page content stream, whose operations should be scanned for text operations, or a stream for something else (such as an XObject image).
Note that some text still appears, and I have not yet confirmed whether these are images of text or parts of a "form" object.
Attached
Example program to test:
using System;
using System.IO;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Writer;

Console.WriteLine("PdfPig - Issue 538 - TextRemover");
var filePathInput = @"C:\pdf\input.pdf";
var pdfBytes = PdfTextRemover.RemoveText(filePathInput);
var filePathOutput = @"C:\pdf\output2.pdf";
File.WriteAllBytes(filePathOutput, pdfBytes);
Will check in fix (as it is pending research on remaining text) tomorrow.
Code before:
Code after:
Example of updated output pdf: output2.pdf
Found the additional text in XObject Form object.
Additional text removed on provided PDF.
Testing a large number of additional PDFs found a large number of 'corner' cases. For example, the inline image token writer was not writing the key-value pairs between BI and ID.
However, the current issue is that not all streams have a Type, and a stream without a Type is not necessarily a page content stream. Parsing such streams for ShowText and ShowTextWithPosition can match those operators by chance in unrelated byte streams, which results in corrupt streams. Currently remove text is corrupting ICML03-081.pdf from the test document folder.
Holding publishing 'fix' until more stable release.
Found:
Have all documents in the Integration test folder completed (except 'SPARC - v9 Architecture Manual.pdf', which seems to cause an infinite loop).
Testing another ~1100 PDFs by hand and have found one, 'mmc2114-1.pdf', that causes an unhandled exception on page 26 when text is removed.
Holding until 'fix' is stable.
Fixed remove text on 'SPARC - v9 Architecture Manual.pdf' (from the Integration test folder) by creating a thread and giving it a 2 MB stack; so it was not an infinite loop, just a large document.
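For anyone who wants to try the same workaround from client code, here is a minimal sketch of running work on a thread with a larger stack. LargeStackRunner is a hypothetical helper written for this post, not part of PdfPig; it only relies on the standard Thread(ThreadStart, int maxStackSize) constructor.

```csharp
using System;
using System.Threading;

// Hypothetical helper (not part of PdfPig): runs a delegate on a dedicated
// thread whose stack size is chosen by the caller.
public static class LargeStackRunner
{
    // Runs 'work' on a new thread with 'stackSizeBytes' of stack and returns
    // its result. An exception thrown by 'work' is rethrown on the calling
    // thread, wrapped so the original stack trace is kept as InnerException.
    public static T Run<T>(Func<T> work, int stackSizeBytes = 2 * 1024 * 1024)
    {
        T result = default(T);
        Exception error = null;

        var thread = new Thread(() =>
        {
            try
            {
                result = work();
            }
            catch (Exception ex)
            {
                error = ex;
            }
        }, stackSizeBytes);

        thread.Start();
        thread.Join(); // block until the work has finished

        if (error != null)
        {
            throw new InvalidOperationException("Work on the large-stack thread failed.", error);
        }

        return result;
    }
}
```

With the example program from earlier in this thread, the call would look something like LargeStackRunner.Run(() => PdfTextRemover.RemoveText(filePathInput)). A bigger stack only postpones the overflow for deeper documents, so this is a stopgap rather than a fix for the recursion itself.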
The remove text failure on mmc2114-1.pdf is caused by following the 'Beads' entry ('B') on the first page object down each path until reaching a page whose Contents is an array of streams. The array splits a TJ operator in half, so parsing an individual stream item in isolation causes a parse error.
Current thinking is to defer Bead entries until the end. That lets page visits consider the array as a whole, and the later write of bead entries then finds that the object has already been copied. It might also help with deep traversals causing stack overflows.
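A rough sketch of the deferral idea, to make the ordering concrete. Node and DeferredCopier are hypothetical stand-ins invented for this illustration (they are not PdfPig types), and the real token copying is reduced to recording the order in which objects would be copied.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for an indirect PDF object: an id plus keyed references.
public sealed class Node
{
    public int Id { get; }
    public List<(string Key, Node Target)> References { get; } = new List<(string Key, Node Target)>();

    public Node(int id)
    {
        Id = id;
    }
}

public static class DeferredCopier
{
    // Returns the order in which objects would be copied. References stored
    // under 'deferredKey' (for example "B" for beads) are only followed once
    // the ordinary traversal has run out of work.
    public static List<int> CopyOrder(Node root, string deferredKey = "B")
    {
        var copied = new HashSet<int>();
        var order = new List<int>();
        var deferred = new Queue<Node>();
        var pending = new Stack<Node>(); // explicit stack instead of recursion
        pending.Push(root);

        while (pending.Count > 0 || deferred.Count > 0)
        {
            // Deferred targets are taken only when no ordinary work is left.
            var node = pending.Count > 0 ? pending.Pop() : deferred.Dequeue();
            if (!copied.Add(node.Id))
            {
                continue; // already copied through another path
            }

            order.Add(node.Id);

            // Push in reverse so references are visited in declaration order.
            for (var i = node.References.Count - 1; i >= 0; i--)
            {
                var (key, target) = node.References[i];
                if (key == deferredKey)
                {
                    deferred.Enqueue(target);
                }
                else
                {
                    pending.Push(target);
                }
            }
        }

        return order;
    }
}
```

In this sketch a page reached through a bead has normally been copied already by the ordinary page traversal, so the deferred visit just skips it. The explicit stack also means traversal depth no longer consumes call stack.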
@fnatzke thank you.
Try this code.
StreamToken outputStreamToken;
if (streamToken.StreamDictionary.TryGet(NameToken.Type, out NameToken dictionaryTypeToken))
{
    outputStreamToken = streamToken;
}
else
{
    if (!TryGetStreamWithoutText(streamToken, out outputStreamToken))
    {
        outputStreamToken = streamToken;
    }
}
It is working on this type of file. I have some other files; now it is showing
Some files contain private data, so please give me an email address so I can share the files.
Thank you.
Like your thinking.
I do have this test (of Type in the stream dictionary) and found I needed a lot more to cover most PDFs. After about 6 different tests, most PDFs now work with no text. Out of 6000 PDFs I currently have 1 failing. It’s a large PDF using “beading”. After a very large number of recursive calls (maybe 45), this presents the NoText write stream with a StreamToken whose StreamDictionary has only Filter and Length. The stream is partial and incomplete (see screenshot from earlier post) and starts with TJ. The call to parse in NoText fails with an exception.
Still looking for more PDFs to test.
Issue with your PDF is likely solved. Will check in intermediate code to forked repository tomorrow.
Email is fnatzke At hotmail.com
I might have been thinking too easily about the removal of text, sorry for the extra work!
Yes, I tried that code and found some PDFs where it is not working or the text is not removed. I just sent some of those files. I want to ask one question: is there any way to remove all images from a PDF?
Still more work to do but as promised published work in progress (see links below).
The road block at the moment is that page content arrays have meaning in some documents and must be parsed and retained, but in other documents an individual stream is incomplete (until merged). I made a big assumption that I could merge the arrays, save the result to the first entry, and empty out the others. That changes the meaning of (some) documents.
Current thinking is to try parsing the separate streams first. If we get a parser exception, try merging. After that it is too hard to separate the streams again (the content may change a lot once text operations are removed), so use the existing approach: save the merged stream to the first array item and empty the other array entries.
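The parse-then-merge fallback could look roughly like this. ContentArrayFallback is a hypothetical helper, and stripText is an assumed callback that parses one content stream, returns it without text operations, and throws when the bytes do not parse on their own; the actual code in the fork is structured differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of PdfPig) sketching the fallback order.
public static class ContentArrayFallback
{
    // 'stripText' is an assumed callback: it parses one content stream and
    // returns it without text operations, throwing if the bytes do not parse.
    public static List<byte[]> Process(IReadOnlyList<byte[]> streams, Func<byte[], byte[]> stripText)
    {
        try
        {
            // Preferred path: every array entry parses on its own, so the
            // array structure of the page is preserved.
            return streams.Select(stripText).ToList();
        }
        catch (Exception)
        {
            // Fallback: treat the whole array as one stream. A newline after
            // each entry keeps its last token apart from the next entry's first.
            var merged = new List<byte>();
            foreach (var stream in streams)
            {
                merged.AddRange(stream);
                merged.Add((byte)'\n');
            }

            // Save the stripped, merged stream to the first array item and
            // empty the remaining entries. If even the merged stream fails to
            // parse, the exception propagates to the caller.
            var result = new List<byte[]> { stripText(merged.ToArray()) };
            for (var i = 1; i < streams.Count; i++)
            {
                result.Add(Array.Empty<byte>());
            }

            return result;
        }
    }
}
```

Trying the separate streams first means documents whose array entries are individually meaningful are left with their structure intact, and only the documents that cannot be parsed piecewise pay the cost of merging.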
https://github.com/fnatzke/PdfPig/commit/c159bddbd09e537d15fbf4dd28ed05997c687914
Latest in branch of fork at
https://github.com/fnatzke/PdfPig/tree/ViewOnlyExampleNoTextUpdates
Able to remove all text and images from example PDFs from @mayurjansari; opens in lib, Chrome and Adobe Reader without issue.
Changes:
On a large test run of 1000s of PDFs, however, it is still corrupting several.
Latest
https://github.com/fnatzke/PdfPig/tree/ViewOnlyExampleNoTextUpdates
Latest commit https://github.com/fnatzke/PdfPig/commit/2d76ea28b9aa0d0349ac6e886aadff9d3d452e66
Changes:
Fixed all of @mayurjansari's latest and 3 found locally.
Still have 16 PDFs that NoText corrupts. The cause seems to be that the PageInfo resource is empty and may rely on the Page parent or other pages to list the Page resource XObject references. More to come.
Thank You @fnatzke
I am checking latest commit.
@mayurjansari latest has No Text and No Image working in lib, Chrome and Adobe Reader.
I'm down to 10 PDFs that aren’t going to be solved with the current approach.
Removing images will fail as form objects are copied out of sequence.
These form objects have a graphics operation stream and refer to named XObjects that appear in nested forms. The nesting is not known unless the document is parsed in order.
So “No text” is fine, but "No image" will break on a small percentage of PDFs.
These PDFs would need a large effort: a new preprocessor that runs before PdfDocumentBuilder, pulls all pages into memory, and creates a map showing the resource store state at the time of each (nested form) apply XObject (“Do”) operator. The (NoText) token writer class would use that map to know whether a named XObject is an image or something else (e.g. a Form).
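To give an idea of what such a map might hold, here is a small sketch. XObjectKind, XObjectInfo and XObjectMapBuilder are hypothetical types made up for this illustration; a real preprocessor would have to build the structure from the document's resource dictionaries and would also need to handle inherited resources.

```csharp
using System;
using System.Collections.Generic;

public enum XObjectKind { Image, Form }

// Hypothetical stand-in for an XObject: its kind plus, for forms, the named
// XObjects listed in the form's own resource dictionary.
public sealed class XObjectInfo
{
    public XObjectKind Kind { get; }
    public Dictionary<string, XObjectInfo> Resources { get; } = new Dictionary<string, XObjectInfo>();

    public XObjectInfo(XObjectKind kind)
    {
        Kind = kind;
    }
}

public static class XObjectMapBuilder
{
    // Maps a path such as "Fm1/Im3" (form name, then the name used inside
    // that form) to the kind of XObject the name resolves to at that depth.
    public static Dictionary<string, XObjectKind> Build(Dictionary<string, XObjectInfo> pageResources)
    {
        var map = new Dictionary<string, XObjectKind>();
        var visiting = new HashSet<XObjectInfo>(); // guards against forms that reference themselves
        Walk(pageResources, "", map, visiting);
        return map;
    }

    private static void Walk(
        Dictionary<string, XObjectInfo> resources,
        string prefix,
        Dictionary<string, XObjectKind> map,
        HashSet<XObjectInfo> visiting)
    {
        foreach (var pair in resources)
        {
            var path = prefix.Length == 0 ? pair.Key : prefix + "/" + pair.Key;
            map[path] = pair.Value.Kind;

            // Descend into nested forms, skipping any form already on the current path.
            if (pair.Value.Kind == XObjectKind.Form && visiting.Add(pair.Value))
            {
                Walk(pair.Value.Resources, path, map, visiting);
                visiting.Remove(pair.Value);
            }
        }
    }
}
```

With a map like this, the token writer could look up the name used by a Do operator together with the chain of forms it is currently inside, instead of depending on the order in which form objects happen to be copied.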
By mapping the source document before NoText, I am able to find resource names in nested forms in 8 of the 10 outstanding PDFs. Down to 2 PDFs from the original ~1000 PDF files.
In a much wider test of 8386 different PDFs:
Success: 7964 (94%) (NoText as tested by the library itself, without exceptions; however, some seem to report errors on some pages when viewed in Adobe Reader)
Failed with an exception during NoText: 89 (1%)
Failed when checking the resulting PDF with no text (exception): 132 (1.5%)
The balance are original PDFs that had no text to start with, so no attempt was made.
Have another collection of 6000 additional PDFs. Will shortly focus on fixes for those found with errors.
@mvantzet I probably won't have time to address these various edge cases, do you mind if I remove the implementation of the interface for now and you can pass it in from the client code?
Several classes and methods are internal, so it won't work on the client side if you remove the interface. So please don't remove it.
Hi @EliotJones , I don't have an issue with removing the implementation. @fnatzke not sure if you are still working on these issues or not, if I can help out in any way let me know.
PdfTextRemover.RemoveText(filePath).Save("notext.pdf")
input file
output file