View the GitHub project here or download the latest release here.
It is highly recommended that you review all redactions generated by this script; no automated process is perfect! The logic described below could yield odd redactions when the quality of the source text is less uniform, for example in an image which has been OCR'ed. There can also be instances where Aspose reports coordinates for text found in a PDF that are not fully accurate, causing the resulting markups to be off.
At the time of this script's creation, the API does not offer functionality to perform bulk redactions, something that is present in the Nuix Workstation user interface. Using functionality built in to SuperUtilities.jar, this script provides the ability to generate bulk redactions based on:

- Regular expressions
- Terms and phrases
- Named entities

The script leverages Aspose (distributed with Nuix) to search the content of PDFs for text matches. The coordinates of each match are converted to a form which can then be used to apply redactions to items in your Nuix case. Since Aspose needs a PDF file to search, this script exports a temporary PDF file for each item processed.
Begin by downloading the latest release of this code. Extract the contents of the archive into your Nuix scripts directory. In Windows the script directory is likely going to be either of the following:
- %appdata%\Nuix\Scripts - User level script directory
- %programdata%\Nuix\Scripts - System level script directory

Begin by selecting some items in the results view, then run the script. A settings dialog will be presented.
On this tab you either select the name of an existing markup set (if any already exist in the case) or provide the name of a new markup set to be created by the script. Redactions (markups) will be added to this markup set.
On this tab you also provide a temp directory. This temp directory is where the script will export PDF files which will be used by Aspose for determining text position data.
On this tab you may provide regular expressions which will be used to locate text to be redacted.
On this tab you may provide single terms or entire phrases. All text searches are submitted to Aspose as regular expressions. Terms and phrases provided on this tab are converted to regular expressions internally by the script using the following steps:

- Characters with special meaning in regular expressions are escaped: \{}.^$()-*?|<>[]
- Each letter is replaced with a character class matching both its upper and lower case forms, so Cat becomes [Cc][Aa][Tt]. This is because, while the Aspose API accepts regular expressions, there does not seem to be a way to tell it to locate them in a case insensitive manner.
- Whitespace is replaced with \s+ (match 1 or more whitespace characters).
- The expression is prefixed and suffixed with \b (anchor to word boundary).

Here are some example input terms/phrases and the resulting expressions they yield.

Input | Resulting Expression |
---|---|
C:\ImportantData\Spreadsheet.xlsx | \b[Cc]:\\[Ii][Mm][Pp][Oo][Rr][Tt][Aa][Nn][Tt][Dd][Aa][Tt][Aa]\\[Ss][Pp][Rr][Ee][Aa][Dd][Ss][Hh][Ee][Ee][Tt]\.[Xx][Ll][Ss][Xx]\b |
randomized fake data | \b[Rr][Aa][Nn][Dd][Oo][Mm][Ii][Zz][Ee][Dd]\s+[Ff][Aa][Kk][Ee]\s+[Dd][Aa][Tt][Aa]\b |
The Lazy Cat | \b[Tt][Hh][Ee]\s+[Ll][Aa][Zz][Yy]\s+[Cc][Aa][Tt]\b |
[REPLY] | \b\[[Rr][Ee][Pp][Ll][Yy]\]\b |
1-555-555-1234 | 1\-555\-555\-1234 |

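
The term-to-expression conversion described above can be approximated with a short sketch. This is not the script's actual code: the function name `term_to_expression` is hypothetical, only ASCII letters are expanded into character classes, and the word boundary anchors are always added, which may differ from the script's behavior for some inputs.

```python
import string

# Characters escaped with a backslash, as listed in the steps above.
SPECIAL_CHARS = set(r"\{}.^$()-*?|<>[]")


def term_to_expression(term: str) -> str:
    """Convert a term or phrase into a case insensitive regular expression.

    Hypothetical helper: a sketch of the documented steps, not the
    script's implementation.
    """
    words = []
    # str.split() with no arguments splits on runs of whitespace.
    for word in term.split():
        pieces = []
        for ch in word:
            if ch in SPECIAL_CHARS:
                # Escape characters that have special meaning in a regex.
                pieces.append("\\" + ch)
            elif ch in string.ascii_letters:
                # Match either case, since no case insensitive flag is used.
                pieces.append(f"[{ch.upper()}{ch.lower()}]")
            else:
                pieces.append(ch)
        words.append("".join(pieces))
    # Join words with \s+ and anchor both ends to a word boundary.
    return r"\b" + r"\s+".join(words) + r"\b"


if __name__ == "__main__":
    print(term_to_expression("The Lazy Cat"))
    print(term_to_expression("[REPLY]"))
```

Handling each character in a single pass avoids a pitfall of applying the steps as separate text substitutions: the brackets introduced by the character classes are never themselves escaped.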
On this tab you may select the named entity types whose matches you wish to have redacted. For each selected named entity type and a given item, all the named entity match values will be collected and then converted to expressions using the workflow outlined above for terms and phrases.
To find text, the script makes use of the Aspose class TextFragmentAbsorber. This in turn provides a series of TextFragment objects. Each TextFragment contains information about the text matched as well as a bounding box that encompasses that text. If the matched text wraps to a new line in the PDF, the bounding box provided would cover the entirety of both lines.
The script deals with this by going deeper and inspecting each TextSegment in the fragment (essentially each individual character). The script then groups TextSegments by the line they are on, as determined by the value of the bounding box lower left Y coordinate rounded to 2 decimal places. Then within each per-line group, TextSegments are ordered by the lower left X coordinate. Multiple TextSegments on a given line are then converted into a single bounding box, which is then used to generate the appropriate redaction in Nuix.
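
The per-line grouping described above can be sketched as follows. This is an illustration only, not the script's code: the `Box` class and the `boxes_per_line` function are hypothetical stand-ins for the rectangles Aspose reports for each TextSegment, and the sketch assumes PDF coordinates with the origin at the bottom left of the page, so a larger Y value is higher on the page.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical bounding box; not an Aspose or Nuix type."""
    llx: float  # lower left X
    lly: float  # lower left Y
    urx: float  # upper right X
    ury: float  # upper right Y


def boxes_per_line(segment_boxes):
    """Merge segment boxes into one bounding box per line of text."""
    lines = defaultdict(list)
    for box in segment_boxes:
        # Segments whose lower left Y matches to 2 decimal places
        # are treated as being on the same line.
        lines[round(box.lly, 2)].append(box)

    merged = []
    # Assumed bottom-left origin: visit lines from top of page to bottom.
    for line_y in sorted(lines, reverse=True):
        ordered = sorted(lines[line_y], key=lambda b: b.llx)
        merged.append(Box(
            llx=ordered[0].llx,
            lly=min(b.lly for b in ordered),
            urx=max(b.urx for b in ordered),
            ury=max(b.ury for b in ordered),
        ))
    return merged


if __name__ == "__main__":
    # A match that wraps across two lines yields two boxes, not one.
    segments = [
        Box(100.0, 700.001, 110.0, 712.0),
        Box(90.0, 700.004, 100.0, 712.0),
        Box(50.0, 686.0, 60.0, 698.0),
        Box(60.0, 686.0, 72.0, 698.0),
    ]
    for box in boxes_per_line(segments):
        print(box)
```

Rounding the Y coordinate before grouping tolerates small floating point differences between segments that sit on the same baseline.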
This extra logic means:
This script relies on code from Nx to present a settings dialog and progress dialog. This JAR file is not included in the repository (although it is included in release downloads). If you clone this repository, you will also want to obtain a copy of Nx.jar by either:
Once you have a copy of Nx.jar, make sure to include it in the same directory as the script.
This script also relies on code from SuperUtilities, which contains the code for performing the redactions. This JAR file is not included in the repository (although it is included in release downloads). If you clone this repository, you will also want to obtain a copy of SuperUtilities.jar by either:
Once you also have a copy of SuperUtilities.jar, make sure to include it in the same directory as the script.
Copyright 2021 Nuix
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.