JiffyChen / opendlp

Automatically exported from code.google.com/p/opendlp

Regex works for txt file but not Word or Excel #55

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?

1. Create 3 files (txt, Word and Excel) and enter the following in a line or row:

    1234   ABCD   123-45-6789

2. Create 2 regexes:

   Name: Search23      regex:   23
   Name: SearchABCD    regex:  [Aa][Bb][Cc][Dd]

3. Create a profile and put a check on the following fields:

   Search23
   SearchABCD
   Social_Security_number_dashes

4. Then run a scan
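
As a sanity check that the regexes themselves match the sample line, here is a minimal Python sketch. It runs outside OpenDLP and only exercises the patterns on plain text. The pattern shown for Social_Security_number_dashes is an assumed stand-in, since the built-in OpenDLP definition is not quoted in this report; `SAMPLE_LINE`, `REGEXES` and `find_matches` are illustrative names.

```python
import re

# The line entered into each of the three test files.
SAMPLE_LINE = "1234   ABCD   123-45-6789"

REGEXES = {
    "Search23": r"23",
    "SearchABCD": r"[Aa][Bb][Cc][Dd]",
    # Assumption: a simple dashed-SSN pattern, not OpenDLP's actual built-in regex.
    "Social_Security_number_dashes": r"\b\d{3}-\d{2}-\d{4}\b",
}


def find_matches(line):
    """Return every match of each named regex in the given line."""
    return {name: re.findall(pattern, line) for name, pattern in REGEXES.items()}


if __name__ == "__main__":
    for name, hits in find_matches(SAMPLE_LINE).items():
        print(name, hits)
```

All three patterns match the plain-text line, so a miss in the Word and Excel files would point at how those files are read rather than at the regexes.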

What is the expected output? What do you see instead?

The scan should find those strings ('23', 'ABCD' and the SSN) in all 3 files. 
Instead, only the txt file appears in the scan results; the Word and Excel files do not.

What version of the product are you using? On what operating system?

OpenDLP 0.4.4. Windows 7 and Firefox 10.0.1

Please provide any additional information below.

Please see screenshots attached.

I am sure I overlooked something, but I couldn't figure it out :-(

Thank you in advance

Tom

Original issue reported on code.google.com by tomh...@gmail.com on 27 Mar 2012 at 1:16

GoogleCodeExporter commented 8 years ago
Attached please find the intense verbose log. Processing of the docx file 
begins at line 67 and finds no match; processing of the txt file begins at line 
77 and finds 3 matches; processing of the xlsx file begins at line 141 and finds 
no match.

I also tested a PDF file, searching for a Social Security number, and it worked.

Please advise.

Thanks

Tom

Original comment by tomh...@gmail.com on 27 Mar 2012 at 12:28

GoogleCodeExporter commented 8 years ago
Hi Andrew,

This issue affects Word 2007 (.docx) and Excel 2007 (.xlsx) files. The older 
Word (.doc) and Excel (.xls) formats work fine. I haven't verified PowerPoint, 
Visio, etc. I will post any new findings here.

Thanks

Tom 

Original comment by tom...@ogilvy.com on 27 Mar 2012 at 11:34

GoogleCodeExporter commented 8 years ago
Hi Tom,

I was not able to reproduce this bug. I used your regexes and sample files, and 
the agent scanner was able to find those strings.

Office 2007 files (DOCX, XLSX, PPTX, etc) are really ZIP files. In the log file 
you attached, I didn't see where the agent unzipped the DOCX and XLSX files. In 
your profile, did you remove those file extensions from the "ZIP Extensions" 
section? If so, update your profile to include those extensions and it should 
find the data.
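
To illustrate why the unzip step matters, here is a minimal Python sketch (not the agent's actual code). It builds a small ZIP archive in memory with one compressed XML member, loosely mimicking a DOCX, then runs the SearchABCD regex against the raw bytes and against the unzipped content. The helper names (`make_fake_docx`, `scan_raw`, `scan_unzipped`) and the simplified XML are assumptions for illustration only.

```python
import io
import re
import zipfile

SAMPLE = "1234   ABCD   123-45-6789"
PATTERN = re.compile(rb"[Aa][Bb][Cc][Dd]")  # the SearchABCD regex from the report


def make_fake_docx(text):
    """Build a ZIP with one deflate-compressed XML member (not a valid Word file)."""
    body = "".join("<w:t>{}</w:t>".format(text) for _ in range(20))
    xml = "<w:document>" + body + "</w:document>"
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", xml)
    return buf.getvalue()


def scan_raw(data):
    """Search the container bytes as-is, as if the file were plain text."""
    return PATTERN.search(data) is not None


def scan_unzipped(data):
    """Unzip first, then search each member's decompressed content."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return any(PATTERN.search(zf.read(name)) for name in zf.namelist())


if __name__ == "__main__":
    data = make_fake_docx(SAMPLE)
    print("raw scan:", scan_raw(data))
    print("unzipped scan:", scan_unzipped(data))
```

The text inside the archive is compressed, so a scan of the raw container does not see the string, while a scan of the extracted members does. That is consistent with the TXT file matching while the DOCX and XLSX files did not.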

Original comment by andrew.O...@gmail.com on 2 Apr 2012 at 9:51

GoogleCodeExporter commented 8 years ago
Hi Andrew,

You are 100% correct. Once I entered docx, xlsx and pptx into Zip Extensions 
and ran the scan, OpenDLP found them all. Awesome.

I very much appreciate you taking the time to test this and offering OpenDLP for free :-)

Regards

Tom

Original comment by tomh...@gmail.com on 2 Apr 2012 at 5:22

GoogleCodeExporter commented 8 years ago

Original comment by andrew.O...@gmail.com on 2 Apr 2012 at 7:54