IBM / license-scanner

License Scanner
Apache License 2.0

Ignoring Non-Textual files in Directory-level scans #30

Open atharv-phadnis opened 1 year ago

atharv-phadnis commented 1 year ago

Hello,

We were trying to use the tool for directory-level scans (using --dir) over a number of cloned repositories. For instance, scanning gitea results in the following:

$ license-scanner --dir gitea/
Error: failed to normalize data: invalid input text with control characters

We saw the same error on a few more directories containing non-textual files such as UI assets and binaries.

Would it be possible to emit a warning for such files, skip them, and continue scanning the remaining files? Or perhaps add a command-line argument that enables this behavior?

markstur commented 1 year ago

I had a workaround for this. There is a bit more to it that I need to untangle (probably not specific to this issue), but basically here (below) is where the error can be changed to log-and-continue.

I'll assign this to myself. There is a pending PR and another repo move in progress that might delay this, though.

===================================================================

diff --git a/normalizer/normalizer.go b/normalizer/normalizer.go
--- a/normalizer/normalizer.go  
+++ b/normalizer/normalizer.go  
@@ -151,7 +151,13 @@
    // Check if the text contains control characters indicative of binary or non-text files.
    // match against /[\u0000-\u0007\u000E-\u001B]/
    if ControlCharactersRE.MatchString(n.OriginalText) {
-       return fmt.Errorf("failed to normalize data: invalid input text with control characters")
+       if n.IsTemplate {
+           return fmt.Errorf("failed to normalize data: invalid input text with control characters")
+       } else {
+           Logger.Errorf("failed to normalize data: invalid input text with control characters")
+           n.NormalizedText = ""
+           return nil // continue
+       }
    }
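
The character class quoted in the diff comment can be tried in isolation. The sketch below assumes the pattern is exactly the one in that comment; `controlCharsRE` and `looksBinary` are illustrative names, not the identifiers used in normalizer.go.

```go
package main

import (
	"fmt"
	"regexp"
)

// controlCharsRE mirrors the class from the diff comment: U+0000-U+0007 and
// U+000E-U+001B. Tab, newline, and carriage return lie outside both ranges,
// so ordinary text files do not match.
var controlCharsRE = regexp.MustCompile(`[\x00-\x07\x0E-\x1B]`)

// looksBinary reports whether text contains any of those control characters.
func looksBinary(text string) bool {
	return controlCharsRE.MatchString(text)
}

func main() {
	samples := []string{
		"MIT License\n\tCopyright (c)",  // plain text with tab and newline
		"\x89PNG\r\n\x1a\n\x00\x00\x00", // start of a PNG file
		"\x1b[1mbold\x1b[0m",            // text with ANSI escape sequences
	}
	for _, s := range samples {
		fmt.Printf("%q -> %v\n", s, looksBinary(s))
	}
}
```

Note that the class includes ESC (U+001B), so a text file containing ANSI color codes would also be treated as non-text by this check.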

atharv-phadnis commented 1 year ago

Hey @markstur, thanks for the prompt reply.

I tested your workaround and it seems to resolve the issue for now. I also ran into another error with a similar outcome: Error: file too large (4986500 > 1000000)

I tried changes similar to what you suggested for the earlier issue, like so:

diff --git a/identifier/identifier.go b/identifier/identifier.go
index 4750fa7..7bb47bd 100644
--- a/identifier/identifier.go
+++ b/identifier/identifier.go
@@ -109,7 +109,8 @@ func IdentifyLicensesInFile(filePath string, options Options, licenseLibrary *li
                return IdentifierResults{}, err
        }
        if fi.Size() > 1000000 {
-               return IdentifierResults{}, fmt.Errorf("file too large (%v > 1000000)", fi.Size())
+               Logger.Errorf("file too large (%v > 1000000)", fi.Size())
+               return IdentifierResults{}, nil
        }

        b, err := ioutil.ReadFile(filePath)

Could you confirm whether this is the right way to handle the problem, or should it be done differently? Also, would it be possible to incorporate this change as well?