barasher / go-exiftool

Golang wrapper for Exiftool : extract as much metadata as possible (EXIF, ...) from files (pictures, pdf, office documents, ...)
GNU General Public License v3.0
250 stars 44 forks source link

ExtractMetadata becomes non-responsive after individual error #52

Closed jnathanh closed 2 years ago

jnathanh commented 2 years ago

Hi, thanks for sharing this package. I ran into an error case trying to do a filepath.Walk over my photos directory. Manually skipping directories does avoid this issue, but it seems like the wrong behavior for ExtractMetadata to silently start giving incorrect results as a result of a previous call. I'd like to incorporate this package into some of my workflows, so I'm raising the issue to help improve the package's usability/reliability. Here is a test that reproduces the behavior:

func TestNonResponsiveExiftool(t *testing.T) {
    t.Parallel()

    // create test directory with about 200MB worth of photos
    // 10,000 of the included jpg or 6 30MB CR2's both repro
    dirPath := "./testdata/extractkiller"

    err := os.Mkdir(dirPath, 0744)
    require.NoError(t, err)
    defer os.RemoveAll(dirPath)

    for i:=0;i < 10_000;i++ {
        err := copyFile("./testdata/20190404_131804.jpg", path.Join(dirPath, fmt.Sprintf("%d.jpg", i)))
        require.NoError(t, err)
    }

    // initialize exiftool
    e, err := NewExiftool(Debug())
    require.NoError(t, err)
    defer e.Close()

    // control case, everything working normally
    f := e.ExtractMetadata("./testdata/20190404_131804.jpg")
    assert.Equal(t, 1, len(f))
    assert.NoError(t, f[0].Err)

    // case that breaks the session, reading a directory does not seem to be supported by this package, which is okay, and I would expect it to throw an error in that case, but it should not make future reads invalid.
    f = e.ExtractMetadata(dirPath)
    assert.Error(t, f[0].Err)

    // control case no longer works
    f = e.ExtractMetadata("./testdata/20190404_131804.jpg")
    assert.Equal(t, 1, len(f))
    assert.NoError(t, f[0].Err)

}

I suspect the issue is either a race (scanning output before it's finished), or improper buffer usage (overflow maybe?). I ran the same test with exiftool directly and got output right away without issue.

One other note, it looks like exiftool supports paths to a directory as an input, or even a list of multiple files. It might be worth reworking the design to let exiftool handle the work of managing multiple paths (directories or files), and add support for parsing the multi-file output.

I'll try to work on a PR for this, but posting the issue now as a heads up, and to see if you have any suggestions on the right design to fix the issue.

barasher commented 2 years ago

Hi @jnathanh !

Since providing a folder makes the library bug, I think that the first thing to do is to reject folder paths.

I'm not convinced that handling folders is usefull. As a CLI it is (for exiftool), but as a library, if it does not increase performance, I think that keeping the library as simple as possible is better. What to you think ?

jnathanh commented 2 years ago

@barasher I agree it's best to stick with the simplest fix, and I don't personally need to pass the directory so I think your first instinct to leave out directory support for now is probably the right call.

I did a little more digging and found the root issue, the buffer is getting exhausted. In ExtractMetadata, when the Scan() returns false, we currently assume there are no errors and continue (not reporting the error and leaving the Scanner in a bad state). I happened to find it via loading multiple files (from a directory), but it could theoretically manifest for a single large metadata file.

I started on a fix, but it gets messy without breaking changes to the api. The obvious change is to report actual scan errors first before the "no content" error. What is not as straight-forward is how to address the bad scanner state. Here are the different options I have explored:

  1. replace bufio.Scanner with bufio.Reader (or really any io.Reader implementation)
    • The bufio package recommends using bufio.Reader for the sequential read use case (see the end of the description for Scanner). We can see why with the scanner state issue we are running into here.
    • This breaks the Buffer() option (how important is the use case for this option?)
  2. Reset the scanner
    • Breaks the Buffer() option, or you have to add a bunch of logic to check if the buffer was configured and somehow clear that byte array
  3. Exit early and report a bad client state
    • panic (kind of heavy-handed)
    • add an error return value to the signature (breaking change)
    • depend on the file-level error (this does not make it clear the client is no longer usable)

I suggest option 1 for the more reliable design, or option 3 for the simplest solution (and least api disruption).

barasher commented 2 years ago

Thanks for the analysis !

Backward compatibility is very important to me. I hate when libraries I use have breaking changes in their API :p Since I think that handling a directory is not that useful, I've introduced a new sentinel error ErrNotFile that can be return when invoking extractMetadata and I've chosen to implement the option 3 you proposed.

Here is the PR : https://github.com/barasher/go-exiftool/pull/55

What do you think ?

jnathanh commented 2 years ago

@barasher the directory check seems like a sensible feature given this library's lack of support for directories. However, it only addresses a symptom of the issue I was trying to raise. You could also trigger the same issue with a single file, if it is big enough to fill the scanner's buffer (see test below). This is likely an edge case since metadata files are usually small (at least the photo metadata I'm used to working with). So, you may want to leave it unresolved for now, however, I want to make sure you're aware that is the decision you are making by only addressing the issue with the directory check.

func TestLargeMetadata(t *testing.T) {
    t.Parallel()

    bigFile, cleanup := createFileLargerThan(t, bufio.MaxScanTokenSize)
    defer cleanup()

    e, err := NewExiftool()
    require.NoError(t, err)
    defer e.Close()

    // big file should not error, but if it does, it should output the actual source of the error
    bigResults := e.ExtractMetadata(bigFile)
    require.Len(t, bigResults, 1)
    assert.NoError(t, bigResults[0].Err)

    // subsequent small file should also not be affected by previous file reads
    smallResults := e.ExtractMetadata("testdata/20190404_131804.jpg")
    require.Len(t, smallResults, 1)
    assert.NoError(t, smallResults[0].Err)
}

func createFileLargerThan(t *testing.T, byteCount int) (path string, cleanup func()) {
    t.Helper()

    dirPath := t.TempDir()
    cleanup = func() { os.RemoveAll(dirPath) }

    path = filepath.Join(dirPath, "20190404_131804.jpg")
    require.Nil(t, copyFile("testdata/20190404_131804.jpg", path))

    e, err := NewExiftool()
    require.NoError(t, err)
    defer e.Close()

    mds := e.ExtractMetadata(path)
    require.Len(t, mds, 1)
    require.NoError(t, mds[0].Err)

    valueLargerThanBuffer := ""
    for i := 0; i < byteCount+1; i++ {
        valueLargerThanBuffer += "x"
    }

    mds[0].SetString("Comment", valueLargerThanBuffer)
    mds[0].Err = nil

    e.WriteMetadata(mds)
    require.NoError(t, mds[0].Err)

    return
}
barasher commented 2 years ago

I totally agree.

Nevertheless, currently, the Buffer(...) option can be used to customize go-exiftool's buffer. This way, we can extract any "amount" of metadata.

The only advantage I see using a bufio.Reader instead of a bufio.Scanner is the "streaming" (gradually) consumption of the buffer. This way, we might need a smaller buffer but the implementation will be really tricky.

jnathanh commented 2 years ago

Ya, I think using a larger buffer is an acceptable solution. The problem is that a new user, using the program defaults, would never find out that is the issue without extensive debugging. Ideally the program should report that the buffer filled up so a user know they need to use a larger buffer. Even worse, the output actually reports that the metadata was empty, so without investigating further, a user would think that there was just not metadata in the file.

Based on your feedback so far, I suggest adding a panic. That would make it clear what the issue is and not report back incorrect metadata results.

barasher commented 2 years ago

I created #56 to migrate from Scanner to Reader (and to return a cleaner response). Anyway, thanks a lot for the bug report ! Fix will be released in v1.7.1