Allow for streaming operation

j4m3z0r commented 6 years ago

Hi there, I'm trying to use MetadataExtractor on an input that is being streamed over the network. I've implemented a System.IO.Stream subclass that reads data from the network and buffers data that has already been read to allow seeking to work.

Unfortunately, MetadataExtractor seems to read in the entire file before processing any of the metadata. It looks like it might be this line that is the culprit:

https://github.com/drewnoakes/metadata-extractor-dotnet/blob/master/MetadataExtractor/Formats/Tiff/TiffReader.cs#L67

I note that there is already a comment there about this exact issue with a TODO on it. Is there some way to fix this without breaking anything? Alternatively, how about adding an option to allow for stream-based processing and just skip that check when it's set?

drewnoakes commented 6 years ago

Hey James,

Thanks for the report and the research. You're right, reading the length is a problem. I think this is the cause of #111 too.

The line you highlight is only there to workaround rare and invalid images. It seems wrong to pay this cost for valid images.

However there are several usages of Length in TiffReader. All of these are used for validation, and they are hit quite often.

Running the code across the sample image library, we see the following tally for times that reading the Length prevented code going wrong.

Line	Hit Count	Purpose
67	0	First IFD offset OOB (attempt fix)
112	2	IFD offset OOB (halt)
132	12	IFD entries would go OOB (halt)
184	230	IFD entry pointer OOB (skip entry)
197	0	IFD entry truncated (skip)
205	0	IFD entry pointer OOB (skip entry) ¹
239	39	4 bytes after IFD point OOB (halt)

¹ This is actually a duplicate of L184, which is why it's never hit. I'll look at tidying that up a bit.

The cost of ignoring these valdiation checks could be OOBE, OOME, or just garbage being produced in the output. It's the inability to filter garbage that makes disabling validation a problem to my mind.

These numbers are just for TIFF too (Exif, TIFF, raw files, ...) and there are other usages of Stream.Length for other file types too.

Unless I'm missing a creative solution, it just isn't possible to reliably process TIFF data in a streaming format. It would be possible to architect TIFF data so that it were possible, however in practice the TIFF standard makes no such guarantees around layout.

In your case, can you work around the issue by providing the Length somehow? For example, if you're using HTTP then maybe you have a Content-Length header.

As an aside: the library's IndexedCapturingReader class sounds similar to your custom Stream. If there's some difference there, I'd be interested to hear about it. However that class will also have the problem you describe.

j4m3z0r commented 6 years ago

Hey Drew, thanks for the thoughtful and prompt reply!

If the Length calls ultimately resulted in querying the Stream object's Length property, this might be a non-issue -- I could either use Content-Length, or just return an impossibly large number to ensure those checks pass. However, my existing implementation of Length throws a NotImplementedException, and I'm not seeing that get thrown -- perhaps MetadataExtractor is swallowing that exception somewhere and falling back to seeking to the end of the stream to determine its length?

Regarding processing TIFF data in streaming mode, I agree that it's not possible in the general case -- TIFF's IFD offsets can point to anywhere. However, I submit that it's possible to do on most TIFF files you see in practice (annoyingly not all, but I'll return to that). I've implemented a streaming preview extractor with a minimal TIFF IFD parser that works like so:

Wrap the HttpContent instance in an object that stores bytes already read into an internal buffer, and use that buffer for cases where the caller requests data "in the past".
If the next byte that is requested is close to the end of the buffer (I've used 1MB as my threshold), read that much data from the HttpContent object, populating the internal buffer as we go, and then copy the requested data from the internal buffer.
If the next requested byte is beyond that threshold, perform a HTTP Range request to get that data. The cases where this is necessary always seem to be at the end of the file, and I'm yet to find a case where it's necessary to read more than 16KB to be able to parse the TIFF metadata -- this seems to be enough to get it to work for the annoying not-perfectly-streamable TIFF (well, raw) files. I also have a tail buffer system so that multiple requests to the end of the file don't result in multiple HTTP range requests.

In practice, most camera makers seem to opt to put the metadata at the start of the file so that you can do things like display thumbnails without having to load the entire raw file.

However, that's a layer of complexity that I don't think MetadataExtractor needs to worry about. The Stream interface defines seeking, reading and length operations, which can all be implemented on top of such a buffering construct. It seems like the root issue might just be that Length on IndexedCapturingReader does not forward to the Stream instance it wraps.

Perhaps something like this on IndexedCapturingReader would work:

        public override long Length
        {
            get
            {
                try
                {
                    return _stream.Length;
                }
                catch(NotSupportedException)
                {
                    IsValidIndex(int.MaxValue, 1);
                    Debug.Assert(_isStreamFinished);
                    return _streamLength;
                }
            }
        }

What do you think?

j4m3z0r commented 6 years ago

I tried this out and it seems to work. Just sent a PR with exactly that change. I tried to run the test-suite, but got stuck building it with the following message: The current .NET SDK does not support targeting .NET Standard 2.0. Either target .NET Standard 1.6 or lower, or use a version of the .NET SDK that supports .NET Standard 2.0.. If there's an easy fix for this (I'm on Linux; the build script looks to be Windows only, so not sure how to proceed with debugging), I'm happy to try again, but I figured that this change was pretty unlikely to break anything since it worked in my application.

drewnoakes commented 6 years ago

Thanks for the PR. Left one comment to consider.

As for the netstandard2.0 issue, it's not something I'm hugely familiar with. I want to support as wide a range of platforms as possible, ideally ranging back to net35 (see #121) but it seems to be causing others problems. I spend most of my time on net4* on Windows so don't hit too many of these problems.

What is the output of the following command on your machine?

$ dotnet --info
.NET Command Line Tools (2.1.4)

Product Information:
 Version:            2.1.4
 Commit SHA-1 hash:  5e8add2190

Runtime Environment:
 OS Name:     Windows
 OS Version:  10.0.16299
 OS Platform: Windows
 RID:         win10-x64
 Base Path:   C:\Program Files\dotnet\sdk\2.1.4\

Microsoft .NET Core Shared Framework Host

  Version  : 2.0.5
  Build    : 17373eb129b3b05aa18ece963f8795d65ef8ea54

j4m3z0r commented 6 years ago

Arh, for some reason rebuilding now as part of the solution I was using to test with now fails with the same message.

$ dotnet --info
.NET Command Line Tools (1.1.0)

Product Information:
 Version:            1.1.0
 Commit SHA-1 hash:  d6f4336106

Runtime Environment:
 OS Name:     ubuntu
 OS Version:  17.10
 OS Platform: Linux
 RID:         ubuntu.16.10-x64
 Base Path:   /usr/share/dotnet/sdk/1.1.0

For reference, my dotnet binary seems to have come from Microsoft's dotnet-host package:

$ dpkg -S `which dotnet`
dotnet-host: /usr/bin/dotnet
$ apt show dotnet-host
Package: dotnet-host
Version: 2.0.0-preview2-25407-01-1
Status: install ok installed
Priority: standard
Section: libs
Maintainer: Microsoft <dotnetcore@microsoft.com>
Installed-Size: 141 kB
Depends: libc6 (>= 2.4), libgcc1 (>= 1:3.0), libstdc++6 (>= 4.8)
Conflicts: dotnet, dotnet-nightly
Homepage: https://dotnet.github.io/core
Download-Size: unknown
APT-Manual-Installed: no
APT-Sources: /var/lib/dpkg/status
Description: Microsoft .NET Core Host - 2.0.0 Preview 2
 .NET Core is a development platform that you can use to build command-line applications, microservices and modern websites. It is open source, cross-platform and is supported by Microsoft. We hope you enjoy using it! If you do, please consider joining the active community of developers that are contributing to the project on GitHub (https://github.com/dotnet/core). We happily accept issues and PRs.

For reference, when I tried to build the test program, I did so using msbuild, which appears to be a Xamarin package:

$ dpkg -S `which msbuild`
msbuild: /usr/bin/msbuild
$ apt show msbuild
Package: msbuild
Version: 1:15.4+xamarinxplat.2017.09.14.16.14-0xamarin2+ubuntu1404b1
Status: install ok installed
Priority: optional
Section: unknown
Maintainer: Jo Shields <joshield@microsoft.com>
Installed-Size: 42.1 MB
Depends: mono-runtime (>= 3.0~), libmono-corlib4.5-cil (>= 4.0.0~alpha1), libmono-microsoft-build-framework4.0-cil (>= 3.6.0), libmono-microsoft-csharp4.0-cil (>= 1.0), libmono-system-componentmodel-composition4.0-cil (>= 3.0.6), libmono-system-configuration4.0-cil (>= 4.0.0~alpha1), libmono-system-core4.0-cil (>= 4.0.0~alpha1), libmono-system-data4.0-cil (>= 4.0.0~alpha1), libmono-system-drawing4.0-cil (>= 3.0.6), libmono-system-identitymodel4.0-cil (>= 4.0.0~alpha1), libmono-system-io-compression-filesystem4.0-cil (>= 3.2.1), libmono-system-io-compression4.0-cil (>= 3.2.1), libmono-system-net-http4.0-cil (>= 1.0), libmono-system-numerics4.0-cil (>= 1.0), libmono-system-runtime-serialization4.0-cil (>= 4.0.0~alpha1), libmono-system-security4.0-cil (>= 1.0), libmono-system-servicemodel4.0a-cil (>= 3.2.3), libmono-system-transactions4.0-cil (>= 1.0), libmono-system-windows-forms4.0-cil (>= 1.0), libmono-system-xaml4.0-cil (>= 1.0), libmono-system-xml-linq4.0-cil (>= 3.0.6), libmono-system-xml4.0-cil (>= 3.12.0), libmono-system4.0-cil (>= 4.0.0~alpha1), libmono-windowsbase4.0-cil (>= 3.0.6), mono-xbuild (>= 1.0), msbuild-libhostfxr
Homepage: https://github.com/mono/msbuild
Download-Size: unknown
APT-Manual-Installed: no
APT-Sources: /var/lib/dpkg/status
Description: build platform for .NET and Visual Studio
 The Microsoft Build Engine is a platform for building applications.
 This engine, which is also known as MSBuild, provides an XML schema
 for a project file that controls how the build platform processes
 and builds software. Visual Studio uses MSBuild, but MSBuild does
 not depend on Visual Studio. By invoking msbuild.exe on your
 project or solution file, you can orchestrate and build products
 in environments where Visual Studio isn't installed.
 .
 This package contains the main msbuild build system

I note that I'm an older version of dotnet than you, so it's possible this is all a symptom of my updateophobia. I'll apply all updates and see if that changes things.

drewnoakes commented 6 years ago

This link has install instructions for Microsoft's official SDK package for Ubuntu 17.10.

https://docs.microsoft.com/en-us/dotnet/core/linux-prerequisites?tabs=netcore2x#install-net-core-for-ubuntu-1404-ubuntu-1604-ubuntu-1610--linux-mint-17-linux-mint-18-64-bit

I believe dotnet-host just provides the CLI tool (muxer) that exposes commands from other packages such as the SDK.

j4m3z0r commented 6 years ago

Got it. Looks as though many of my sources were disabled in my last OS update. I've re-enabled everything (including the repos listed on that page). Doing system upgrade now and will then install the dotnet-sdk package. Will report back once that's all done.

j4m3z0r commented 6 years ago

Ok, so the dotnet-host package is a dependency of dotnet-sdk, so it seems that it was the right package. I now see this:

$ dotnet --info
.NET Command Line Tools (2.1.4)

Product Information:
 Version:            2.1.4
 Commit SHA-1 hash:  5e8add2190

Runtime Environment:
 OS Name:     ubuntu
 OS Version:  17.10
 OS Platform: Linux
 RID:         linux-x64
 Base Path:   /usr/share/dotnet/sdk/2.1.4/

Microsoft .NET Core Shared Framework Host

  Version  : 2.0.5
  Build    : 17373eb129b3b05aa18ece963f8795d65ef8ea54

However, when I run msbuild, I now get a different set of errors (633 of them, in fact), which look like they can't find the standard libraries. Eg:

  IO/SequentialReader.cs(282,63): error CS0246: The type or namespace name 'Encoding' could not be found (are you missing a using directive or an assembly reference?) [/home/james/src/3rdparty/metadata-extractor-dotnet/MetadataExtractor/MetadataExtractor.csproj]

There are similar errors for Linq and Dictionary among other fairly common types. I'm invoking msbuild like so, inspired by Build.ps1:

msbuild MetadataExtractor/MetadataExtractor.csproj /t:Restore,Build,Pack /p:Configuration=Release /p:PackageOutputPath=../artifacts

Not sure what's going on.

j4m3z0r commented 6 years ago

Ok, so the mystery of msbuild failing is one I'll need to investigate further, however, I can now build MetadataExtractor.Tools.FileProcessor from my IDE (Rider). Running the tests before and after this change show that nothing new is broken. I've updated my branch with the requested change.

drewnoakes commented 6 years ago

I wish I could be more help with the build problem but it's not something I have a tonne of familiarity with. If you do discover the problem/fix, please post it here for others too.

Cheers again for the PR. I think this should make a difference to those using FileStream too, which is the most popular stream type, I'd imagine.

j4m3z0r commented 6 years ago

Randomly restarting things after updating eventually got me to a place where I could build, which is not terribly satisfying. However, I no longer have any issues here so I'm going to go ahead and close this.

j4m3z0r commented 6 years ago

I may have been too hasty in closing this issue. Leaving the build issues aside (I can no longer repro them), it seems that there are some other ways that IndexedCapturingReader can inadvertently read the entire input stream. Eg: GetUInt16 in IndexedReader will call ValidateIndex in IndexedCapturingReader, which for those IFDs at the end of the file will result in it consuming the entire file.

It seems as though IsValidIndex is where the data is actually read, and that an index is considered valid only if the data is already held in memory, and it will only read sequentially (ie: no gaps).

Re-phrasing, it seems as though it has been engineered not to assume that streams are seekable (or perhaps to assume that seeking may be costly).

Perhaps the best solution would be to allow the caller to pass in an instance of IndexedReader directly, rather than having one constructed when you pass in a Stream object? That way the caller could make the trade-offs appropriate for their problem. However, that would require a fairly substantial interface addition: the addition of a method to all of the *MetadataReader classes.

So the two options I see for resolving this are:

Modify IndexedCapturingReader to use the Seek method on System.IO.Streams when CanSeek is true, and consider an index valid if the index is less than the stream's length, OR
Add methods to the *MetadataReader classes that allow the caller to pass in their own IndexedReader subclass which can be implemented with that strategy.

I'll understand if this isn't something you want to have built into your library. It's kind of a must-have feature for me, so I'd end up maintaining a fork in that case (which is totally fine). I do think this is a valuable addition though, so I'm definitely interested in figuring out a way to build this into your tree if you're open to it.

Thoughts?

kwhopper commented 6 years ago

While it won't be immediately helpful in this case, I've been working from time to time on a version of MetadataExtractor that has a single reader for all cases. It assumes all Streams are seekable from the usage side, but if the underlying Stream isn't then it will use buffering mitigation inside the reader similar to the indexed reader. It also allows for other interesting features if we want them, such as easily recording the exact index of a segment or tag.

Maybe it will work, maybe not - but it's a simpler reading model for sure and might be a basis for further revision to address this issue. In your case, if your stream wrapper inherits from Stream and reports that it's seekable then you would be able to pass it as the source of data and no extra buffering would be needed.

The buffering part of that is not complete yet, although the seekable reader code is working for the basic Jpeg, Exif, and Tiff reader classes already in the library. I'll post it at some point as a brand new branch to get some feedback. I won't even try to move everything into it until the buffering portion is working.

drewnoakes commented 6 years ago

@j4m3z0r you're right, there's more we can do here. Sounds like @kwhopper has some promising inroads underway here too.

Two issues come to mind:

buffering to the end of the stream when an invalid offset is provided, just to prove it's invalid, when we may be able to use the stream's length
we never utilise stream seeking

On the first point, I ran some experiments, modifying IsValidIndex to use the stream length if possible. I also updated the test harness program to output both the sum of all file sizes as well as the sum of the stream positions after extraction. This essentially gives us a measure, before and after, of whether we reduced the amount of buffering.

The results look good. For 1,357 files (2,058,362,410 bytes).

Before: 251,282,085 bytes

After: 42,560,067 bytes

That's a big difference. Though there are some large raw files in there that might make up the bulk of it. I haven't broken it down by file type. Still, it can't hurt if you're trying to reduce the total amount read from the stream.

What was interesting too was that performance actually went down for the entire scan by about 20%.

Before:

Completed in 18,683.58 ms Completed in 18,740.51 ms Completed in 18,695.66 ms

After:

Completed in 22,578.67 ms Completed in 22,315.58 ms Completed in 22,240.17 ms

These runs were against files on a reasonably snappy SSD, so other sources (network/slower disks) might see an improvement.

I think the reduction in performance can be attributed to reduced buffering. Loading data into memory in chunks before using it can't hurt. However I haven't profiled this so it's pure speculation really.

With this change there's definitely less allocation/pressure on the GC. I think I'll commit and push them.

The current situation with different reader classes was inherited from the original implementation in Java. Java has a different model of streams, and I was also a much less experienced developer 15 years ago. There's room for improvement in both language implementations. The .NET version could likely take better advantage of the SDK's APIs.

And on that topic there are some very interesting changes coming to .NET that'll allow some nice abstractions over memory (Span<T> and other various memory types). These should allow robust parsing with less allocation of intermediate heap objects, for example. I think there'll be some good stuff coming for the low-level gritty byte streaming and processing code.

I'll push up the change that causes fewer bytes to be read. Test it out if you get a chance.

drewnoakes commented 6 years ago

Ok that didn't work at all. I was focussing on the wrong thing.

I will have to think about this another time. I've rolled back master for now.

j4m3z0r commented 6 years ago

So I just put together a PR that implements this and works for what I need, though it is not without some compromises. See here:

https://github.com/drewnoakes/metadata-extractor-dotnet/pull/125

I've done it so that Streams that don't support Length or Seeking fallback to (an approximation of) the original algorithm of pre-fetching the whole file.

On my system, running the FileProcessor test suite per the contribution instructions takes about 2000ms now, whereas it was 1300ms before on this system. I spent some time trying to recover the lost performance, but I wasn't able to get it to be as fast as the original version. Some interesting observations:

For FileStreams, it is (significantly) faster to read the whole file in at the start -- I don't know why. I've added a check for this case in particular and force contiguous mode. I actually suspect we could do better yet by issuing a single read for the whole file into a single buffer in this case.
Since the buffered regions are no longer guaranteed to be contiguous, I'm using a Dictionary to map from chunk index to the buffered data. This seems to have more of a performance impact than I expected. To mitigate, I've added a memoization step to handle the common case of fetching many bytes from the same chunk sequentially.
I increased the default chunk size to 4KB (from 2KB), since HttpClient seems to read data in blocks of 4KB.

I also ended up removing _isStreamFinished, and there were a few other tweaks. I don't think I made it less correct, and the test-suite is unchanged.

Anyway, the impact for me is pretty huge: I can now get RAW file metadata with a few hundred KB of network traffic as opposed to 10s of MBs. Take a look and see what you think.

j4m3z0r commented 6 years ago

I'm a bit confused by GitHub's messages about what can and can't be merged, so I'm unsure if the HEAD of my branch is the appropriate one to look at. To allay any confusion, this is the pertinent changeset:

https://github.com/j4m3z0r/metadata-extractor-dotnet/commit/6f7e2f0dc7c265efe60df3b4846e3b476f970b00

drewnoakes commented 6 years ago

Don't worry too much about the conflicts. I can sort that out manually. I'm guessing it happened because your PR #123 was from your master branch, but when I merged it I rebased it. So then your master and the upstream master were out of sync before you started the next PR. But if you're going to spend time on anything, it'd be more useful to spend it on the unit tests.

j4m3z0r commented 6 years ago

I just started taking another look at this, but I'm running into trouble building the project, and I'm getting the same issue on a clean checkout of master. Here's the error I'm seeing:

Formats/Xmp/XmpDirectory.cs(57,16,57,24): error CS0433: The type 'IXmpMeta' exists in both 'XmpCore.StrongName, Version=5.1.3.0, Culture=neutral, PublicKeyToken=961f4f366277b80e' and 'XmpCore, Version=5.1.3.1, Culture=neutral, PublicKeyToken=null'
Formats/Xmp/XmpDirectory.cs(85,42,85,50): error CS0433: The type 'IXmpMeta' exists in both 'XmpCore.StrongName, Version=5.1.3.0, Culture=neutral, PublicKeyToken=961f4f366277b80e' and 'XmpCore, Version=5.1.3.1, Culture=neutral, PublicKeyToken=null'

I'll keep poking at this, but if this is a known issue with an easy fix, I'd love to hear about it. This is on Visual Studio on Mac, which updated itself just this morning. Cleaning the solution and rebuilding doesn't seem to change anything.

j4m3z0r commented 6 years ago

Ok, I found a workaround: delete the reference to the XmpCore.StrongName package. I also had to change the version of the test project to .net core 2.0. I'll omit both these changes once I get to putting together a PR.

j4m3z0r commented 6 years ago

Ok, I just pushed some changes to the PR I made:

Rebased against current master
Added test cases for all permutations of seekable and length supported
Fixed erroneous handling of the length value that was revealed by aforementioned testcases.

I've actually not used xunit before, so I hope I got that right -- let me know if not. The integration with VS is pretty neat! :)

drewnoakes / metadata-extractor-dotnet

Allow for streaming operation #122