Noblis / INVSC-janice

An open API for computer vision algorithms
https://noblis.github.io/janice/

janice_detect harness holds JaniceDetection objects #21

Closed taa01776 closed 6 years ago

taa01776 commented 6 years ago

Although using janice_detect_batch may improve efficiency, it can also consume a great deal of memory. JaniceDetectionType objects need to hold on to some form of the media behind them. Because the janice_detect harness calls janice_detect_batch on the entire input protocol file, all of the detections from the entire protocol, along with their media, are in memory at the same time. This could become problematic for large input sets.

JordanCheney commented 6 years ago

@taa01776 I can expose batch size as a parameter to detect. In practice, users will need to determine the optimal batch size based on the algorithm, available resources, etc. I'll add documentation for this in the harness as well.
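
To illustrate one way the harness could use such a parameter, here is a rough sketch that processes the protocol in fixed-size chunks so only one chunk's detections (and their media) are resident at a time. The detect_batch, write_detections_to_disk, and free_detections helpers are simplified stand-ins with assumed signatures, not the actual janice API.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct JaniceMediaIterator;  // opaque type from janice_io
struct JaniceDetection;      // opaque type from the SDK

// Hypothetical simplified wrappers around the real batch calls.
std::vector<JaniceDetection*> detect_batch(const std::vector<JaniceMediaIterator*>& media);
void write_detections_to_disk(const std::vector<JaniceDetection*>& detections);
void free_detections(std::vector<JaniceDetection*>& detections);

// Process the protocol in fixed-size chunks so at most `batch_size` detections
// (and whatever media they cache) are alive at any one time.
void run_protocol(const std::vector<JaniceMediaIterator*>& protocol, std::size_t batch_size)
{
    for (std::size_t begin = 0; begin < protocol.size(); begin += batch_size) {
        const std::size_t end = std::min(begin + batch_size, protocol.size());
        std::vector<JaniceMediaIterator*> chunk(protocol.begin() + begin,
                                                protocol.begin() + end);

        std::vector<JaniceDetection*> detections = detect_batch(chunk);
        write_detections_to_disk(detections);  // persist this chunk's results
        free_detections(detections);           // release the media held by these detections
    }
}
```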

carlosdcastillo commented 6 years ago

I think @taa01776 is touching on a fundamental issue here, and @JordanCheney, as a user of the library, is exercising it. Are detections supposed to be fast (and large) or succinct (and slow)? If the latter, we could hold on to the file name instead of the file contents. See my comments on janice_io.
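
To make the trade-off concrete, here are two illustrative internal layouts a detection object could use. Neither struct is part of the API; they are assumptions for the sake of discussion.

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct Rect { int x, y, width, height; };

// "Fast" layout: cache the decoded pixels so enrollment never re-reads the media.
// Memory cost is roughly frames * width * height * channels bytes.
struct FastDetection {
    std::vector<Rect>                 track;       // per-frame bounding boxes
    std::vector<std::vector<uint8_t>> frame_data;  // decoded pixels for each frame
};

// "Succinct" layout: remember only where the media lives; enrollment must
// reopen and decode it, trading CPU time for a tiny footprint.
struct SuccinctDetection {
    std::vector<Rect>     track;          // per-frame bounding boxes
    std::string           media_path;     // file name instead of file contents
    std::vector<uint32_t> frame_indices;  // frames the track covers
};
```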

taa01776 commented 6 years ago

There are a few issues here:

JordanCheney commented 6 years ago

@carlosdcastillo to your point on JaniceDetection being fast or succinct: because the memory footprint of JaniceMediaIterator is not defined and cannot be guaranteed to be small (see my first point), I would suggest being fast, with the important side effect that you control the memory usage from that point forward.

carlosdcastillo commented 6 years ago

@JordanCheney Let’s think this through. Are you sure you want to serialize those detections? For CS5, for example, a directory of serialized detections would be many tens of terabytes of uncompressed bytes representing every frame the detector was run on. To avoid this issue, people use JPG, and with JPG we’re talking about 88 pieces of 50 GB (the total distribution size of CS5).
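
As a rough back-of-envelope check of those numbers: only the 88 x 50 GB figure comes from the comment above; the 10:1 JPEG compression ratio below is an assumption for illustration.

```cpp
#include <cstdio>

int main()
{
    const double compressed_tb   = 88 * 50.0 / 1000.0;  // ~4.4 TB of JPG across 88 pieces
    const double jpeg_ratio      = 10.0;                 // assumed average compression ratio
    const double uncompressed_tb = compressed_tb * jpeg_ratio;

    // Prints roughly: compressed 4.4 TB, uncompressed estimate 44 TB,
    // i.e. "many tens of terabytes" if every frame is stored uncompressed.
    std::printf("compressed: %.1f TB, uncompressed estimate: %.0f TB\n",
                compressed_tb, uncompressed_tb);
    return 0;
}
```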

So long as we’re all on the same page, I’ll write the code, compile it, test serializing a couple of detections, and wish everybody good luck using it.

JordanCheney commented 6 years ago

@carlosdcastillo - Your comment has led to a long internal discussion on our end. The requirement that detections store image information was a request from a commercial provider who had a computationally expensive preprocessing step before detection and enrollment. The idea was that they could do the preprocessing before detection and then cache the result for enrollment. However, with updates to the API like janice_enroll_from_media, there are mechanisms for doing this caching internally in the detection+enrollment case. For janice_enroll_from_detections, we feel there is a strong assumption of a human in the loop, adjudicating multiple sightings of the same person to build a stronger template. Any added overhead from redoing operations on the media will be far smaller than the time the human takes to do the adjudication.

Based on this, we've concluded that the requirement that detections store image information is overly constraining. I propose the following changes (a sketch of the resulting enrollment flow follows below):

- JaniceDetection will still exist as an opaque type but will only be required to hold a JaniceTrack.
- Implementations can use a detection as an intermediate value cache if they would like.
- A detection can also hold optional metadata, such as gender and age, if the implementation supports it. Those values can be queried with janice_detection_get_attribute.
- janice_enroll_from_detections and janice_enroll_from_detections_batch will be amended to take the relevant JaniceMediaIterators as an input parameter.
- Functions that previously returned JaniceTracks (janice_enroll_from_media, janice_cluster_media) will now return JaniceDetections.

If we are all amenable to these changes, I will push them onto the v6.0 branch for review.
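
To make the proposed flow concrete, here is a rough sketch of the human-in-the-loop enrollment path under these changes. The detect, pick_one, and enroll_from_detections helpers are simplified stand-ins with assumed signatures, not the actual C API.

```cpp
#include <vector>

struct JaniceMediaIterator;  // opaque
struct JaniceDetection;      // opaque; under the proposal it only has to hold a JaniceTrack
struct JaniceTemplate;       // opaque

// Hypothetical simplified wrappers with assumed signatures.
std::vector<JaniceDetection*> detect(JaniceMediaIterator* media);
JaniceDetection* pick_one(const std::vector<JaniceDetection*>& candidates);  // human adjudication
JaniceTemplate* enroll_from_detections(const std::vector<JaniceDetection*>& detections,
                                       const std::vector<JaniceMediaIterator*>& media);

// Build a template from several sightings of the same person.
JaniceTemplate* build_template(const std::vector<JaniceMediaIterator*>& sightings)
{
    std::vector<JaniceDetection*> selected;
    for (JaniceMediaIterator* media : sightings) {
        std::vector<JaniceDetection*> candidates = detect(media);
        // A human adjudicates which detection in this medium is the person of interest.
        selected.push_back(pick_one(candidates));
    }

    // Because detections no longer carry pixels, the caller passes the same media
    // back in so the implementation can re-read whatever it needs.
    return enroll_from_detections(selected, sightings);
}
```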

carlosdcastillo commented 6 years ago

@JordanCheney This sounds good. The trade-off we're making is that, in exchange for a significant decrease (100-1000x) in memory footprint, we're delegating to the user of the library the responsibility of sending in exactly the same media for detection and template computation. If they don't handle this responsibility well, they'll get garbage.

The library implementer may use MD5 or a similar hash to verify that the image they got at detection time is the same image they're getting at feature-computation time.
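
One way an implementation could perform that check is to record a digest of the media at detection time and compare it at enrollment time. This is only a sketch: compute_digest stands in for MD5 (e.g. from OpenSSL), and DetectionRecord is a hypothetical internal type, not part of the API.

```cpp
#include <array>
#include <cstdint>
#include <stdexcept>
#include <vector>

using Digest = std::array<uint8_t, 16>;

// Stand-in for MD5 or a similar hash over the raw media bytes.
Digest compute_digest(const std::vector<uint8_t>& media_bytes);

struct DetectionRecord {
    Digest media_digest;  // recorded when the detection was created
    // ... bounding boxes, confidences, etc.
};

// Fail loudly if the media passed at enrollment time differs from the media
// that produced this detection.
void check_media_matches(const DetectionRecord& detection,
                         const std::vector<uint8_t>& media_bytes)
{
    if (compute_digest(media_bytes) != detection.media_digest)
        throw std::runtime_error("media passed to enrollment does not match the "
                                 "media used at detection time");
}
```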

JordanCheney commented 6 years ago

This was addressed in 89e0143.