Hi Carlos,
Thanks for the feedback. I'll try and address each of your points below.
1) No set tuning data function: I'm a bit confused here, so I hope you can clarify. The phase 2 API isn't backwards compatible with the phase 1 API, and that would be true even if set_tuning_data existed in both APIs, so I'm not sure why it is the only hurdle to maintaining a single executable. I agree with you that this is not ideal. The long term solution will hopefully be the release of IJB-B with updated protocols. If set_tuning_data really is the only thing preventing you from having a single executable, then adding a call to the phase 2 API strictly for backwards compatibility (i.e. it would never be called during a CS3 evaluation) is certainly possible.
2) Detection confidence: The face detection protocol only includes still images and the key frames from the videos so tracking is not evaluated during this test. This was done because we could not feasibly annotate every frame in a video and instead annotated only the key frames. This makes accurately evaluating tracks challenging. Instead, tracking will be implicitly evaluated by the clustering and 1-N search with the video probe protocols. With this distinction in mind is it still necessary to have a per-frame detection confidence and an overall track confidence?
3) The meaning of cluster confidence: You are correct that a PR curve can't be formed using the cluster confidence. The metrics we are planning on using to evaluate clustering are still under internal discussion and review and will be released along with the clustering protocol in the near future. I'll be able to comment on the specific use for cluster_confidence when we have those finalized (it is possible that it will not be used).
4) Use of a similarity score instead of cluster magnitude: The meaning of a similarity score is performer dependent. I'm not sure how we could select one that has value to all performers. Is there some normalization scheme you are proposing for this value?
5) Repeated computation: We are actively trying to avoid repeated computation wherever we can for all of the reasons you give above. It was a main motivator behind moving from 10 splits with many templates shared across them to the current set-up of disjoint probes and galleries. Can you give examples in the current set of protocols where you see operations being run repeatedly?
Hi Jordan,
I'm not talking about binary compatibility. I'm not talking about source compatibility. I'm talking about conceptual compatibility. IJB-A and CS2 were built around the splits and the training view for each split. With the training views, features could be fine-tuned and embedded for each split. This was important for obtaining the best results, and it was achieved through calls to set_tuning_data.
What I'm saying is that without set_tuning_data there will be no way to implement a harness for IJB-A. Not if we recompile. Not with one program, or with n programs. Therefore evaluating phase 1 to phase 2 progress will not be possible. You suggest there are other hurdles to implementing an IJB-A harness over the phase 2 API; I haven't gotten that far.
Let me make our position clear. It is not that we want there to be a set_tuning_data. What Maryland really wants is to avoid a situation where, in phase 2, Maryland must maintain both a phase 1 API implementation and a phase 2 API implementation so that IJB-A can be evaluated on our phase 2 deep networks in an independent T&E.
Let me think about (2). We agree about (3).
For the similarity score you can do many things to normalize and obtain a reasonable threshold. For example, look at the ROC (say, CS3 1:1) and grab the threshold that achieves a certain false positive rate (say 0.001). Every performer will be able to do this. The important thing is that anything you do here sounds a lot more reasonable than: we want to know how many distinct identities are present in a collection, and we start by giving you the answer we want (K). In any realistic scenario K will not be available.
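To make that concrete, something along these lines would do (sketch only; the helper is illustrative and not part of the API):

#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Illustrative helper, not part of the API: given the impostor (non-mated)
// scores from a 1:1 run, return the similarity threshold that yields
// approximately the requested false positive rate.
double threshold_at_fpr(std::vector<double> impostor_scores, double target_fpr)
{
    if (impostor_scores.empty())
        return 0.0; // nothing to calibrate against

    // Sort descending so the first k entries are the k highest impostor scores.
    std::sort(impostor_scores.begin(), impostor_scores.end(), std::greater<double>());

    // Accepting everything at or above the k-th highest impostor score gives
    // an FPR of roughly k / N, so pick k = floor(target_fpr * N).
    std::size_t k = static_cast<std::size_t>(target_fpr * impostor_scores.size());
    if (k >= impostor_scores.size())
        k = impostor_scores.size() - 1;
    return impostor_scores[k];
}

With target_fpr = 0.001 on the CS3 1:1 impostor scores, every performer ends up with their own threshold that has the same operational meaning.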
It seems, looking at the new harness, that when processing the file protocol/cs3_1N_probe_mixed.csv, the line item SUBJECT_ID='9997', FILE='img/101771.jpg' will be present in 4 calls to janus_create_template, as will SUBJECT_ID='9996', FILE='img/107017.jpg'; every other line in that csv occurs at least twice in the file.
In total it seems that 62k lines in the csv file will be processed between 2 and 4 times, so what could be 62k computations of descriptors balloons into 229k computations of descriptors. I'm not sure if I'm missing something; if not, this seems like wasteful computation. Depending on how much processing each line item requires (is there keypoint computation done for each line item? is there an invocation of the detector done for each line item?), this 3.5x extra computation can result in a fairly significant amount of time, in absolute terms.
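For reference, the counts above came from a quick check along these lines (the FILE column index is an assumption, so verify it against the actual header before trusting the numbers):

#include <cstddef>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>

// Rough sketch for counting repeated line items; assumes a comma-separated
// file and that file_column is the zero-based index of the FILE column.
int main()
{
    const std::string path = "protocol/cs3_1N_probe_mixed.csv";
    const std::size_t file_column = 2; // assumption, check against the header

    std::ifstream in(path);
    std::string line;
    std::getline(in, line); // skip the header row

    std::unordered_map<std::string, std::size_t> counts;
    while (std::getline(in, line)) {
        std::stringstream fields(line);
        std::string field;
        for (std::size_t i = 0; std::getline(fields, field, ','); ++i) {
            if (i == file_column) {
                ++counts[field];
                break;
            }
        }
    }

    std::size_t total = 0, repeated = 0;
    for (const auto &entry : counts) {
        total += entry.second;
        if (entry.second > 1)
            repeated += entry.second;
    }
    std::cout << counts.size() << " unique files, " << total
              << " total line items, " << repeated
              << " line items that reuse a file\n";
    return 0;
}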
Best,
Carlos
Hi Carlos,
(1) I understand your point now. Let me think about the set of updates that would be required to run IJB-A through the phase 2 protocol and get back to you on this.
(4) This is interesting. @paddygr does NIST have thoughts on the pros and cons of this type of similarity score instead of an order of magnitude hint?
(5) We repeat imagery to build a larger and more comprehensive set of probe templates for the evaluation. In this case detection is not required. There is a possibility that a consistent set of operations is repeated across each of these images when they appear multiple times (for example, as you suggest, landmark detection). Caching these operations, however, would be complicated at the API level. My initial thought is that it would require an intermediate data type (janus_preprocessed_image perhaps) and an additional function call. An updated path for template creation would then be something along the lines of (pseudocode):
typedef struct janus_preprocessed_image_type *janus_preprocessed_image; // performer-defined type

for (all_images) {
    // This step would be optional depending on the protocol
    janus_media input_image;
    std::vector<janus_track> tracks;
    janus_detect(input_image, min_size, tracks);

    // Preprocess and cache the image
    janus_preprocessed_image pp_image;
    janus_preprocess_image(input_image, tracks, pp_image);

    // ... serialize and save the pp_image to disk or a db or whatever here
}

// Load the list of images in each template
for (template in list) {
    // Load the pp_images from the cache for this template
    janus_create_template(pp_images, template);
}
Is this something like what you had in mind?
Hi Jordan,
Yes, something like that is what I had in mind.
Best,
Carlos
Rather late here, apologies, but my thoughts on the various issues:
IJB-A: The IJB-A protocol doesn't require training, it just allows it. Yes, the Phase 1 code included calls for fine-tuning, but the IJB-A test itself can proceed while ignoring the training partitions (as we did with COTS). The new API supports gallery training, but not operations on a defined partition. However, the Phase 2 deliverable allows a test harness to execute 1:1 and open-set 1:N on CS2, CS3, or entirely different data, including the IJB-A test data. There, training/fine-tuning is now off the table, as those partitions cannot be used; you are reliant only on external training data for any fine-tuning. Yes, this renders comparisons between your Phase 1 and Phase 2 results systematically different, but I think it reflects the decision that splits and training partitions are operationally rare. I think you do not need to support the Phase 1 API.
Clustering hint: I'm fine with the proposed minimum similarity score. It's not portable across algorithms, but that's OK. I disagree that K, the number of identities (an order of magnitude), would be impractical. Child exploitation investigations might know the number of victims on any seized hard drive is below O(10^3). A driver's license database size would be known, say O(10^7). The backdrop here is that while clustering is much sought after by analysts in the government, it is problematic in an evaluation because in practice it is a human-led iterative process in which some parameter is set (score, K, etc.), the results are visually inspected, and a re-clustering is done. The re-clustering might be fast if the input score is raised; it might be expensive if the parameter is reduced. I think clustering is not the easiest way to do comparative assessment of underlying face recognition capability. The inclusion of clustering in Phase 2 is in large part to prepare for transition of the capability.
Caching: This discussion is an artifact of CS2/3 reuse of imagery, which is atypical operationally. I thought this was less important in CS3 as it doesn't use as many splits. Surely you can again use Redis internally, hiding it from the calling application; the Phase 2 deliverable could even omit it. I don't think the API needs to support this. Operationally, the calling application would, for example, put templates in a SQL database and not regenerate them unnecessarily.
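Something along these lines inside your library would be invisible to the calling application (illustrative only; the types are not part of the API, and the in-memory map could just as well be backed by Redis or an embedded database):

#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative only: a process-local cache inside the performer's library,
// never exposed through the API. The key and value types are assumptions;
// a real implementation might key on a hash of the pixels instead.
struct CachedFeatures {
    std::vector<float> descriptor; // whatever per-image work is worth reusing
};

class PreprocessCache {
public:
    // Return the cached result for image_path, computing and storing it on
    // the first request. compute stands in for the expensive per-image work
    // (detection, landmarking, descriptor extraction) that would otherwise
    // be repeated for every duplicated line item.
    template <typename Compute>
    const CachedFeatures &get_or_compute(const std::string &image_path,
                                         Compute compute)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = cache_.find(image_path);
        if (it == cache_.end())
            it = cache_.emplace(image_path, compute(image_path)).first;
        return it->second;
    }

private:
    std::mutex mutex_;
    std::unordered_map<std::string, CachedFeatures> cache_;
};

janus_create_template would then consult the cache before re-running detection and alignment, so a repeated line item costs a lookup instead of a full pass.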
Hi Carlos, can you respond to Patrick when you have the chance? I'd like to get a consensus here before I make changes.
Thanks! Jordan
Hello Jordan and Patrick,
I think we're pretty much on the same page. Let me summarize:
Best,
Carlos
Hi Carlos,
Thanks for the response. I agree on all points. With that in mind I am not going to make changes to the API based on this discussion at this time. Please let me know if there is still a change you would like to see added.
Best, Jordan
On your third bullet, "All parts understand that from now on the results that Maryland publishes on IJB-A will not be reproducible with the deliverable," I either don't understand or don't agree. I think the Phase 2 deliverable applied to the IJB-A data by either NIST or UMD should produce identical results. IJB-A would proceed without use of the training partitions.
Patrick
From: "Carlos D. Castillo" notifications@github.com<mailto:notifications@github.com> Reply-To: biometrics/janus reply@reply.github.com<mailto:reply@reply.github.com> Date: Monday, June 6, 2016 at 12:10 PM To: biometrics/janus janus@noreply.github.com<mailto:janus@noreply.github.com> Cc: Patrick Grother patrick.grother@nist.gov<mailto:patrick.grother@nist.gov>, Mention mention@noreply.github.com<mailto:mention@noreply.github.com> Subject: Re: [biometrics/janus] High level API issues (#35)
Hello Jordan and Patrick,
I think we're pretty much on the same page. Let me summarize:
Best,
Carlos
You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/biometrics/janus/issues/35#issuecomment-224006877, or mute the threadhttps://github.com/notifications/unsubscribe/AOuSte1Q18Jtmi8R2JiZ9URDJjnazB8iks5qJEZxgaJpZM4IlJI8.
When everybody else in the community is exercising their option to train on IJB-A splits, Maryland will exercise its option to train on IJB-A splits, and the curve(s) shown in the paper(s) we publish will reflect Maryland exercising that option. Given that the deliverable won't have set_tuning_data, the deliverable won't reproduce the curves in the paper.
I believe this is resolved.
Maryland is concerned about the following high level issues with respect to the Phase 2 API:
As there will not be any training data in CS3 and no support for tuning in the API, it will not be possible for the Government to run IJB-A with the phase 2 deliverable: IJB-A requires set_tuning_data, and the phase 2 deliverable will not have it. It is difficult to maintain two distinct recognition deliverables, one with the phase 1 API and one with the phase 2 API. In practical terms, the historical record of results on IJB-A will effectively end when CS3 is released.
We would like to know more about the “detection_confidence”. We think that both detection confidence and tracking confidence should be used. We are required to provide a single detection score for a track, whereas we have scores corresponding to faces in every frame of the track. It is difficult to distinguish detection confidence between two frames of the same video, which would be an issue during the face detection evaluation.
We think that both detection confidence and tracking confidence are important for indicating the quality of a face track and should be reported/required. They are different things and have different meanings.
The output of a clustering algorithm is the pair (cluster_id, cluster_confidence) required for janus_track, and cluster_confidence also needs clarification. A precision-recall curve cannot be built by thresholding cluster confidences: if a cluster ceases to exist, its data items need to be assigned to a different cluster; they cannot be entirely dropped.
Supervised clustering: in this case, we are supposed to cluster a collection of unlabelled people into distinct identities. We will be provided with bounding boxes for those people, and there is a hint parameter that gives us the approximate number of clusters (an order-of-magnitude estimate). We feel that assuming we have (or almost have) K is impractical. We would instead suggest providing the lowest similarity at which two descriptors should be clustered together.
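To illustrate what we mean by that hint (sketch only; the similarity function and types are placeholders, not the API), any pair of descriptors whose similarity reaches the hint is linked into the same cluster, and the number of clusters falls out rather than being supplied up front:

#include <cstddef>
#include <numeric>
#include <vector>

// Sketch only: single-linkage clustering driven by a minimum-similarity hint.
// Two items end up in the same cluster whenever they are connected by a chain
// of pairs whose similarity is at or above min_similarity; no K is needed.
// similarity(i, j) is a placeholder for a performer's descriptor comparison.
std::vector<int> cluster_by_threshold(std::size_t n, double min_similarity,
                                      double (*similarity)(std::size_t, std::size_t))
{
    std::vector<std::size_t> parent(n);
    std::iota(parent.begin(), parent.end(), 0);

    // Union-find with path halving.
    auto find = [&](std::size_t i) {
        while (parent[i] != i) {
            parent[i] = parent[parent[i]];
            i = parent[i];
        }
        return i;
    };

    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j)
            if (similarity(i, j) >= min_similarity)
                parent[find(i)] = find(j);

    // Relabel the union-find roots as consecutive cluster ids.
    std::vector<int> labels(n), root_to_id(n, -1);
    int next_id = 0;
    for (std::size_t i = 0; i < n; ++i) {
        std::size_t root = find(i);
        if (root_to_id[root] < 0)
            root_to_id[root] = next_id++;
        labels[i] = root_to_id[root];
    }
    return labels;
}

This is only meant to show that the threshold hint fully specifies the task; the actual clustering algorithm would be up to each performer.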
It seems from the CS3 protocol that the same line item needs to be processed many times. The face detection and alignment for a given line item will not change, yet it will be recomputed many times, which is wasteful. This was important for CS2 and is even more so for CS3. We tried to provide caching of results built around Redis and it was not very well received. We really need guidance on caching results, or a protocol design that does not evaluate the augmentation of the same line items over and over again. Avoiding repeated face detection on the same stills and frames would be a big win both for the evaluators and for us, since we will need to test the algorithm(s) many times.