Thanks for your submission, an editor will be assigned soon.
@benoit-girard Can you edit yourself or assign an editor?
Hi @rougier, thank you for the response. Do you have an idea how long the process of reviewing usually takes? Just wondering, since I am new to this :)
@JarlLemmens Sorry for the delay. The whole process can take up to six months, but it really depends on how fast we find an editor and reviewers and how fast you answer. To accelerate things a bit, you can post messages here such that I get a notification, and ask people to update their review.
@benoit-girard @gdetor Can one of you edit this submission?
@rougier Okay cool, thanks for the update!
@ReScience/editors Can any of you edit this submission?
Sorry for the lack of responsiveness. I will handle the editing of this submission.
@JasonGUTU Would you be interested in reviewing this ReScience submission?
@hkashyap or @birdortyedi : would you be interested in reviewing this submission?
Sure, this is interesting to me and I can review it. @benoit-girard
So @hkashyap will be the first reviewer, and @stepherbin will be the second one. Thanks to both of you for accepting this review.
Dear replication authors.
Here is a review of your work.
The objective of the authors is to verify the claim "that Knowledge Graph re-optimization can increase recall, while maintaining mean Average Precision (mAP)" on detection algorithms by replicating the learning process and evaluation on two standard datasets, Pascal VOC and MS COCO, as in the paper of [Fang et al., 2017]. Their conclusion is that the proposed approach only improves performance on the MS COCO dataset while slightly degrading performance on Pascal VOC, contrary to the original paper, and therefore depends on the model used. The authors provide a full implementation based on PyTorch.
The presentation of the method to be replicated is clear and well summarized. One piece of information missing from the original paper and needed for reproduction (the fact that initial probabilities for re-optimization have to be set to zero) is clearly documented.
I expected a deeper analysis of the algorithm to explain its limits. I see this task as a central part of the replication study. Although the authors provide a small "beyond the paper" experiment, I find it insufficient.
For instance, the following questions could have been addressed:
- Can there be a negative impact of the KG reweighting scheme?
- What are the decisions (categories and boxes) that have been modified by the re-optimization?
- Since the KG improvement depends on the quality of the original detections, is it possible to assess what is the semantic cost of modifying those detections to get the ground truth?
- What is the impact of the hyper-parameters?
The code is rather easy to understand and sufficiently commented, with a simple API.
I couldn't install the packages using the yaml file (inconsistencies). I manually installed pytorch and several other packages in a conda environment and it seemed to be working. The complementary installation of pyrwr using pip worked.
I only checked the VOC problem with default parameters. A generic script for regenerating all the results (for the various KG methods) could be provided for completeness. On the version I tested, test_function_kg (l. 154) raised errors, probably due to the fact that the output of the prediction model (l. 136) is a simple tensor and not a list when the input list contains a single element (at least in my PyTorch version). Fixing this error makes the code execute.
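For clarity, this is roughly the kind of guard I added to make it run (an illustrative sketch with a hypothetical helper name, not the repository's actual code):

```python
from typing import Any, List

def as_prediction_list(output: Any) -> List[Any]:
    # Some PyTorch/torchvision versions seem to return a bare tensor/dict for
    # a single-image input instead of a one-element list; wrapping it lets the
    # downstream evaluation code iterate uniformly in both cases.
    return output if isinstance(output, list) else [output]
```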
Outputs of 'python -m Results.results_voc' (on my computer, using PyTorch 1.10.1)
There are 4952 test images containing a total of 14976 objects. Files have been saved to rescience-ijcai2017-230/Datasets.
Currently testing: threshold = 1e-05 bk = 5 lk = 5 epsilon = 0.9 S = KG-CNet-55-VOC top k = 100
TP: tensor(11055.) FP: tensor(397270.) FN: tensor(977.) AP @ 100 per class: {'aeroplane': 0.710239589214325, 'bicycle': 0.7944151759147644, 'bird': 0.7083229422569275, 'boat': 0.49917998909950256, 'bottle': 0.5410887598991394, 'bus': 0.7477031350135803, 'car': 0.8304837346076965, 'cat': 0.8312810659408569, 'chair': 0.4660053849220276, 'cow': 0.8095996379852295, 'diningtable': 0.6266980171203613, 'dog': 0.8198830485343933, 'horse': 0.8325859308242798, 'motorbike': 0.763495922088623, 'person': 0.7816246747970581, 'pottedplant': 0.39205634593963623, 'sheep': 0.7277699112892151, 'sofa': 0.6615461707115173, 'train': 0.7348567247390747, 'tvmonitor': 0.6963554620742798} mAP @ 100 : 0.6987595558166504 Recall @ 100 per class: {'aeroplane': 0.8842105269432068, 'bicycle': 0.9673590660095215, 'bird': 0.9106753468513489, 'boat': 0.8669201731681824, 'bottle': 0.7846481800079346, 'bus': 0.9624413251876831, 'car': 0.9458784461021423, 'cat': 0.9664804339408875, 'chair': 0.8478835821151733, 'cow': 0.9836065173149109, 'diningtable': 0.9174757599830627, 'dog': 0.9795500636100769, 'horse': 0.954023003578186, 'motorbike': 0.944615364074707, 'person': 0.9339664578437805, 'pottedplant': 0.7625000476837158, 'sheep': 0.8966941833496094, 'sofa': 0.9748953580856323, 'train': 0.9503545761108398, 'tvmonitor': 0.8928571343421936} Recall @ 100 all classes (by average): 0.9163517951965332
The figures are consistent with the replication paper, with small variations. An uncertainty analysis with confidence intervals should be added to assess the replication paper's statements.
The paper describes a replication of the original work that seems sound to me. Several implementation details have been clarified after discussing with original authors. The analysis of the algorithm could have been deeper, though, in order to assess more precisely the benefit or limitations of the approach.
@stepherbin Thank you very much for your feedback! @JarlLemmens You can start addressing the points raised in @stepherbin 's review. @hkashyap A gentle reminder: provide your review when possible.
Hi @stepherbin, first of all, sorry for the late reply. Secondly, thank you very much for putting in the time and effort to review our work!
I have checked the installation process on two separate machines, and in both cases it installs without any problems. Could you please provide a more specific error regarding the installation inconsistencies? I think the error in test_function_kg is indeed a result of the different PyTorch installation, as I am not getting this error on my machines.
Regarding the bullet questions you provided, I have some comments/questions for each.
Can there be a negative impact of the KG reweighting scheme? With the KG reweighting scheme, you refer to the re-optimization of the detections using the knowledge graph information, right? Table 2 (results for VOC) and Table 3 (results for COCO) show the differences between FRCNN and KG-CNET57 in recall and mAP. In the case of VOC, we have a recall of 91.7 for the baseline (without re-optimization) vs 91.9 with re-optimization. For the mAP these are 70.4 vs 70.1. For VOC, albeit a small difference, there is indeed a negative impact on mAP when using the KG re-optimization. In the COCO case, there is a similar effect, where recall is positively impacted and mAP negatively. A decrease in mAP means that some detections that were correct without re-optimization have been re-optimized wrongly. So yes, there can be a negative impact. Does this answer the question, or should I be more explicit about this in the paper?
What are the decisions (categories and boxes) that have been modified by the re-optimization? The re-optimization process updates the scores of each category of each detection; only if the re-optimized score of a detection outscores the original detection will the label (category) of that box be updated. The (total) effect of re-optimization per class is also shown in Table 2 for the VOC case. I was wondering if perhaps a small qualitative/case study on one or a few image samples would clarify this matter? Just as in figure 4 of the original paper (https://www.ijcai.org/proceedings/2017/0230.pdf), but then also depicting the negative impact case. Would such an example be sufficient, or does that not answer your question?
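To make the rule concrete, here is a minimal sketch of the relabeling step as described above (simplified, not the exact code from the repository):

```python
import torch

def relabel_boxes(original_scores: torch.Tensor,
                  original_labels: torch.Tensor,
                  reoptimized_scores: torch.Tensor) -> torch.Tensor:
    # original_scores:    (N,)   score of the originally predicted class per box
    # original_labels:    (N,)   originally predicted class index per box
    # reoptimized_scores: (N, C) per-class scores after KG re-optimization
    best_scores, best_labels = reoptimized_scores.max(dim=1)
    # A box only changes label when some re-optimized class score outscores
    # the original detection score; otherwise the original label is kept.
    keep_original = best_scores <= original_scores
    return torch.where(keep_original, original_labels, best_labels)
```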
Since the KG improvement depends on the quality of the original detections, is it possible to assess what is the semantic cost of modifying those detections to get the ground truth? I am not sure what you mean by this question.
What is the impact of the hyper-parameters? For which hyper-parameters would you like to see the impact? For epsilon (in my opinion the one with the most direct impact, being the trade-off between the original detections and the amount of re-optimization), this can quite easily be shown with a separate table/graph by running the existing experiment for different epsilons, as sketched below. The other hyper-parameter with a lot of impact (the box score threshold) is already accounted for in the beyond-the-paper section.
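The sweep itself would be straightforward, something along these lines (run_voc_experiment is a hypothetical wrapper around the existing evaluation script):

```python
# Hypothetical sketch of the epsilon sweep; run_voc_experiment is assumed to
# wrap the existing VOC evaluation with a configurable epsilon trade-off.
epsilons = [0.5, 0.6, 0.7, 0.75, 0.8, 0.9, 1.0]
for eps in epsilons:
    m_ap, recall = run_voc_experiment(S="KG-CNet-55-VOC", epsilon=eps, top_k=100)
    print(f"epsilon={eps}: mAP={m_ap:.3f}, recall@100={recall:.3f}")
```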
I am looking forward to hearing your further input! Also, I would like to gently remind @hkashyap about his review, so that I can work on an updated version of the paper that covers both of your reviews. Thanks again!
@JarlLemmens sorry for the delay, I am currently working on the review and I will submit in a week.
@hkashyap thank you! looking forward to your review
The authors of this ReScience submission (the replication authors) reimplement the paper "Object Detection Meets Knowledge Graphs" by Fang et al., previously published at IJCAI-17. Since the original paper did not provide source code, the replication authors implemented it from scratch using newer PyTorch libraries, different from the original implementation. The most significant difference is that the replication uses a ResNet-50 backbone for object detection, whereas the original paper used a VGG backbone. Due to this change, multiple conclusions were derived, such as that the re-optimization process is dependent on the object detector's performance and that the knowledge graph approach can actually decrease performance for very good detectors. However, the replication study presented here does not include results using the same VGG backbone, which makes it difficult to compare to the results in the original study.
The provided yaml file for the required Conda environment was not helpful due to package incompatibility. I am not sure if all the packages are even required. I had to remove the exact version numbers of the packages to make it work; the authors should do that to the yaml in a minimalistic way.
The dataset and trained models are provided in an organized manner, which saves time, kudos for that.
I observed the same issue as the other reviewer about the output of the prediction model not being a list in my PyTorch installation. After fixing this, the code executed as expected.
I was able to replicate the replication results presented for both the VOC and COCO datasets, as well as the Section 4.3 new experiment results with a higher threshold. In some cases there were slight differences, but nothing significant, mostly rounding changes.
Currently testing: threshold = 1e-05 bk = 5 lk = 5 epsilon = 0.9 S = KG-CNet-55-VOC top k = 100 TP: tensor(11069.) FP: tensor(400168.) FN: tensor(963.) AP @ 100 per class: {'aeroplane': 0.7179303765296936, 'bicycle': 0.8043712973594666, 'bird': 0.696021318435669, 'boat': 0.5145907402038574, 'bottle': 0.5439999103546143, 'bus': 0.7510495185852051, 'car': 0.8366398215293884, 'cat': 0.8359617590904236, 'chair': 0.48131272196769714, 'cow': 0.8080626130104065, 'diningtable': 0.6125404238700867, 'dog': 0.8125175833702087, 'horse': 0.8397406935691833, 'motorbike': 0.7771231532096863, 'person': 0.7843232154846191, 'pottedplant': 0.3963417112827301, 'sheep': 0.737895667552948, 'sofa': 0.6353936195373535, 'train': 0.7309917211532593, 'tvmonitor': 0.6964332461357117} mAP @ 100 : 0.7006620764732361 Recall @ 100 per class: {'aeroplane': 0.8947368860244751, 'bicycle': 0.9643917083740234, 'bird': 0.915032684803009, 'boat': 0.8631178736686707, 'bottle': 0.7889125943183899, 'bus': 0.9577465057373047, 'car': 0.9483763575553894, 'cat': 0.9608938097953796, 'chair': 0.8584656119346619, 'cow': 0.987704873085022, 'diningtable': 0.9271844625473022, 'dog': 0.9795500636100769, 'horse': 0.9597700834274292, 'motorbike': 0.9538461565971375, 'person': 0.9330830574035645, 'pottedplant': 0.7541667222976685, 'sheep': 0.9090908765792847, 'sofa': 0.9707112908363342, 'train': 0.9468084573745728, 'tvmonitor': 0.8928571343421936} Recall @ 100 all classes (by average): 0.9183223843574524
Currently testing: threshold = 1e-05 bk = 5 lk = 5 epsilon = 0.75 S = KF-500-COCO top k = 100 AP @ 100 per class: {'person': 0.42004427313804626, 'bicycle': 0.23318567872047424, 'car': 0.2947446405887604, 'motorcycle': 0.29722216725349426, 'airplane': 0.43403273820877075, 'bus': 0.4945223927497864, 'train': 0.4786454141139984, 'truck': 0.2374054491519928, 'boat': 0.1704973727464676, 'traffic light': 0.19896620512008667, 'fire hydrant': 0.5331732630729675, 'street sign': 0.0, 'stop sign': 0.564350426197052, 'parking meter': 0.267412930727005, 'bench': 0.16225935518741608, 'bird': 0.25010353326797485, 'cat': 0.4558740556240082, 'dog': 0.4444156587123871, 'horse': 0.3762286603450775, 'sheep': 0.4005529284477234, 'cow': 0.3616058826446533, 'elephant': 0.5400511622428894, 'bear': 0.5956308245658875, 'zebra': 0.5560437440872192, 'giraffe': 0.5915590524673462, 'hat': 0.0, 'backpack': 0.09385406225919724, 'umbrella': 0.23406867682933807, 'shoe': 0.0, 'eye glasses': 0.0, 'handbag': 0.08345004916191101, 'tie': 0.22355425357818604, 'suitcase': 0.18850819766521454, 'frisbee': 0.4648464620113373, 'skis': 0.0853266566991806, 'snowboard': 0.17299018800258636, 'sports ball': 0.3863285481929779, 'kite': 0.39229217171669006, 'baseball bat': 0.23580093681812286, 'baseball glove': 0.3419892489910126, 'skateboard': 0.3180837035179138, 'surfboard': 0.2289624959230423, 'tennis racket': 0.4226250648498535, 'bottle': 0.24639634788036346, 'plate': 0.0, 'wine glass': 0.2846655547618866, 'cup': 0.3280509114265442, 'fork': 0.151094451546669, 'knife': 0.09519923478364944, 'spoon': 0.09142474830150604, 'bowl': 0.28933367133140564, 'banana': 0.15350903570652008, 'apple': 0.16788233816623688, 'sandwich': 0.2860325872898102, 'orange': 0.3574628233909607, 'broccoli': 0.20719854533672333, 'carrot': 0.1425948292016983, 'hot dog': 0.26880401372909546, 'pizza': 0.4864168167114258, 'donut': 0.36339670419692993, 'cake': 0.25741809606552124, 'chair': 0.18030241131782532, 'couch': 0.284697026014328, 'potted plant': 0.21564118564128876, 'bed': 0.3336133360862732, 'mirror': 0.0, 'dining table': 0.21570488810539246, 'window': 0.0, 'desk': 0.0, 'toilet': 0.41972097754478455, 'door': 0.0, 'tv': 0.4319237768650055, 'laptop': 0.43690046668052673, 'mouse': 0.4938998222351074, 'remote': 0.21422357857227325, 'keyboard': 0.3535045087337494, 'cell phone': 0.22163479030132294, 'microwave': 0.39696362614631653, 'oven': 0.197708398103714, 'toaster': 0.0, 'sink': 0.25485721230506897, 'refrigerator': 0.306932270526886, 'blender': 0.0, 'book': 0.08003350347280502, 'clock': 0.44794130325317383, 'vase': 0.3089244067668915, 'scissors': 0.15788370370864868, 'teddy bear': 0.2590126097202301, 'hair drier': 0.01155115570873022, 'toothbrush': 0.08186015486717224, 'hair brush': 0.0} mAP @ 100 : 0.29641902446746826 Recall @ 100 per class: {'person': 0.5312979817390442, 'bicycle': 0.39868998527526855, 'car': 0.4529411196708679, 'motorcycle': 0.4635983109474182, 'airplane': 0.5826446413993835, 'bus': 0.6214953660964966, 'train': 0.6061798334121704, 'truck': 0.5083085298538208, 'boat': 0.36059853434562683, 'traffic light': 0.342801570892334, 'fire hydrant': 0.6092308163642883, 'street sign': 0.0, 'stop sign': 0.7229999303817749, 'parking meter': 0.4679245352745056, 'bench': 0.3229965567588806, 'bird': 0.4244897961616516, 'cat': 0.6189743280410767, 'dog': 0.6221053004264832, 'horse': 0.5472119450569153, 'sheep': 0.5715469121932983, 'cow': 0.590721607208252, 'elephant': 0.697282612323761, 'bear': 0.7599999904632568, 'zebra': 0.6704141497612, 'giraffe': 
0.7371428608894348, 'hat': 0.0, 'backpack': 0.3209790289402008, 'umbrella': 0.3830246925354004, 'shoe': 0.0, 'eye glasses': 0.0, 'handbag': 0.28855830430984497, 'tie': 0.3999999761581421, 'suitcase': 0.4179190695285797, 'frisbee': 0.6188235878944397, 'skis': 0.2819494605064392, 'snowboard': 0.3847058415412903, 'sports ball': 0.48247867822647095, 'kite': 0.5243542790412903, 'baseball bat': 0.40263158082962036, 'baseball glove': 0.46637168526649475, 'skateboard': 0.4699029326438904, 'surfboard': 0.4413612484931946, 'tennis racket': 0.5636986494064331, 'bottle': 0.4189937114715576, 'plate': 0.0, 'wine glass': 0.3996710777282715, 'cup': 0.5147222280502319, 'fork': 0.28594595193862915, 'knife': 0.29999998211860657, 'spoon': 0.26174864172935486, 'bowl': 0.5486692190170288, 'banana': 0.34923550486564636, 'apple': 0.4076233506202698, 'sandwich': 0.5113333463668823, 'orange': 0.5780612230300903, 'broccoli': 0.43418803811073303, 'carrot': 0.3876325488090515, 'hot dog': 0.48452386260032654, 'pizza': 0.6437869668006897, 'donut': 0.5281632542610168, 'cake': 0.4871920943260193, 'chair': 0.36778274178504944, 'couch': 0.547340452671051, 'potted plant': 0.4119718074798584, 'bed': 0.6014184355735779, 'mirror': 0.0, 'dining table': 0.43503791093826294, 'window': 0.0, 'desk': 0.0, 'toilet': 0.6198529005050659, 'door': 0.0, 'tv': 0.6224390268325806, 'laptop': 0.5922222137451172, 'mouse': 0.6698412895202637, 'remote': 0.4273223876953125, 'keyboard': 0.5109589099884033, 'cell phone': 0.37449997663497925, 'microwave': 0.5149999856948853, 'oven': 0.4214285910129547, 'toaster': 0.0, 'sink': 0.4689474105834961, 'refrigerator': 0.4726027548313141, 'blender': 0.0, 'book': 0.3266506493091583, 'clock': 0.59375, 'vase': 0.5, 'scissors': 0.3012820780277252, 'teddy bear': 0.4660605788230896, 'hair drier': 0.04444444552063942, 'toothbrush': 0.31707316637039185, 'hair brush': 0.0} Recall @ 100 all classes (averaged): 0.4728471636772156 Recall @ 100 small: 0.28937751054763794 Recall @ 100 medium: 0.5052471160888672 Recall @ 100 large: 0.6245576739311218
Currently testing: threshold = 0.05 bk = 5 lk = 5 epsilon = 0.9 S = KF-500-VOC top k = 100 TP: tensor(10048.) FP: tensor(20716.) FN: tensor(1984.) AP @ 100 per class: {'aeroplane': 0.6866281032562256, 'bicycle': 0.7792302370071411, 'bird': 0.6703633069992065, 'boat': 0.48980414867401123, 'bottle': 0.5355979204177856, 'bus': 0.735668420791626, 'car': 0.819940984249115, 'cat': 0.8110095262527466, 'chair': 0.4661140739917755, 'cow': 0.8022223114967346, 'diningtable': 0.6221809387207031, 'dog': 0.8019189834594727, 'horse': 0.8133650422096252, 'motorbike': 0.7648594975471497, 'person': 0.775041401386261, 'pottedplant': 0.38187214732170105, 'sheep': 0.7274808883666992, 'sofa': 0.6267770528793335, 'train': 0.714433491230011, 'tvmonitor': 0.6733880043029785} mAP @ 100 : 0.6848948001861572 Recall @ 100 per class: {'aeroplane': 0.7859649062156677, 'bicycle': 0.8961424231529236, 'bird': 0.7930282950401306, 'boat': 0.6844106316566467, 'bottle': 0.6695095896720886, 'bus': 0.8967136144638062, 'car': 0.8917568325996399, 'cat': 0.916201114654541, 'chair': 0.6904761791229248, 'cow': 0.9262294769287109, 'diningtable': 0.8203883767127991, 'dog': 0.920245349407196, 'horse': 0.9109195470809937, 'motorbike': 0.8830769062042236, 'person': 0.8659452199935913, 'pottedplant': 0.5916666984558105, 'sheep': 0.8181818127632141, 'sofa': 0.9121338725090027, 'train': 0.8581560254096985, 'tvmonitor': 0.7792207598686218} Recall @ 100 all classes (by average): 0.8255184292793274
The code is highly readable and I did not have any problem navigating it.
First of all, this is the only open source implementation of an interesting paper, and the replication authors implemented it from scratch. As the replication authors pointed out, many minute yet crucial details, such as the epsilon values and tricks such as setting non-highest-scoring class probabilities to 0, would have been lost without this replication effort.
By using a different backbone network for object detection, the replication authors have broadened the scope of the discussion. Is the re-optimization method dependent on the object detector's performance? Is the re-optimization useful at all for top-performing object detectors? These are interesting questions not discussed in the original paper.
The replication objectively shows with evidence that some of the claims in the original paper are not generalizable. The arguments are straightforward and clear.
The extra experiment in Section 4.3 shows that for strong detections (when false positives are to be avoided, as in most practical cases), the re-optimization actually hurts performance. This is an interesting observation missed in the published paper.
As this is a replication, the first thing to do is to implement the original method exactly; it could use different libraries like PyTorch, but the method should be the same. VGG and ResNet are two different networks, and as backbones to Faster R-CNN they are difficult to compare. Therefore, the replication should first implement the method using a VGG backbone and compare against the original paper, before exploring and comparing against other backbones like ResNet.
In Section 5, an argument was made that ResNet-50 already accounts for co-occurrences and hence cannot be improved with re-optimization, while on the other hand VGG can still be improved. While it may be possible, it cannot be argued conclusively based on the final mAP and recall numbers alone. The argument needs to be verified with data; one way to measure this could be to look at how much the P matrix changes during re-optimization for ResNet and VGG. If the P change is always significantly greater for VGG than for ResNet, then it could back the argument.
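As a rough illustration of what I have in mind (illustrative only, not existing code in the repository):

```python
import torch

def p_matrix_shift(p_before: torch.Tensor, p_after: torch.Tensor) -> float:
    # P is the (num_boxes x num_classes) detection probability matrix.
    # The mean absolute change after re-optimization gives a single number
    # that can be compared between VGG- and ResNet-based detectors.
    return (p_after - p_before).abs().mean().item()
```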
While the original paper is vague about mAP changes due to re-optimization, calling it "do not compromise mAP" and "without affecting mAP", it does not claim an mAP improvement. In fact, the published results also show a decrease in mAP after re-optimization. Therefore, it should not be described in Section 4.2 as not being in line with the original paper; rather, the mAP decrease is stronger for ResNet. Again, a VGG backbone implementation will make this clearer.
To test the argument that better object detectors learn the co-occurrences better, the replication authors can compare results using progressively deeper ResNet architectures, to see whether there is a gradual improvement with deeper architectures.
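A possible sketch of that comparison, using torchvision's FPN backbone helper (training, evaluation, and the re-optimization comparison are assumed to reuse the existing pipeline):

```python
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

for depth in ("resnet18", "resnet34", "resnet50", "resnet101"):
    # Build a Faster R-CNN with a progressively deeper backbone; then train,
    # evaluate, and compare the mAP/recall changes from re-optimization.
    backbone = resnet_fpn_backbone(depth, pretrained=True)
    model = FasterRCNN(backbone, num_classes=21)  # 20 VOC classes + background
```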
Instead of comparing just two threshold values, the replication authors should show the trend of re-optimization improvement vs. threshold value.
The submission provides a reimplementation of an interesting idea combining deep learning with knowledge-based approaches, so it has value for the AI community. However, the comments of both reviewers need to be addressed in an updated version before it can be accepted.
@hkashyap Thank you very much for your feedback!
@JarlLemmens a gentle reminder that @hkashyap 's review awaits your answers.
@stepherbin do you have comments or requests after @JarlLemmens 's answers to your review?
Hi @benoit-girard, thank you for the check-up. I was looking into @hkashyap 's feedback and trying to implement the VGG-16 backbone as suggested. I managed to implement the VGG-16 backbone in PyTorch, following the explanation at https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html to modify the model to use another backbone. I trained the model on the VOC dataset with parameters similar to those described in the original paper, and ran the experiments again, also with the described parameters.
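Concretely, the backbone swap follows the tutorial's pattern, roughly like this (a sketch; the exact anchor sizes and training hyper-parameters I used may differ):

```python
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator
from torchvision.ops import MultiScaleRoIAlign

# VGG-16 convolutional features as the Faster R-CNN backbone.
backbone = torchvision.models.vgg16(pretrained=True).features
backbone.out_channels = 512  # VGG-16 feature maps have 512 channels

anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))
roi_pooler = MultiScaleRoIAlign(featmap_names=['0'], output_size=7, sampling_ratio=2)

model = FasterRCNN(backbone,
                   num_classes=21,  # 20 VOC classes + background
                   rpn_anchor_generator=anchor_generator,
                   box_roi_pool=roi_pooler)
```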
I found the following results:
| model | ResNet50 (as used before): mAP | ResNet50 (as used before): recall | VGG-16 (trained by me): mAP | VGG-16 (trained by me): recall | VGG-16 (published by them): mAP | VGG-16 (published by them): recall |
|---|---|---|---|---|---|---|
| FRCNN | 70.4 | 91.7 | 59.0 | 88.9 | 66.5 | 81.9 |
| KF-500 | 70.4 | 91.8 | 58.9 | 88.9 | 66.5 | 83.8 |
| KF-All | 70.4 | 92.1 | 58.9 | 89.1 | 66.5 | 84.6 |
| KG-CNet | 70.1 | 91.9 | 58.6 | 88.9 | 66.6 | 85.0 |
As you can see, there is quite a difference between the results I found and the published results regarding the baselines. However, the knowledge re-optimization is done after the model is completely trained, so the trends in the results can still be compared. It is interesting to see that these trends follow the trends found earlier with the ResNet50 model, where the mAP remains more or less equal (slightly lower for the KG-CNet case) and the increase in recall does not match the findings of the original paper.
So in my opinion this experiment shows that the (results of the) original paper are not reproducible. The methodology, however, is reproducible, and could be valuable for further research, perhaps in fields other than object detection. Since the original code is not available, I think it could be a great contribution to publish this reproducibility paper/code.
I wonder if the findings of this extra experiment (along with the other comments) will be sufficient for this work to be considered for publication, after including them in the paper of course. I hope to hear from you regarding these new insights.
Hi, I was wondering if you by any chance had the opportunity to read my question already. @hkashyap @stepherbin @benoit-girard
Thank you for the additional experiment using the VGG backbone. It is interesting to see that you achieved a similar trend in mAP and recall as with the ResNet results, yet different from the claim about increasing recall in the original paper. This, to me, confirms the non-reproducibility of the published results. However, can you explain the difference in mAP between the VGG results achieved in your experiment and the original paper? They are in different ranges altogether. As I mentioned in my previous comments, I think it would be useful to compare P matrix changes for VGG and ResNet (at multiple levels of depth) during re-optimization to confirm that the presented scheme actually is detrimental for strong detectors. With these added to the manuscript and the code, I think this reproduction study is suitable for publication in ReScience.
Hi @hkashyap, thank you for the comments. I just wanted to mention here that I am working on a revised version of the paper/code that includes the feedback, and that it will follow in the coming weeks.
Any progress?
Hi @rougier, @benoit-girard, @hkashyap, @stepherbin,
I have updated the paper and code to address all discussed feedback.
The updated final paper, and relevant information can be found here:
Original article: https://www.ijcai.org/proceedings/2017/0230.pdf
PDF URL: [Re] Object Detection Meets Knowledge Graphs.pdf
Metadata URL: metadata.yaml
Code URL: https://github.com/tue-mps/rescience-ijcai2017-230
Scientific domain: Computer Vision
Programming language: Python
To clarify the updates in this version:
I have resolved the environment issues by including a requirement.txt file that contains only the essential packages, instead of the complete environment. I have also updated the installation process in the readme accordingly.
The issue with the model output being a list/tensor has been resolved in the code.
I have included a script that will run multiple experiments such that the output of the experiments can be easily reproduced.
I have adapted the model to include a VGG-16 backbone (similar to the original paper) to verify the results of the original paper. As there was no readily available model for this, I had to recreate their original Caffe model in PyTorch, which I did to the best of my knowledge. There are still small differences between both baselines, which are likely due to (small) differences in how the models are implemented in the two environments. By adding this experiment, we can conclude that the original implementation is not reproducible at all.
I also included an additional experiment on the VOC dataset with a ResNet-18 backbone, which shows that our earlier claim that the presented scheme is detrimental for stronger detectors has to be dropped, as this is not the case.
The results in the ResNet-50 tables have been updated as well (in case someone notices different values there); this is because of a small change in the parameters of our original implementation.
I expanded the beyond-the-paper section with two sets of experiments. The first looks at a set of different box score threshold values to see the effect of re-optimization on bounding boxes with different scores. The second extra experiment shows the effect of the hyper-parameter epsilon, which is the factor that decides the trade-off between knowledge awareness and the original scores.
In summary, we have included a comparison with the original implementation, along with additional models and experiments that go beyond the original paper, and conclude that the claims in the original paper are not reproducible.
I think your article is in good shape and @benoit-girard can probably make a decision soon. Don't hesitate to remind us here if you don't see any progress soon (or email me directly). We're in the middle of a transition to a new (more responsive) system and this might delay the processing of this submission a bit.
Sorry for the delayed answer. I do validate the paper as it is, thank you for all the provided improvements. We will start the publishing process.
Thank you @rougier and @benoit-girard, that's great news! Is there anything that I can/have to do in the coming period?
Sorry for the delay. Can you point me to the sources of your article? I'll need them for publication (as well as the metadata.yaml file).
Hi @rougier
The updated final paper, and relevant information can be found here:
Original article: https://www.ijcai.org/proceedings/2017/0230.pdf
PDF URL: [Re] Object Detection Meets Knowledge Graphs.pdf
Metadata URL: metadata.yaml
Code URL: https://github.com/tue-mps/rescience-ijcai2017-230
Sorry, I meant the sources for your article (I'll need to recompile it several times)
Do you mean the LaTeX source code?
Yes. Or else you can fill in the metadata.yaml (asking me for any missing information), re-compile the PDF and send me the links. But we'll need to do that 2 or 3 times.
I have added the sources to the link. Please let me know if this is indeed what you need.
Yes, but I cannot compile it. Can you?
Oh.. I must admit that I worked with the Overleaf template, so I filled in the metadata.tex file manually as suggested (but still filled in the yaml file later, because I thought you needed it). So I downloaded the article sources from there. I tried using the 'make' command, but it indeed gives me an error that it cannot find a specified file. I don't know if I made a mistake somewhere, but if it's easiest, I can just manually edit the metadata.tex file again and recompile in Overleaf? Or do you strictly require the compilation with 'make' to work?
Ok. Then you can fill in the editor + reviewers (look at the ReScience board page for the ORCIDs), fill in the dates, and use volume 9, issue 1, number 1. Generate a new PDF and post both the PDF and the metadata here. I'll use them to get a DOI that you'll need to integrate in a second step.
Hi,
Sorry if it takes a couple days for my responses, I am currently finishing up on my Master's thesis, so it is a bit hectic.
I couldn't find @stepherbin 's info on the ReScience board page, but I managed to find it online.
The metadata is updated and still here
The new article: Re.Object.Detection.Meets.Knowledge.Graphs.pdf
Hi @rougier, just noticed that I didn't tag you in my last message, so here's a quick notification :)
Sorry for the delay. We'll need to synchronize at some point. In your PDF, check the "A replication of Fang2017ObjectGraphs.", which should be replaced by the actual citation. In the metadata, add quotes around the replication / cite entries and try to fill in the abstract.
Hi @rougier, I tried adding the quotes in the metadata, but that did not change anything. I guess it has something to do with the (standard) header.tex file. Here the abstract is also commented out, so I don't know what's up with that. I added the abstract to the metadata, but the PDF is unchanged from before.
Ok. Then you can directly modify the latex file (changes will be lost next time but we have only 2 passes before publication)
Hey @rougier, sorry for the late response again, but I'm now fully done with my Master's, so I should be able to finish this soon. I'm still a bit confused about what it should look like. I tried looking at earlier published works, but they don't seem to have a standard layout in this regard. If you could link me to one of the earlier published works that has the same citation format etc., that would be very helpful!
Everything is almost good. The last point is "A replication of Fang2017ObjectGraphs", which should be replaced with "A replication of Y. Fang, K. Kuan, J. Lin, C. Tan, and V. Chandrasekhar. “Object detection meets knowledge graphs.” In: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, {IJCAI-17}. International Joint Conferences on Artificial Intelligence, 2017, pp. 1661–1667."
At this point, the easiest is to directly modify the latex.
Ah okay, I see what you mean now. Here is the adapted version again: Re.Object.Detection.Meets.Knowledge.Graphs.pdf
Perfect! Let's try to publish it now!
Given that I cannot compile your sources, the easiest way would be to have a quick video meeting (10 min) to speed things up. Can you contact me by mail so that we can find the best time to meet?
It's online! Thanks again and very sorry for all the delays. https://rescience.github.io/read/
Yess! Very cool to see it online :D Thank you!