long8v closed 3 months ago
Interesting - does this change if you compute AP as described in appendix B.1 in the paper? I don't think that sklearn is using the same process. @spetryk - do you have any thoughts on this?
I am confused. If I follow Appendix B.1, I get 0.0733:
ap = 0
for sample in output:
    if sample["foil"]:  # ".. positive label (1) to be 'hallucination' .." in the appendix
        ap += 1 - sample["CLIPScore"]
print(ap / len(output))
0.07338134765625
But the next paragraph seems to follow the standard AP metric, which I assume is the same metric as the sklearn implementation.
Could you share the snippet used to calculate AP in the paper? I see three possibilities: 1) the AP score I measured is inaccurate; 2) I missed something in the CLIPScore metric (I compared the CLIPScore repo with aloha/src/aloha/metrics/clipscore.py and found no difference that would change the output); 3) the HAT dataset in this repo is not the same set as the dataset used in the paper.
The AP calculation above is slightly wrong, since ALOHa is inverted to CLIPScore (i.e. for ALOHa you need 1-, while for CLIPScore it should just be the raw value). That being said, it shouldn't matter that much, since the scores should be similar to sklearn. Are the other measures that you are generating similar to the paper results?
Looking back at our experimental results, it seems like something is different in the code that is being used, since for our code, we got a CLIPScore for image COCO_val2014_000000023709.jpg of 0.73291015625, instead of 0.70068359375.
I looked a bit at the committed code, and it looks like this might be the culprit: https://github.com/DavidMChan/aloha/blob/e38d69e0004a044254cef2641985c7ae4e01efd4/src/aloha/metrics/clipscore.py#L175C1-L176C29
Can you try changing this to the "ViT-B/32" version of CLIP and see if you get the higher scores?
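For context on why the backbone matters: as I understand the CLIPScore paper, the score is `w * max(cos(image_emb, text_emb), 0)` with `w = 2.5`, so any change in the embeddings (for example from a different CLIP variant) shifts the score directly. The sketch below only illustrates that formula on plain Python lists; `cosine` and `clipscore` are hypothetical helper names, and the toy vectors stand in for real CLIP embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clipscore(image_emb, text_emb, w=2.5):
    """Reference-free CLIPScore: rescaled cosine similarity, clipped at zero."""
    return w * max(cosine(image_emb, text_emb), 0.0)


# Toy vectors (assumption: real inputs would come from CLIP's image and text encoders).
aligned = clipscore([1.0, 0.0], [1.0, 0.0])      # identical direction
orthogonal = clipscore([1.0, 0.0], [0.0, 1.0])   # unrelated direction
opposite = clipscore([1.0, 0.0], [-1.0, 0.0])    # negative cosine is clipped
print(aligned, orthogonal, opposite)
```

Because of the 2.5 rescaling, values slightly above 1 are possible, which is consistent with a few of the scores reported further down in this thread.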
> The AP calculation above is slightly wrong, since ALOHa is inverted to CLIPScore (i.e. for ALOHa you need 1-, while for CLIPScore it should just be the raw value)

I believe CLIPScore also needs to be inverted: CLIPScore measures how well the caption and image are aligned, so a caption with a foil should receive a lower CLIPScore. Could you provide the AP calculation snippet used in the paper so that I can reproduce it exactly?
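To make the orientation question concrete, here is a minimal sketch of ranking-based AP computed both ways. It is not the paper's code; `average_precision` is a hypothetical helper intended to follow the step-wise definition that scikit-learn documents for `average_precision_score` (sum over thresholds of the recall increase times precision), with foil as the positive label. The four samples are made-up toy values, not results.

```python
def average_precision(labels, scores):
    """AP = sum_n (R_n - R_{n-1}) * P_n, ranking from highest to lowest score.

    Samples with tied scores are treated as a single threshold.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    ap, true_pos, prev_recall = 0.0, 0, 0.0
    i = 0
    while i < len(ranked):
        j = i
        # Consume every sample tied at this score before evaluating the threshold.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            true_pos += ranked[j][1]
            j += 1
        recall = true_pos / n_pos
        precision = true_pos / j  # j samples have been retrieved so far
        ap += (recall - prev_recall) * precision
        prev_recall = recall
        i = j
    return ap


# Toy data (assumption): foils happen to receive the lowest CLIPScores.
samples = [
    {"CLIPScore": 0.9, "foil": False},
    {"CLIPScore": 0.8, "foil": False},
    {"CLIPScore": 0.6, "foil": True},
    {"CLIPScore": 0.5, "foil": True},
]
labels = [int(s["foil"]) for s in samples]
raw = [s["CLIPScore"] for s in samples]

ap_inverted = average_precision(labels, [1 - x for x in raw])  # low score => foil
ap_raw = average_precision(labels, raw)                        # high score => foil
print(ap_inverted, ap_raw)
```

On this toy input the two orientations give very different values, so which one the paper used matters for reproducing the reported number.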
> Looking back at our experimental results, it seems like something is different in the code that is being used, since for our code, we got a CLIPScore for image COCO_val2014_000000023709.jpg of 0.73291015625, instead of 0.70068359375.

I used "ViT-B/32", not "RN50x64", since I used the CLIPScore repo rather than this ALOHa repo. I compared the CLIPScore repo with aloha/src/aloha/metrics/clipscore.py and found no difference that would change the output. Do you have any other guesses? 👀
I think it might be the environment :/ I changed my environment to the versions below, and COCO_val2014_000000023709.jpg now returns 0.7119140625.
torch==1.7.1
torchvision==0.8.2
numpy==1.20.3
scikit-learn==0.23.1
(I referred to the initial commit https://github.com/openai/CLIP/commit/3bee28119e6b28e75b82b811b87b56935314e6a5.) However, it still differs from the value you reported (0.73291015625), so it would be very helpful to know the environment (torch, torchvision, numpy) you used.
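A small sketch that might make it easier to compare environments on both sides: it prints the installed versions of the packages mentioned above. `report_versions` is a hypothetical helper; it relies only on the standard-library `importlib.metadata` module (Python 3.8+), and packages that are not installed are reported as `None`.

```python
import platform
from importlib import metadata


def report_versions(packages):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Package list taken from the pins above.
env = report_versions(["torch", "torchvision", "numpy", "scikit-learn"])
print("python", platform.python_version())
for name, version in env.items():
    print(name, version)
```

Posting this output from both machines would show at a glance which package versions differ.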
[{"image_id": "COCO_val2014_000000016903", "CLIPScore": 0.8759765625, "foil": true}, {"image_id": "COCO_val2014_000000023709", "CLIPScore": 0.7119140625, "foil": false}, {"image_id": "COCO_val2014_000000553561", "CLIPScore": 0.90380859375, "foil": false}, {"image_id": "COCO_val2014_000000090367", "CLIPScore": 0.8720703125, "foil": false}, {"image_id": "COCO_val2014_000000539226", "CLIPScore": 0.6904296875, "foil": false}, {"image_id": "COCO_val2014_000000122838", "CLIPScore": 0.69921875, "foil": false}, {"image_id": "COCO_val2014_000000450577", "CLIPScore": 0.82470703125, "foil": false}, {"image_id": "COCO_val2014_000000196660", "CLIPScore": 0.7900390625, "foil": false}, {"image_id": "COCO_val2014_000000089541", "CLIPScore": 0.97216796875, "foil": false}, {"image_id": "COCO_val2014_000000228013", "CLIPScore": 0.79541015625, "foil": false}, {"image_id": "COCO_val2014_000000226579", "CLIPScore": 0.8583984375, "foil": false}, {"image_id": "COCO_val2014_000000464689", "CLIPScore": 0.7802734375, "foil": true}, {"image_id": "COCO_val2014_000000536292", "CLIPScore": 0.8564453125, "foil": false}, {"image_id": "COCO_val2014_000000331799", "CLIPScore": 0.71337890625, "foil": true}, {"image_id": "COCO_val2014_000000266491", "CLIPScore": 0.76171875, "foil": true}, {"image_id": "COCO_val2014_000000570594", "CLIPScore": 0.62451171875, "foil": false}, {"image_id": "COCO_val2014_000000481710", "CLIPScore": 0.7861328125, "foil": false}, {"image_id": "COCO_val2014_000000461953", "CLIPScore": 0.837890625, "foil": false}, {"image_id": "COCO_val2014_000000206751", "CLIPScore": 0.80859375, "foil": true}, {"image_id": "COCO_val2014_000000218205", "CLIPScore": 0.7021484375, "foil": false}, {"image_id": "COCO_val2014_000000016161", "CLIPScore": 0.91552734375, "foil": false}, {"image_id": "COCO_val2014_000000134103", "CLIPScore": 0.76708984375, "foil": false}, {"image_id": "COCO_val2014_000000103870", "CLIPScore": 0.87646484375, "foil": true}, {"image_id": "COCO_val2014_000000491154", 
"CLIPScore": 0.9189453125, "foil": false}, {"image_id": "COCO_val2014_000000538721", "CLIPScore": 0.6875, "foil": true}, {"image_id": "COCO_val2014_000000234676", "CLIPScore": 0.70068359375, "foil": false}, {"image_id": "COCO_val2014_000000382512", "CLIPScore": 0.826171875, "foil": true}, {"image_id": "COCO_val2014_000000006701", "CLIPScore": 0.779296875, "foil": false}, {"image_id": "COCO_val2014_000000333190", "CLIPScore": 0.76416015625, "foil": true}, {"image_id": "COCO_val2014_000000050753", "CLIPScore": 0.8134765625, "foil": false}, {"image_id": "COCO_val2014_000000345469", "CLIPScore": 0.8994140625, "foil": false}, {"image_id": "COCO_val2014_000000489023", "CLIPScore": 0.66015625, "foil": false}, {"image_id": "COCO_val2014_000000221725", "CLIPScore": 0.818359375, "foil": false}, {"image_id": "COCO_val2014_000000535997", "CLIPScore": 0.69091796875, "foil": false}, {"image_id": "COCO_val2014_000000367429", "CLIPScore": 0.8798828125, "foil": false}, {"image_id": "COCO_val2014_000000411587", "CLIPScore": 0.86376953125, "foil": false}, {"image_id": "COCO_val2014_000000578703", "CLIPScore": 0.77099609375, "foil": true}, {"image_id": "COCO_val2014_000000101280", "CLIPScore": 0.80126953125, "foil": true}, {"image_id": "COCO_val2014_000000577310", "CLIPScore": 0.826171875, "foil": false}, {"image_id": "COCO_val2014_000000167656", "CLIPScore": 0.6572265625, "foil": false}, {"image_id": "COCO_val2014_000000209835", "CLIPScore": 0.740234375, "foil": false}, {"image_id": "COCO_val2014_000000261116", "CLIPScore": 0.86474609375, "foil": true}, {"image_id": "COCO_val2014_000000224037", "CLIPScore": 0.69384765625, "foil": false}, {"image_id": "COCO_val2014_000000183407", "CLIPScore": 0.779296875, "foil": false}, {"image_id": "COCO_val2014_000000347675", "CLIPScore": 0.7490234375, "foil": true}, {"image_id": "COCO_val2014_000000280918", "CLIPScore": 0.9140625, "foil": false}, {"image_id": "COCO_val2014_000000083113", "CLIPScore": 0.8642578125, "foil": false}, {"image_id": 
"COCO_val2014_000000010432", "CLIPScore": 0.8544921875, "foil": true}, {"image_id": "COCO_val2014_000000173574", "CLIPScore": 0.71044921875, "foil": true}, {"image_id": "COCO_val2014_000000561214", "CLIPScore": 0.75439453125, "foil": true}, {"image_id": "COCO_val2014_000000227901", "CLIPScore": 0.71337890625, "foil": true}, {"image_id": "COCO_val2014_000000227960", "CLIPScore": 0.8193359375, "foil": false}, {"image_id": "COCO_val2014_000000466960", "CLIPScore": 0.66015625, "foil": false}, {"image_id": "COCO_val2014_000000245852", "CLIPScore": 0.806640625, "foil": false}, {"image_id": "COCO_val2014_000000129592", "CLIPScore": 0.8994140625, "foil": false}, {"image_id": "COCO_val2014_000000555648", "CLIPScore": 0.7490234375, "foil": false}, {"image_id": "COCO_val2014_000000229599", "CLIPScore": 0.8623046875, "foil": false}, {"image_id": "COCO_val2014_000000082465", "CLIPScore": 0.76220703125, "foil": true}, {"image_id": "COCO_val2014_000000249672", "CLIPScore": 0.8046875, "foil": false}, {"image_id": "COCO_val2014_000000441211", "CLIPScore": 0.744140625, "foil": false}, {"image_id": "COCO_val2014_000000481670", "CLIPScore": 0.75927734375, "foil": false}, {"image_id": "COCO_val2014_000000304741", "CLIPScore": 0.9541015625, "foil": true}, {"image_id": "COCO_val2014_000000534045", "CLIPScore": 0.87841796875, "foil": true}, {"image_id": "COCO_val2014_000000514586", "CLIPScore": 0.83544921875, "foil": true}, {"image_id": "COCO_val2014_000000523252", "CLIPScore": 0.75732421875, "foil": true}, {"image_id": "COCO_val2014_000000201301", "CLIPScore": 0.8974609375, "foil": true}, {"image_id": "COCO_val2014_000000191981", "CLIPScore": 0.728515625, "foil": false}, {"image_id": "COCO_val2014_000000179317", "CLIPScore": 0.83740234375, "foil": true}, {"image_id": "COCO_val2014_000000492800", "CLIPScore": 0.6865234375, "foil": true}, {"image_id": "COCO_val2014_000000077595", "CLIPScore": 0.8271484375, "foil": true}, {"image_id": "COCO_val2014_000000196594", "CLIPScore": 0.67626953125, 
"foil": true}, {"image_id": "COCO_val2014_000000000139", "CLIPScore": 0.70751953125, "foil": false}, {"image_id": "COCO_val2014_000000377832", "CLIPScore": 0.8564453125, "foil": true}, {"image_id": "COCO_val2014_000000018737", "CLIPScore": 0.826171875, "foil": false}, {"image_id": "COCO_val2014_000000212470", "CLIPScore": 0.892578125, "foil": true}, {"image_id": "COCO_val2014_000000356261", "CLIPScore": 0.88818359375, "foil": false}, {"image_id": "COCO_val2014_000000128570", "CLIPScore": 0.93310546875, "foil": false}, {"image_id": "COCO_val2014_000000007320", "CLIPScore": 0.779296875, "foil": false}, {"image_id": "COCO_val2014_000000392928", "CLIPScore": 0.818359375, "foil": false}, {"image_id": "COCO_val2014_000000066046", "CLIPScore": 0.94482421875, "foil": false}, {"image_id": "COCO_val2014_000000253282", "CLIPScore": 0.806640625, "foil": false}, {"image_id": "COCO_val2014_000000296303", "CLIPScore": 0.7265625, "foil": false}, {"image_id": "COCO_val2014_000000574592", "CLIPScore": 0.890625, "foil": false}, {"image_id": "COCO_val2014_000000273825", "CLIPScore": 0.8212890625, "foil": false}, {"image_id": "COCO_val2014_000000027805", "CLIPScore": 0.8544921875, "foil": false}, {"image_id": "COCO_val2014_000000236272", "CLIPScore": 0.68310546875, "foil": true}, {"image_id": "COCO_val2014_000000433998", "CLIPScore": 0.8515625, "foil": false}, {"image_id": "COCO_val2014_000000497141", "CLIPScore": 0.9169921875, "foil": false}, {"image_id": "COCO_val2014_000000518188", "CLIPScore": 0.7060546875, "foil": true}, {"image_id": "COCO_val2014_000000514979", "CLIPScore": 0.794921875, "foil": false}, {"image_id": "COCO_val2014_000000319687", "CLIPScore": 0.7412109375, "foil": false}, {"image_id": "COCO_val2014_000000261758", "CLIPScore": 0.8134765625, "foil": false}, {"image_id": "COCO_val2014_000000336568", "CLIPScore": 0.83447265625, "foil": false}, {"image_id": "COCO_val2014_000000028864", "CLIPScore": 0.888671875, "foil": false}, {"image_id": "COCO_val2014_000000566049", 
"CLIPScore": 0.83056640625, "foil": false}, {"image_id": "COCO_val2014_000000117676", "CLIPScore": 0.74951171875, "foil": false}, {"image_id": "COCO_val2014_000000128813", "CLIPScore": 0.88623046875, "foil": false}, {"image_id": "COCO_val2014_000000190432", "CLIPScore": 0.86865234375, "foil": false}, {"image_id": "COCO_val2014_000000101660", "CLIPScore": 0.8193359375, "foil": true}, {"image_id": "COCO_val2014_000000463785", "CLIPScore": 0.79052734375, "foil": false}, {"image_id": "COCO_val2014_000000410141", "CLIPScore": 0.6796875, "foil": false}, {"image_id": "COCO_val2014_000000237041", "CLIPScore": 0.802734375, "foil": false}, {"image_id": "COCO_val2014_000000443347", "CLIPScore": 0.802734375, "foil": false}, {"image_id": "COCO_val2014_000000276720", "CLIPScore": 0.6845703125, "foil": false}, {"image_id": "COCO_val2014_000000028850", "CLIPScore": 0.7099609375, "foil": false}, {"image_id": "COCO_val2014_000000500940", "CLIPScore": 0.87890625, "foil": false}, {"image_id": "COCO_val2014_000000314412", "CLIPScore": 0.88916015625, "foil": false}, {"image_id": "COCO_val2014_000000172201", "CLIPScore": 0.79150390625, "foil": false}, {"image_id": "COCO_val2014_000000232598", "CLIPScore": 0.8427734375, "foil": false}, {"image_id": "COCO_val2014_000000113113", "CLIPScore": 0.763671875, "foil": false}, {"image_id": "COCO_val2014_000000483401", "CLIPScore": 0.84228515625, "foil": false}, {"image_id": "COCO_val2014_000000032258", "CLIPScore": 0.779296875, "foil": false}, {"image_id": "COCO_val2014_000000158887", "CLIPScore": 0.90087890625, "foil": false}, {"image_id": "COCO_val2014_000000258523", "CLIPScore": 0.802734375, "foil": false}, {"image_id": "COCO_val2014_000000439770", "CLIPScore": 0.796875, "foil": true}, {"image_id": "COCO_val2014_000000217301", "CLIPScore": 0.89794921875, "foil": true}, {"image_id": "COCO_val2014_000000192905", "CLIPScore": 0.845703125, "foil": true}, {"image_id": "COCO_val2014_000000363577", "CLIPScore": 0.880859375, "foil": true}, {"image_id": 
"COCO_val2014_000000149568", "CLIPScore": 0.7060546875, "foil": true}, {"image_id": "COCO_val2014_000000127660", "CLIPScore": 0.91357421875, "foil": false}, {"image_id": "COCO_val2014_000000299493", "CLIPScore": 0.8154296875, "foil": false}, {"image_id": "COCO_val2014_000000293757", "CLIPScore": 0.82275390625, "foil": true}, {"image_id": "COCO_val2014_000000386912", "CLIPScore": 0.7099609375, "foil": false}, {"image_id": "COCO_val2014_000000451084", "CLIPScore": 0.814453125, "foil": false}, {"image_id": "COCO_val2014_000000376545", "CLIPScore": 0.794921875, "foil": false}, {"image_id": "COCO_val2014_000000327401", "CLIPScore": 0.81787109375, "foil": true}, {"image_id": "COCO_val2014_000000562614", "CLIPScore": 0.81982421875, "foil": false}, {"image_id": "COCO_val2014_000000366264", "CLIPScore": 0.6669921875, "foil": false}, {"image_id": "COCO_val2014_000000036450", "CLIPScore": 0.80810546875, "foil": false}, {"image_id": "COCO_val2014_000000202825", "CLIPScore": 0.89599609375, "foil": true}, {"image_id": "COCO_val2014_000000308506", "CLIPScore": 0.7373046875, "foil": false}, {"image_id": "COCO_val2014_000000511469", "CLIPScore": 0.88818359375, "foil": false}, {"image_id": "COCO_val2014_000000264191", "CLIPScore": 0.77099609375, "foil": false}, {"image_id": "COCO_val2014_000000528276", "CLIPScore": 0.87939453125, "foil": true}, {"image_id": "COCO_val2014_000000375415", "CLIPScore": 0.900390625, "foil": true}, {"image_id": "COCO_val2014_000000095677", "CLIPScore": 0.9755859375, "foil": false}, {"image_id": "COCO_val2014_000000043997", "CLIPScore": 0.8798828125, "foil": false}, {"image_id": "COCO_val2014_000000102577", "CLIPScore": 0.78369140625, "foil": false}, {"image_id": "COCO_val2014_000000139917", "CLIPScore": 0.98193359375, "foil": false}, {"image_id": "COCO_val2014_000000207561", "CLIPScore": 0.830078125, "foil": false}, {"image_id": "COCO_val2014_000000169331", "CLIPScore": 0.76953125, "foil": true}, {"image_id": "COCO_val2014_000000375415", "CLIPScore": 
0.86181640625, "foil": false}, {"image_id": "COCO_val2014_000000538463", "CLIPScore": 0.7509765625, "foil": false}, {"image_id": "COCO_val2014_000000323925", "CLIPScore": 0.767578125, "foil": false}, {"image_id": "COCO_val2014_000000091615", "CLIPScore": 0.8037109375, "foil": false}, {"image_id": "COCO_val2014_000000543692", "CLIPScore": 0.783203125, "foil": false}, {"image_id": "COCO_val2014_000000362023", "CLIPScore": 0.69775390625, "foil": true}, {"image_id": "COCO_val2014_000000331250", "CLIPScore": 0.71728515625, "foil": true}, {"image_id": "COCO_val2014_000000528786", "CLIPScore": 0.78369140625, "foil": false}, {"image_id": "COCO_val2014_000000134596", "CLIPScore": 0.595703125, "foil": true}, {"image_id": "COCO_val2014_000000455741", "CLIPScore": 0.71826171875, "foil": true}, {"image_id": "COCO_val2014_000000431573", "CLIPScore": 0.94775390625, "foil": false}, {"image_id": "COCO_val2014_000000552901", "CLIPScore": 0.828125, "foil": false}, {"image_id": "COCO_val2014_000000050165", "CLIPScore": 0.791015625, "foil": true}, {"image_id": "COCO_val2014_000000473299", "CLIPScore": 0.80029296875, "foil": true}, {"image_id": "COCO_val2014_000000245145", "CLIPScore": 0.72705078125, "foil": true}, {"image_id": "COCO_val2014_000000004840", "CLIPScore": 0.7705078125, "foil": false}, {"image_id": "COCO_val2014_000000125208", "CLIPScore": 0.88623046875, "foil": false}, {"image_id": "COCO_val2014_000000515585", "CLIPScore": 0.814453125, "foil": true}, {"image_id": "COCO_val2014_000000322056", "CLIPScore": 0.7861328125, "foil": false}, {"image_id": "COCO_val2014_000000557172", "CLIPScore": 0.8291015625, "foil": false}, {"image_id": "COCO_val2014_000000169226", "CLIPScore": 0.80859375, "foil": false}, {"image_id": "COCO_val2014_000000290416", "CLIPScore": 0.7529296875, "foil": false}, {"image_id": "COCO_val2014_000000551633", "CLIPScore": 0.7822265625, "foil": true}, {"image_id": "COCO_val2014_000000311789", "CLIPScore": 0.783203125, "foil": true}, {"image_id": 
"COCO_val2014_000000208135", "CLIPScore": 0.7578125, "foil": false}, {"image_id": "COCO_val2014_000000137954", "CLIPScore": 0.77734375, "foil": false}, {"image_id": "COCO_val2014_000000091267", "CLIPScore": 0.81103515625, "foil": false}, {"image_id": "COCO_val2014_000000304741", "CLIPScore": 0.74658203125, "foil": false}, {"image_id": "COCO_val2014_000000460841", "CLIPScore": 0.837890625, "foil": true}, {"image_id": "COCO_val2014_000000291028", "CLIPScore": 0.8017578125, "foil": false}, {"image_id": "COCO_val2014_000000439290", "CLIPScore": 0.79150390625, "foil": false}, {"image_id": "COCO_val2014_000000441468", "CLIPScore": 0.80126953125, "foil": false}, {"image_id": "COCO_val2014_000000543570", "CLIPScore": 0.8115234375, "foil": false}, {"image_id": "COCO_val2014_000000472472", "CLIPScore": 0.80810546875, "foil": false}, {"image_id": "COCO_val2014_000000094379", "CLIPScore": 0.84033203125, "foil": true}, {"image_id": "COCO_val2014_000000381519", "CLIPScore": 0.83056640625, "foil": false}, {"image_id": "COCO_val2014_000000325958", "CLIPScore": 0.72802734375, "foil": true}, {"image_id": "COCO_val2014_000000109216", "CLIPScore": 0.8408203125, "foil": true}, {"image_id": "COCO_val2014_000000199510", "CLIPScore": 0.802734375, "foil": true}, {"image_id": "COCO_val2014_000000434829", "CLIPScore": 0.6845703125, "foil": false}, {"image_id": "COCO_val2014_000000066263", "CLIPScore": 0.8994140625, "foil": false}, {"image_id": "COCO_val2014_000000190760", "CLIPScore": 0.99072265625, "foil": false}, {"image_id": "COCO_val2014_000000229216", "CLIPScore": 0.833984375, "foil": true}, {"image_id": "COCO_val2014_000000429598", "CLIPScore": 0.8154296875, "foil": true}, {"image_id": "COCO_val2014_000000021232", "CLIPScore": 0.8115234375, "foil": false}, {"image_id": "COCO_val2014_000000130599", "CLIPScore": 0.7021484375, "foil": true}, {"image_id": "COCO_val2014_000000065306", "CLIPScore": 0.8896484375, "foil": false}, {"image_id": "COCO_val2014_000000547487", "CLIPScore": 
0.85986328125, "foil": false}, {"image_id": "COCO_val2014_000000358149", "CLIPScore": 0.61474609375, "foil": false}, {"image_id": "COCO_val2014_000000017959", "CLIPScore": 0.923828125, "foil": true}, {"image_id": "COCO_val2014_000000310902", "CLIPScore": 0.8017578125, "foil": false}, {"image_id": "COCO_val2014_000000160004", "CLIPScore": 0.83056640625, "foil": true}, {"image_id": "COCO_val2014_000000538064", "CLIPScore": 0.8193359375, "foil": false}, {"image_id": "COCO_val2014_000000125997", "CLIPScore": 1.04296875, "foil": false}, {"image_id": "COCO_val2014_000000002255", "CLIPScore": 0.853515625, "foil": true}, {"image_id": "COCO_val2014_000000153734", "CLIPScore": 0.837890625, "foil": false}, {"image_id": "COCO_val2014_000000371243", "CLIPScore": 0.88134765625, "foil": false}, {"image_id": "COCO_val2014_000000544237", "CLIPScore": 0.7001953125, "foil": false}, {"image_id": "COCO_val2014_000000002495", "CLIPScore": 0.853515625, "foil": false}, {"image_id": "COCO_val2014_000000498381", "CLIPScore": 0.92041015625, "foil": false}, {"image_id": "COCO_val2014_000000541550", "CLIPScore": 0.85205078125, "foil": true}, {"image_id": "COCO_val2014_000000303926", "CLIPScore": 0.83203125, "foil": true}, {"image_id": "COCO_val2014_000000115776", "CLIPScore": 0.71240234375, "foil": false}, {"image_id": "COCO_val2014_000000388927", "CLIPScore": 0.650390625, "foil": false}, {"image_id": "COCO_val2014_000000299987", "CLIPScore": 0.92578125, "foil": false}, {"image_id": "COCO_val2014_000000058225", "CLIPScore": 0.7841796875, "foil": false}, {"image_id": "COCO_val2014_000000501494", "CLIPScore": 0.8818359375, "foil": false}, {"image_id": "COCO_val2014_000000457453", "CLIPScore": 0.7705078125, "foil": false}, {"image_id": "COCO_val2014_000000114871", "CLIPScore": 0.87109375, "foil": false}, {"image_id": "COCO_val2014_000000005728", "CLIPScore": 0.97607421875, "foil": true}, {"image_id": "COCO_val2014_000000579602", "CLIPScore": 0.7822265625, "foil": true}, {"image_id": 
"COCO_val2014_000000322509", "CLIPScore": 0.87109375, "foil": false}, {"image_id": "COCO_val2014_000000461573", "CLIPScore": 0.77197265625, "foil": true}, {"image_id": "COCO_val2014_000000135155", "CLIPScore": 0.79345703125, "foil": false}, {"image_id": "COCO_val2014_000000249658", "CLIPScore": 0.76708984375, "foil": false}, {"image_id": "COCO_val2014_000000004678", "CLIPScore": 0.76708984375, "foil": false}, {"image_id": "COCO_val2014_000000079331", "CLIPScore": 0.93994140625, "foil": false}, {"image_id": "COCO_val2014_000000255769", "CLIPScore": 0.77392578125, "foil": false}, {"image_id": "COCO_val2014_000000002495", "CLIPScore": 0.814453125, "foil": true}, {"image_id": "COCO_val2014_000000342593", "CLIPScore": 0.955078125, "foil": false}, {"image_id": "COCO_val2014_000000257328", "CLIPScore": 0.8359375, "foil": true}, {"image_id": "COCO_val2014_000000451275", "CLIPScore": 0.8173828125, "foil": false}, {"image_id": "COCO_val2014_000000110265", "CLIPScore": 0.76708984375, "foil": false}, {"image_id": "COCO_val2014_000000121014", "CLIPScore": 0.70361328125, "foil": true}, {"image_id": "COCO_val2014_000000386032", "CLIPScore": 0.85009765625, "foil": false}, {"image_id": "COCO_val2014_000000138639", "CLIPScore": 0.7890625, "foil": true}, {"image_id": "COCO_val2014_000000380487", "CLIPScore": 0.75, "foil": false}, {"image_id": "COCO_val2014_000000221571", "CLIPScore": 0.9033203125, "foil": false}, {"image_id": "COCO_val2014_000000337984", "CLIPScore": 0.783203125, "foil": false}, {"image_id": "COCO_val2014_000000012959", "CLIPScore": 0.91259765625, "foil": false}, {"image_id": "COCO_val2014_000000514979", "CLIPScore": 0.7802734375, "foil": false}, {"image_id": "COCO_val2014_000000199688", "CLIPScore": 0.78662109375, "foil": false}, {"image_id": "COCO_val2014_000000575174", "CLIPScore": 0.78173828125, "foil": false}, {"image_id": "COCO_val2014_000000440528", "CLIPScore": 0.740234375, "foil": false}, {"image_id": "COCO_val2014_000000564355", "CLIPScore": 0.7822265625, 
"foil": true}, {"image_id": "COCO_val2014_000000351875", "CLIPScore": 0.705078125, "foil": false}, {"image_id": "COCO_val2014_000000437049", "CLIPScore": 0.78564453125, "foil": false}, {"image_id": "COCO_val2014_000000543409", "CLIPScore": 0.78076171875, "foil": true}, {"image_id": "COCO_val2014_000000198163", "CLIPScore": 0.67626953125, "foil": true}, {"image_id": "COCO_val2014_000000158583", "CLIPScore": 0.6337890625, "foil": true}, {"image_id": "COCO_val2014_000000124390", "CLIPScore": 0.80322265625, "foil": false}, {"image_id": "COCO_val2014_000000192192", "CLIPScore": 0.828125, "foil": false}, {"image_id": "COCO_val2014_000000155192", "CLIPScore": 0.79345703125, "foil": false}, {"image_id": "COCO_val2014_000000279386", "CLIPScore": 0.88916015625, "foil": false}, {"image_id": "COCO_val2014_000000407826", "CLIPScore": 0.79150390625, "foil": false}, {"image_id": "COCO_val2014_000000520273", "CLIPScore": 0.7392578125, "foil": false}, {"image_id": "COCO_val2014_000000538394", "CLIPScore": 0.85986328125, "foil": true}, {"image_id": "COCO_val2014_000000387833", "CLIPScore": 0.74462890625, "foil": false}, {"image_id": "COCO_val2014_000000278321", "CLIPScore": 0.73779296875, "foil": false}, {"image_id": "COCO_val2014_000000412621", "CLIPScore": 0.8359375, "foil": false}, {"image_id": "COCO_val2014_000000139623", "CLIPScore": 0.744140625, "foil": false}, {"image_id": "COCO_val2014_000000509577", "CLIPScore": 0.68310546875, "foil": true}, {"image_id": "COCO_val2014_000000422017", "CLIPScore": 0.8271484375, "foil": true}, {"image_id": "COCO_val2014_000000110231", "CLIPScore": 0.787109375, "foil": true}, {"image_id": "COCO_val2014_000000117759", "CLIPScore": 0.87939453125, "foil": false}, {"image_id": "COCO_val2014_000000083573", "CLIPScore": 0.82568359375, "foil": true}, {"image_id": "COCO_val2014_000000413043", "CLIPScore": 0.88623046875, "foil": false}, {"image_id": "COCO_val2014_000000437564", "CLIPScore": 0.74462890625, "foil": true}, {"image_id": 
"COCO_val2014_000000490366", "CLIPScore": 0.8154296875, "foil": false}, {"image_id": "COCO_val2014_000000007207", "CLIPScore": 0.8251953125, "foil": true}, {"image_id": "COCO_val2014_000000455044", "CLIPScore": 0.88134765625, "foil": false}, {"image_id": "COCO_val2014_000000475043", "CLIPScore": 0.72802734375, "foil": false}, {"image_id": "COCO_val2014_000000041369", "CLIPScore": 0.71484375, "foil": true}, {"image_id": "COCO_val2014_000000255149", "CLIPScore": 0.8427734375, "foil": true}, {"image_id": "COCO_val2014_000000066046", "CLIPScore": 0.98046875, "foil": true}, {"image_id": "COCO_val2014_000000184613", "CLIPScore": 0.869140625, "foil": false}, {"image_id": "COCO_val2014_000000489550", "CLIPScore": 0.77197265625, "foil": false}, {"image_id": "COCO_val2014_000000309571", "CLIPScore": 0.6962890625, "foil": true}, {"image_id": "COCO_val2014_000000516026", "CLIPScore": 0.90234375, "foil": true}, {"image_id": "COCO_val2014_000000029444", "CLIPScore": 0.80029296875, "foil": true}, {"image_id": "COCO_val2014_000000019306", "CLIPScore": 0.88623046875, "foil": true}, {"image_id": "COCO_val2014_000000511236", "CLIPScore": 0.9072265625, "foil": false}, {"image_id": "COCO_val2014_000000056302", "CLIPScore": 0.751953125, "foil": true}, {"image_id": "COCO_val2014_000000512416", "CLIPScore": 0.8681640625, "foil": true}, {"image_id": "COCO_val2014_000000258905", "CLIPScore": 0.6796875, "foil": false}, {"image_id": "COCO_val2014_000000073622", "CLIPScore": 0.6064453125, "foil": false}, {"image_id": "COCO_val2014_000000469030", "CLIPScore": 0.80810546875, "foil": false}, {"image_id": "COCO_val2014_000000115069", "CLIPScore": 0.865234375, "foil": false}, {"image_id": "COCO_val2014_000000419560", "CLIPScore": 0.837890625, "foil": false}, {"image_id": "COCO_val2014_000000354744", "CLIPScore": 0.82080078125, "foil": false}, {"image_id": "COCO_val2014_000000378244", "CLIPScore": 0.81640625, "foil": false}, {"image_id": "COCO_val2014_000000527022", "CLIPScore": 0.69384765625, 
"foil": true}, {"image_id": "COCO_val2014_000000016161", "CLIPScore": 0.865234375, "foil": false}, {"image_id": "COCO_val2014_000000569030", "CLIPScore": 0.837890625, "foil": true}, {"image_id": "COCO_val2014_000000187852", "CLIPScore": 0.8349609375, "foil": false}, {"image_id": "COCO_val2014_000000100624", "CLIPScore": 0.9306640625, "foil": false}, {"image_id": "COCO_val2014_000000092771", "CLIPScore": 0.84765625, "foil": false}, {"image_id": "COCO_val2014_000000425870", "CLIPScore": 0.689453125, "foil": true}, {"image_id": "COCO_val2014_000000268229", "CLIPScore": 0.77392578125, "foil": true}, {"image_id": "COCO_val2014_000000233848", "CLIPScore": 0.779296875, "foil": false}, {"image_id": "COCO_val2014_000000011760", "CLIPScore": 0.89306640625, "foil": false}, {"image_id": "COCO_val2014_000000249227", "CLIPScore": 0.73828125, "foil": false}, {"image_id": "COCO_val2014_000000046345", "CLIPScore": 0.759765625, "foil": false}, {"image_id": "COCO_val2014_000000033697", "CLIPScore": 0.8203125, "foil": false}, {"image_id": "COCO_val2014_000000097659", "CLIPScore": 0.87939453125, "foil": false}, {"image_id": "COCO_val2014_000000257137", "CLIPScore": 0.9443359375, "foil": false}, {"image_id": "COCO_val2014_000000413287", "CLIPScore": 0.97119140625, "foil": false}, {"image_id": "COCO_val2014_000000477750", "CLIPScore": 0.72802734375, "foil": false}, {"image_id": "COCO_val2014_000000550432", "CLIPScore": 0.71826171875, "foil": false}, {"image_id": "COCO_val2014_000000486905", "CLIPScore": 0.96484375, "foil": false}, {"image_id": "COCO_val2014_000000352789", "CLIPScore": 0.849609375, "foil": false}, {"image_id": "COCO_val2014_000000172649", "CLIPScore": 0.7734375, "foil": false}, {"image_id": "COCO_val2014_000000101828", "CLIPScore": 0.85546875, "foil": false}, {"image_id": "COCO_val2014_000000172553", "CLIPScore": 0.72314453125, "foil": true}, {"image_id": "COCO_val2014_000000139917", "CLIPScore": 0.779296875, "foil": true}, {"image_id": "COCO_val2014_000000395364", 
"CLIPScore": 0.6875, "foil": true}, {"image_id": "COCO_val2014_000000351133", "CLIPScore": 0.90380859375, "foil": false}, {"image_id": "COCO_val2014_000000548500", "CLIPScore": 0.8349609375, "foil": true}, {"image_id": "COCO_val2014_000000372070", "CLIPScore": 0.68310546875, "foil": false}, {"image_id": "COCO_val2014_000000360772", "CLIPScore": 0.703125, "foil": false}, {"image_id": "COCO_val2014_000000024144", "CLIPScore": 0.87939453125, "foil": false}, {"image_id": "COCO_val2014_000000083573", "CLIPScore": 0.828125, "foil": false}, {"image_id": "COCO_val2014_000000318645", "CLIPScore": 0.7724609375, "foil": true}, {"image_id": "COCO_val2014_000000350668", "CLIPScore": 0.79150390625, "foil": true}, {"image_id": "COCO_val2014_000000340559", "CLIPScore": 1.013671875, "foil": true}, {"image_id": "COCO_val2014_000000081782", "CLIPScore": 0.822265625, "foil": true}, {"image_id": "COCO_val2014_000000296404", "CLIPScore": 0.8125, "foil": false}, {"image_id": "COCO_val2014_000000220732", "CLIPScore": 0.74169921875, "foil": true}, {"image_id": "COCO_val2014_000000569415", "CLIPScore": 0.8408203125, "foil": false}, {"image_id": "COCO_val2014_000000117563", "CLIPScore": 0.7109375, "foil": false}, {"image_id": "COCO_val2014_000000125208", "CLIPScore": 0.85009765625, "foil": true}, {"image_id": "COCO_val2014_000000030012", "CLIPScore": 0.6943359375, "foil": false}, {"image_id": "COCO_val2014_000000395463", "CLIPScore": 0.66455078125, "foil": false}, {"image_id": "COCO_val2014_000000389316", "CLIPScore": 0.7255859375, "foil": true}, {"image_id": "COCO_val2014_000000255769", "CLIPScore": 0.720703125, "foil": false}, {"image_id": "COCO_val2014_000000031748", "CLIPScore": 0.76611328125, "foil": false}, {"image_id": "COCO_val2014_000000297374", "CLIPScore": 0.82080078125, "foil": false}, {"image_id": "COCO_val2014_000000310902", "CLIPScore": 0.7861328125, "foil": false}, {"image_id": "COCO_val2014_000000014248", "CLIPScore": 0.8359375, "foil": false}, {"image_id": 
"COCO_val2014_000000444491", "CLIPScore": 0.7939453125, "foil": true}, {"image_id": "COCO_val2014_000000474465", "CLIPScore": 0.88134765625, "foil": false}, {"image_id": "COCO_val2014_000000049682", "CLIPScore": 0.80859375, "foil": false}, {"image_id": "COCO_val2014_000000139917", "CLIPScore": 0.8330078125, "foil": false}, {"image_id": "COCO_val2014_000000495376", "CLIPScore": 0.7451171875, "foil": false}, {"image_id": "COCO_val2014_000000559277", "CLIPScore": 0.85400390625, "foil": false}, {"image_id": "COCO_val2014_000000360182", "CLIPScore": 0.84521484375, "foil": true}, {"image_id": "COCO_val2014_000000120860", "CLIPScore": 0.87109375, "foil": true}, {"image_id": "COCO_val2014_000000226592", "CLIPScore": 0.89306640625, "foil": false}, {"image_id": "COCO_val2014_000000233005", "CLIPScore": 0.67578125, "foil": true}, {"image_id": "COCO_val2014_000000468736", "CLIPScore": 0.86376953125, "foil": false}, {"image_id": "COCO_val2014_000000034869", "CLIPScore": 0.76708984375, "foil": false}, {"image_id": "COCO_val2014_000000179045", "CLIPScore": 0.65185546875, "foil": false}, {"image_id": "COCO_val2014_000000136846", "CLIPScore": 0.7421875, "foil": false}, {"image_id": "COCO_val2014_000000189213", "CLIPScore": 0.78515625, "foil": false}, {"image_id": "COCO_val2014_000000435358", "CLIPScore": 0.76171875, "foil": false}, {"image_id": "COCO_val2014_000000207056", "CLIPScore": 0.921875, "foil": false}, {"image_id": "COCO_val2014_000000276146", "CLIPScore": 0.833984375, "foil": false}, {"image_id": "COCO_val2014_000000251627", "CLIPScore": 0.7109375, "foil": true}, {"image_id": "COCO_val2014_000000332113", "CLIPScore": 0.791015625, "foil": true}, {"image_id": "COCO_val2014_000000560993", "CLIPScore": 0.921875, "foil": false}, {"image_id": "COCO_val2014_000000217827", "CLIPScore": 0.76953125, "foil": true}, {"image_id": "COCO_val2014_000000186009", "CLIPScore": 0.892578125, "foil": false}, {"image_id": "COCO_val2014_000000327436", "CLIPScore": 0.7158203125, "foil": false}, 
{"image_id": "COCO_val2014_000000419309", "CLIPScore": 0.5986328125, "foil": false}, {"image_id": "COCO_val2014_000000518914", "CLIPScore": 0.78173828125, "foil": false}, {"image_id": "COCO_val2014_000000226097", "CLIPScore": 0.787109375, "foil": true}, {"image_id": "COCO_val2014_000000004108", "CLIPScore": 0.62548828125, "foil": true}, {"image_id": "COCO_val2014_000000282150", "CLIPScore": 0.8818359375, "foil": false}, {"image_id": "COCO_val2014_000000149197", "CLIPScore": 0.83056640625, "foil": false}, {"image_id": "COCO_val2014_000000232654", "CLIPScore": 0.80078125, "foil": true}, {"image_id": "COCO_val2014_000000147173", "CLIPScore": 0.8095703125, "foil": true}, {"image_id": "COCO_val2014_000000211743", "CLIPScore": 0.78662109375, "foil": true}, {"image_id": "COCO_val2014_000000455610", "CLIPScore": 0.6787109375, "foil": false}, {"image_id": "COCO_val2014_000000358642", "CLIPScore": 0.77587890625, "foil": true}, {"image_id": "COCO_val2014_000000218470", "CLIPScore": 0.7177734375, "foil": false}, {"image_id": "COCO_val2014_000000157767", "CLIPScore": 0.62451171875, "foil": true}, {"image_id": "COCO_val2014_000000234676", "CLIPScore": 0.75634765625, "foil": false}, {"image_id": "COCO_val2014_000000239355", "CLIPScore": 0.74267578125, "foil": true}, {"image_id": "COCO_val2014_000000327918", "CLIPScore": 0.6962890625, "foil": false}, {"image_id": "COCO_val2014_000000044621", "CLIPScore": 0.7509765625, "foil": true}, {"image_id": "COCO_val2014_000000017655", "CLIPScore": 0.8203125, "foil": false}, {"image_id": "COCO_val2014_000000005124", "CLIPScore": 0.8115234375, "foil": false}, {"image_id": "COCO_val2014_000000029573", "CLIPScore": 0.76904296875, "foil": false}, {"image_id": "COCO_val2014_000000258402", "CLIPScore": 0.8046875, "foil": true}, {"image_id": "COCO_val2014_000000056288", "CLIPScore": 0.93701171875, "foil": false}, {"image_id": "COCO_val2014_000000273825", "CLIPScore": 0.7919921875, "foil": false}, {"image_id": "COCO_val2014_000000076619", 
"CLIPScore": 0.73291015625, "foil": true}, {"image_id": "COCO_val2014_000000532481", "CLIPScore": 0.81298828125, "foil": false}, {"image_id": "COCO_val2014_000000509867", "CLIPScore": 0.71533203125, "foil": true}, {"image_id": "COCO_val2014_000000255338", "CLIPScore": 0.83984375, "foil": false}, {"image_id": "COCO_val2014_000000125850", "CLIPScore": 0.73779296875, "foil": true}, {"image_id": "COCO_val2014_000000131593", "CLIPScore": 0.7548828125, "foil": true}, {"image_id": "COCO_val2014_000000564629", "CLIPScore": 0.7822265625, "foil": false}, {"image_id": "COCO_val2014_000000268092", "CLIPScore": 0.91064453125, "foil": true}, {"image_id": "COCO_val2014_000000441468", "CLIPScore": 0.771484375, "foil": true}, {"image_id": "COCO_val2014_000000548957", "CLIPScore": 0.6513671875, "foil": false}, {"image_id": "COCO_val2014_000000203878", "CLIPScore": 0.75634765625, "foil": false}, {"image_id": "COCO_val2014_000000423256", "CLIPScore": 0.7900390625, "foil": true}, {"image_id": "COCO_val2014_000000519094", "CLIPScore": 0.8681640625, "foil": false}, {"image_id": "COCO_val2014_000000061773", "CLIPScore": 0.67578125, "foil": false}, {"image_id": "COCO_val2014_000000466787", "CLIPScore": 0.8056640625, "foil": false}, {"image_id": "COCO_val2014_000000337533", "CLIPScore": 0.87939453125, "foil": false}, {"image_id": "COCO_val2014_000000412586", "CLIPScore": 0.8515625, "foil": false}, {"image_id": "COCO_val2014_000000293071", "CLIPScore": 0.94140625, "foil": false}, {"image_id": "COCO_val2014_000000304305", "CLIPScore": 0.818359375, "foil": false}, {"image_id": "COCO_val2014_000000483893", "CLIPScore": 0.76953125, "foil": true}, {"image_id": "COCO_val2014_000000399164", "CLIPScore": 0.77099609375, "foil": true}, {"image_id": "COCO_val2014_000000435309", "CLIPScore": 0.69580078125, "foil": false}, {"image_id": "COCO_val2014_000000142000", "CLIPScore": 0.779296875, "foil": true}]```
Ah, it does look like it may not be entirely deterministic -- I can't remember if I ran these experiments initially or one of the other team members (@spetryk or Anish) ran the clip-score experiments since it was a benchmark method (and not our ALOHa method). I've attached my environment.yml file from Conda, but we didn't pin the versions between team members, so there's no guarantee that this is the exact conda version set.
I've also attached the full set of archived results I have on HAT (which I think are the ones we used in the paper, but @spetryk compiled the final results so I'm not absolutely certain).
Thank you for the detailed response! 1) With your JSON, I can reproduce the score reported in the paper using sklearn, so the discrepancy comes from the CLIPScore values themselves, not from the metric.
```python
import json

from sklearn.metrics import average_precision_score

with open("../data/clipscore.json", "r") as f:
    output = json.load(f)

clips = []
foils = []
for sample in output:
    # Negate CLIPScore so that a higher value means "more likely hallucinated".
    clips.append(-sample["CLIPScore"]["CLIPScore"])
    # Positive label (1) = the caption contains a hallucination.
    foils.append(int(sample["contains_hallucination"]))

average_precision_score(foils, clips)
# 0.400964714203247
```
2) I found that some samples differ significantly (0.16), while differences in dependencies account for at most a 0.03 difference.
{"image_id": "COCO_val2014_000000016903", "CLIPScore": 0.8759765625, "foil": true}
[{"image_id": "COCO_val2014_000000016903.jpg", "contains_hallucination": true, "CLIPScore": {"CLIPScore": 0.68603515625, "RefCLIPScore": 0.7177734375}}
Is it possible that the JSON you provided was produced with `RN50x64`, or that the HAT dataset has changed?
3) I checked my environment against the CLIPScore repo, and it gives exactly the same value:
> python clipscore.py example/good_captions.json example/images/
...
CLIPScore: 0.8584
I also checked my environment against another repo that reports CLIPScore, and its value is exactly the same as in my environment. I looked through your YAML file but could not find any significantly different packages (specifically torch, torchvision, Pillow).
4) Lastly, can you confirm that the prefix `A photo depicts` was used to produce the results? Without the prefix, I get results closer to the reported ones.
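To locate samples like the one in item 2, the two result files can be joined on image id and sorted by score gap. This is only a sketch: the helper names are made up for illustration, and it assumes one record per image id, with the two record layouts quoted above (flat `CLIPScore` and no `.jpg` suffix in one file, nested `CLIPScore` and a `.jpg` suffix in the other).

```python
def normalize_id(image_id):
    """Strip a trailing '.jpg' so ids from both files line up."""
    return image_id[:-4] if image_id.endswith(".jpg") else image_id


def get_score(sample):
    """Return the CLIPScore whether it is stored flat or nested."""
    value = sample["CLIPScore"]
    return value["CLIPScore"] if isinstance(value, dict) else value


def largest_gaps(mine, theirs, top_k=5):
    """Return (image_id, my_score, their_score, abs_diff), largest gap first."""
    theirs_by_id = {normalize_id(s["image_id"]): get_score(s) for s in theirs}
    rows = []
    for sample in mine:
        key = normalize_id(sample["image_id"])
        if key in theirs_by_id:
            a, b = get_score(sample), theirs_by_id[key]
            rows.append((key, a, b, abs(a - b)))
    return sorted(rows, key=lambda r: r[3], reverse=True)[:top_k]


# The two records quoted above for the same image.
mine = [{"image_id": "COCO_val2014_000000016903", "CLIPScore": 0.8759765625, "foil": True}]
theirs = [{"image_id": "COCO_val2014_000000016903.jpg", "contains_hallucination": True,
           "CLIPScore": {"CLIPScore": 0.68603515625, "RefCLIPScore": 0.7177734375}}]
for row in largest_gaps(mine, theirs):
    print(row)
```

Running this over the full files would show whether the gap is uniform (which would point to a different backbone or prompt) or concentrated in a few samples (which would point to changed data).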
It's good that you're able to reproduce the summarized results in the paper with our outputs.
We used the code in our repo for computing CLIPScore, so if something is not present in our repo, then we didn't use it to produce the numbers. This likely means that we used the RN50x64 model without the prefix (indeed, I wasn't even aware that a forced prefix was part of the original codebase; it is probably at least part of the cause of the discrepancy, since many of the sentences in HAT already have a similar prefix, which could lead to disfluent sentences if an unwarranted prefix is added).
Edit: I looked a bit closer at the code in our repo, and it looks like the prefix is still present (we drew our base code from the original repo) in a default argument. You can run our code with the evaluator here: https://github.com/DavidMChan/aloha/blob/e38d69e0004a044254cef2641985c7ae4e01efd4/src/aloha/metrics/clipscore.py#L210
I wonder - can you reproduce the CLIP scores with our version of the code? You can do so by running:
aloha evaluate_dataset -m clipscore path/to/dataset.json
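On the prefix and backbone questions above, it may help to recall how the score is formed. As defined in the CLIPScore paper, the score is `w * max(cos(image, caption), 0)` with `w = 2.5`, where the caption is embedded after the prompt prefix is prepended. The sketch below is not the repo's code: `with_prefix` and `clipscore` are illustrative names, and the CLIP encoders are left out, so the function takes already-computed embedding vectors.

```python
import math

CLIPSCORE_WEIGHT = 2.5  # rescaling constant w from the CLIPScore paper


def with_prefix(caption, prefix="A photo depicts"):
    """Prepend the prompt prefix; pass prefix="" to score the raw caption."""
    if not prefix:
        return caption
    return prefix.rstrip() + " " + caption


def clipscore(image_emb, text_emb, w=CLIPSCORE_WEIGHT):
    """Compute w * max(cos(image_emb, text_emb), 0) for two embedding vectors."""
    dot = sum(a * b for a, b in zip(image_emb, text_emb))
    norm = math.sqrt(sum(a * a for a in image_emb)) * math.sqrt(sum(b * b for b in text_emb))
    return w * max(dot / norm, 0.0)


# Toy vectors standing in for CLIP image and text embeddings.
print(with_prefix("a dog on a couch"))
print(clipscore([1.0, 0.0], [1.0, 1.0]))
```

Both the backbone and the prefix change the text or image embedding, so either one shifts every per-sample score. The factor of 2.5 is also why individual scores can exceed 1.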
Running the `aloha` command raised several issues:
1) Maybe only in my environment, but the SentenceTransformer trainer is incompatible with my other dependencies, so I had to comment out everything in the SentenceTransformer package that imports `transformers.trainer`.
2) The command should be `aloha evaluate-dataset`, not `evaluate_dataset`:
Try 'aloha --help' for help.
Error: No such command 'evaluate_dataset'.
Usage: aloha [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
Commands:
evaluate-dataset
3) `CLIPScoreMetrics` does not implement the abstract `evaluate_dataset` method, so I was not able to run it this way.
return __callback(*args, **kwargs)
File "/home/nsml/.local/lib/python3.8/site-packages/aloha/dataset.py", line 93, in evaluate_dataset
_mf = _mf()
TypeError: Can't instantiate abstract class CLIPScoreMetrics with abstract methods evaluate_dataset
4) When I import the `CLIPScoreMetrics` class in Python and try to instantiate it, it fails with the same error:
>>> from aloha.metrics import ALOHa, CLIPScoreMetrics
2024-07-17 01:53:02.233563: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /home/nsml/nltk_data...
[nltk_data] Package averaged_perceptron_tagger is already up-to-
[nltk_data] date!
>>> evaluator = CLIPScoreMetrics()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: Can't instantiate abstract class CLIPScoreMetrics with abstract methods evaluate_dataset
CLIPScore is based on `ViT-B/32`, so I think this should be fixed if the reported score is based on `RN50x64`. In my environment it scores `38.97`, which is lower than the reported value, so it should not be a big issue.
It would be very helpful if you could fix the repo so that CLIPScore can be evaluated, and share the result you get in your environment.
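The `TypeError` in items 3 and 4 is the standard Python behavior for a subclass that leaves an abstract method unimplemented. A minimal reproduction follows; the class names here are hypothetical stand-ins, not the repo's actual classes.

```python
from abc import ABC, abstractmethod


class BaseMetric(ABC):  # hypothetical stand-in for the repo's metric base class
    @abstractmethod
    def evaluate_dataset(self, samples):
        ...


class BrokenMetric(BaseMetric):
    """Does not override evaluate_dataset, so the class stays abstract."""

    def score(self, sample):
        return 0.0


class FixedMetric(BaseMetric):
    """Overrides evaluate_dataset, so it can be instantiated."""

    def score(self, sample):
        return 0.0

    def evaluate_dataset(self, samples):
        return [self.score(s) for s in samples]


try:
    BrokenMetric()
except TypeError as err:
    print("BrokenMetric:", err)

print(FixedMetric().evaluate_dataset([{"caption": "a"}, {"caption": "b"}]))
```

So the likely fix on the repo side is for `CLIPScoreMetrics` to define `evaluate_dataset` itself, though the exact signature the base class expects is not shown in this thread.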
Thanks for the heads up on this! I'll fix these things in the repo (hopefully before early next week, and circle back when the commits are made). I recall we ran several variants of CLIPScore in order to get the best possible CLIPScore results - so that's likely the reason that RN50x64 was used instead. We can update the repo to indicate this.
Thanks a lot for your support! It would be helpful to the community if you reported both RN50x64 and ViT-B/32 results. I look forward to hearing back from you.
I just pushed a bug fix commit here: https://github.com/DavidMChan/aloha/commit/7da6b90ebe392228ea532a84d78eab830c2b3cb2
Can you please try this, and see if you're still getting divergent numbers?
With your revised version, the result matches mine (`38.97`)!
So the divergence came from the backbone (`ViT-B/32` vs `RN50x64`).
It would be helpful to the community to add a footnote to your paper stating that the result uses the RN50x64 backbone. Otherwise, people will assume it is the ViT-B/32 variant.
Thank you so much!
Hmm, interesting! Thanks for pointing out this discrepancy, it's interesting to see that CLIPScore is even worse than expected. I'll ping @spetryk to update.
I am trying to reproduce the Table 1 CLIPScore result on the HAT dataset. I ran the original CLIPScore repo (slightly changed to also return the foil label), and it returns this.
Then I ran this snippet, which is not provided in the aloha repo,
and it returns `38.97`, while Table 1 reports a CLIPScore AP of `40.10`. Can you help me find which part I missed?
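The snippet itself is not shown here. As a rough sketch of what such an AP computation looks like, assuming the hallucinated ("foil") caption is the positive class and the negated CLIPScore is the ranking score, the standard definition AP = Σ (R_n − R_{n−1}) · P_n can be written in plain Python. It should agree with `sklearn.metrics.average_precision_score`, which uses the same definition. The toy values below are made up, not HAT data.

```python
def average_precision(labels, scores):
    """AP = sum_n (R_n - R_{n-1}) * P_n over distinct score thresholds, high to low."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ap, tp, seen, prev_recall = 0.0, 0, 0, 0.0
    i = 0
    while i < len(order):
        # Consume every sample tied at the current threshold before scoring it.
        threshold = scores[order[i]]
        while i < len(order) and scores[order[i]] == threshold:
            tp += labels[order[i]]
            seen += 1
            i += 1
        recall = tp / total_pos
        ap += (recall - prev_recall) * (tp / seen)
        prev_recall = recall
    return ap


# Made-up toy data: 1 = hallucinated caption ("foil"), 0 = correct caption.
foils = [1, 0, 1, 0]
clip_scores = [0.60, 0.65, 0.80, 0.90]

# Low CLIPScore should indicate a hallucination, so rank by the negated score.
ap_negated = average_precision(foils, [-s for s in clip_scores])
ap_raw = average_precision(foils, clip_scores)
print(100 * ap_negated, 100 * ap_raw)
```

The sign convention matters: ranking by the raw CLIPScore instead of its negation scores the opposite hypothesis and gives a different AP, so it is worth checking before comparing against Table 1.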