Hi @fauconnier! Thanks for pointing this out. Let me check what's going on: this example is not the intended behavior.

Taking some public notes... This is how we do the image-text assignments:
```python
import numpy as np

def get_image_assignments(im2txt):
    '''
    Returns a list `assignments` of length N_images such that assignments[i]
    is the sentence index that image i was assigned to.
    '''
    # if there are more images than texts, not quite sure what to do...
    # base_solve (defined below) minimizes cost, so negate the similarities
    im_idxs_s, txt_idxs_s, _ = base_solve(-im2txt)
    im2txt_idxs = {im_idxs_s[k]: txt_idxs_s[k] for k in range(len(im_idxs_s))}
    if im2txt.shape[0] > im2txt.shape[1]:
        # there are more images than sentences. we don't want to discard images,
        # so each unassigned image is paired with its max-similarity sentence.
        for imidx in range(len(im2txt)):
            if imidx not in im2txt_idxs:
                im2txt_idxs[imidx] = int(np.argmax(im2txt[imidx]))
    return [im2txt_idxs[idx] for idx in range(len(im2txt_idxs))]
```
where the `base_solve` function is:
```python
import numpy as np
from lapjv import lapjv  # pip install lapjv

def base_solve(W, max_dummy_cost_value=1000):
    '''
    Hungarian solve for a (possibly non-square) cost matrix. It's roughly from:
    https://github.com/jmhessel/multi-retrieval/blob/master/bipartite_utils.py

    NOTE: this **MINIMIZES COST**. So, if you're handing in sims, make sure
    to negate them!

    Returns i_s, j_s, cost such that the (i, j) for i, j in zip(i_s, j_s)
    are the selected row/column entries, and
    cost is sum(W[i, j] for i, j in zip(i_s, j_s)).
    '''
    if np.sum(np.abs(W)) > max_dummy_cost_value:
        print('Warning: the values in your matrix may be too big; please raise max_dummy_cost_value')

    # pad non-square matrices to square with high-cost dummy entries
    orig_shape = W.shape
    if orig_shape[0] != orig_shape[1]:
        if orig_shape[0] > orig_shape[1]:
            pad_idxs = [[0, 0], [0, W.shape[0] - W.shape[1]]]
            col_pad = True
        else:
            pad_idxs = [[0, W.shape[1] - W.shape[0]], [0, 0]]
            col_pad = False
        W = np.pad(W, pad_idxs, 'constant', constant_values=max_dummy_cost_value)

    sol, _, _ = lapjv(W)  # sol[i] = column assigned to row i
    i_s = np.arange(len(sol))
    j_s = sol[i_s]

    # order the selected entries by decreasing cost
    sort_idxs = np.argsort(-W[i_s, j_s])
    i_s, j_s = map(lambda x: x[sort_idxs], [i_s, j_s])

    # drop assignments that land in the dummy (padded) rows/columns
    if orig_shape[0] != orig_shape[1]:
        if col_pad:
            valid_idxs = np.where(j_s < orig_shape[1])[0]
        else:
            valid_idxs = np.where(i_s < orig_shape[0])[0]
        i_s, j_s = i_s[valid_idxs], j_s[valid_idxs]

    m_cost = 0.0
    for i, j in zip(i_s, j_s):
        m_cost += W[i, j]
    return i_s, j_s, m_cost
```
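(Side note: `scipy.optimize.linear_sum_assignment` handles rectangular matrices natively and can maximize similarity directly, so it makes a handy cross-check for `base_solve` --- a minimal sketch, not what mmc4 uses:)

```python
from scipy.optimize import linear_sum_assignment

def scipy_assignments(im2txt):
    # rows = images, cols = sentences; maximizes total similarity directly,
    # so no negation or square padding is needed
    im_idxs, txt_idxs = linear_sum_assignment(im2txt, maximize=True)
    return im_idxs, txt_idxs
```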
When I run that similarity matrix through the assignment code above, I get what I believe to be the correct assignment:
```
im  txt  sim
0   0    0.33591771125793457
1   1    0.27460938692092896
2   2    0.15680742263793945
3   3    0.237198144197464
```
But this is not reflected in the `matched_text_index` field or the `matched_sim` field. I'll need to take a closer look at this shortly.
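For reference, a table like the one above can be produced with something along these lines (a sketch; `im2txt` is the 4x4 image-to-sentence similarity matrix for this document):

```python
# negate the similarities because base_solve minimizes cost
i_s, j_s, _ = base_solve(-im2txt)

print('im txt sim')
for i, j in sorted(zip(i_s, j_s)):
    print(i, j, im2txt[i, j])
```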
Hi @fauconnier --- I tracked down the bug! Thanks for reporting it. We didn't notice it because it only affects a subset of documents, and the issue was hard to track down. Here's what's going on:
For shard 10063: in the full case, at most 398/4529 (8.79%) documents have an incorrect assignment, and for the mmc4 core documents in the same shard, at most 48/350 (13.71%) documents have an incorrect assignment.

The bug: when computing the `matched_text_index` field, we used `==` (the equality test operator) instead of `=`, the assignment operator. Looking through the code, the only fields that should be impacted are `matched_text_index` and `matched_sim`.
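For concreteness, here's the failure mode in miniature (a hypothetical reconstruction of the pattern, not the actual mmc4 source; the field names are from this thread):

```python
image_info = [{'matched_text_index': -1, 'matched_sim': None}]
im_idx, txt_idx, sim = 0, 3, 0.237198144197464

# buggy: '==' evaluates the comparison and discards the result,
# so the field silently keeps its stale value
image_info[im_idx]['matched_text_index'] == txt_idx
assert image_info[im_idx]['matched_text_index'] == -1  # still stale!

# fixed: '=' actually stores the assignment
image_info[im_idx]['matched_text_index'] = txt_idx
image_info[im_idx]['matched_sim'] = sim
```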
Thanks for helping us track this down! I can handle the updates and next steps from here --- I'll keep you posted once I've fixed everything.
More updates:

- I added the script we used to compute assignments, which will now save the correct assignment: https://github.com/allenai/mmc4/blob/main/scripts/compute_assignments.py
- As soon as I can, I will run this over the whole database and update the mmc4 documents in-place (a sketch of what that looks like is below).
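Roughly, the in-place update looks like this --- a minimal sketch, not the actual script; `sim_matrix_for` is a hypothetical helper that rebuilds the image-to-sentence similarity matrix for a document:

```python
import json

def update_shard_in_place(path, sim_matrix_for):
    with open(path) as f:
        docs = [json.loads(line) for line in f]
    for doc in docs:
        # hypothetical helper: returns the N_images x N_sentences sim matrix
        im2txt = sim_matrix_for(doc)
        for i, t in enumerate(get_image_assignments(im2txt)):
            doc['image_info'][i]['matched_text_index'] = t
            doc['image_info'][i]['matched_sim'] = float(im2txt[i, t])
    with open(path, 'w') as f:
        for doc in docs:
            f.write(json.dumps(doc) + '\n')
```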
Fantastic. Thank you @jmhessel for the quick turnaround!
Hi @fauconnier, I wrote up the fix and am running it over everything. v1.1 of the corpus will be out ASAP. Here's what this doc looks like in the new version :-)
```python
{'image_info': [{'face_detections': None,
                 'image_name': '202706b24ac4.png',
                 'matched_sim': 0.33591771125793457,
                 'matched_text_index': 0,
                 'raw_url': 'https://st.hzcdn.com/static/badge_16_8@2x.png'},
                {'face_detections': None,
                 'image_name': 'a97c58871c38.png',
                 'matched_sim': 0.27460938692092896,
                 'matched_text_index': 1,
                 'raw_url': 'https://st.hzcdn.com/static/badge_34_9@2x.png'},
                {'face_detections': None,
                 'image_name': 'ce3c9aa070ce.png',
                 'matched_sim': 0.15680742263793945,
                 'matched_text_index': 2,
                 'raw_url': 'https://st.hzcdn.com/static/badge_19_9@2x.png'},
                {'face_detections': None,
                 'image_name': 'c22425c7d977.png',
                 'matched_sim': 0.237198144197464,
                 'matched_text_index': 3,
                 'raw_url': 'https://st.hzcdn.com/static/badge_20_9@2x.png'}],
```
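As a quick sanity check that the fix addresses the original report (a sketch; `doc` is the dict above, truncated here):

```python
idxs = [im['matched_text_index'] for im in doc['image_info']]
# with 4 images and at least 4 sentences, bipartite matching should give
# each image its own sentence, so no index should repeat
assert len(idxs) == len(set(idxs)), 'multiple images aligned to one text span!'
```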
Hey @jmhessel, what is the ETA for this? I hope everything is running smoothly :)
Thanks so much for the prompt response on this!
Hi Alessandro --- hope you're well! Actually, the new versions are ready and passing all of my known checks. I was about to push the release, but then I saw https://github.com/allenai/mmc4/issues/12, so I am checking on that now.
Marking this as resolved by https://github.com/allenai/mmc4/pull/13
Dear authors,
Thanks for releasing MMC4.
In the paper, the following is stated:

However, we found examples where multiple images are aligned to a text span. For instance, consider the following example in `./docs_shard_10063_v3.jsonl`. Is that intended?
Thanks for any pointers.