allenai / mmc4

MultimodalC4 is a multimodal extension of c4 that interleaves millions of images with text.
MIT License

Multiple images for an identical matched_text_index #11

Closed fauconnier closed 1 year ago

fauconnier commented 1 year ago

Dear authors,

Thanks for releasing MMC4.

In the paper the following is stated:

"we use [14] to compute a bipartite assignment of images to sentences, under the constraint that each sentence can only be assigned a single image." , "For documents with more images than sentences, after assigning an image to each sentence, we assign according to max similarity.".

However, we found examples where multiple images are aligned to a text span. For instance, consider the following example in ./docs_shard_10063_v3.jsonl.

{
    "url": "http://easydesigns.biz/easydesigns-wins-3rd-consecutive-best-of-houzz-award/",
    "text_list": [
        "cherry hill, nj, january 19, 2015 \u2013 easydesigns of cherry hill, nj has been awarded \u201cbest of houzz\u201d for customer satisfaction by houzz, the leading platform for home remodeling and design.",
        "the interior design and real estate staging firm, in business since 2005, was chosen by the more than 25 million monthly unique users that comprise the houzz community from among more than 500,000 active home building, remodeling and design industry professionals.",
        "\u201ci am so happy to be selected for the 3rd consecutive year.",
        "customer satisfaction is a primary goal of my firm so i am thrilled to be recognized by such a large and prominent community\u201d, said beth secosky, owner of easydesigns."
    ],
    "image_info": [
        {
            "image_name": "202706b24ac4.png",
            "raw_url": "https://st.hzcdn.com/static/badge_16_8@2x.png",
            "matched_text_index": 0,
            "matched_sim": 0.33591771125793457,
            "face_detections": null
        },
        {
            "image_name": "a97c58871c38.png",
            "raw_url": "https://st.hzcdn.com/static/badge_34_9@2x.png",
            "matched_text_index": 0,
            "matched_sim": 0.31495240330696106,
            "face_detections": null
        },
        {
            "image_name": "ce3c9aa070ce.png",
            "raw_url": "https://st.hzcdn.com/static/badge_19_9@2x.png",
            "matched_text_index": 1,
            "matched_sim": 0.2770630717277527,
            "face_detections": null
        },
        {
            "image_name": "c22425c7d977.png",
            "raw_url": "https://st.hzcdn.com/static/badge_20_9@2x.png",
            "matched_text_index": 0,
            "matched_sim": 0.3448386490345001,
            "face_detections": null
        }
    ],
    "similarity_matrix": [
        [
            0.33591771125793457,
            0.2377069592475891,
            0.17204634845256805,
            0.22403109073638916
        ],
        [
            0.31495240330696106,
            0.27460938692092896,
            0.12367681413888931,
            0.17759563028812408
        ],
        [
            0.3045308589935303,
            0.2770630717277527,
            0.15680742263793945,
            0.21054978668689728
        ],
        [
            0.3448386490345001,
            0.26175469160079956,
            0.16365793347358704,
            0.237198144197464
        ]
    ]
}

Is that intended?

Thanks for any pointers.
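
One way to look for this pattern in other documents is sketched below. It assumes every document has the `text_list` and `image_info` fields shown above; `find_shared_sentence_matches` is a hypothetical helper, not part of the MMC4 codebase. Documents with more images than sentences are skipped, since the paper allows sharing in that case.

```python
import json
from collections import Counter


def find_shared_sentence_matches(doc):
    """Return the sentence indices that more than one image is matched to.

    Hypothetical helper. Returns an empty list when the document has more
    images than sentences, where sharing a sentence is expected.
    """
    images = doc["image_info"]
    if len(images) > len(doc["text_list"]):
        return []
    counts = Counter(im["matched_text_index"] for im in images)
    return sorted(idx for idx, n in counts.items() if n > 1)


# Trimmed copy of the document above: only the fields the check reads,
# with the four sentences replaced by placeholders.
example = json.loads("""
{
  "text_list": ["s0", "s1", "s2", "s3"],
  "image_info": [
    {"image_name": "202706b24ac4.png", "matched_text_index": 0},
    {"image_name": "a97c58871c38.png", "matched_text_index": 0},
    {"image_name": "ce3c9aa070ce.png", "matched_text_index": 1},
    {"image_name": "c22425c7d977.png", "matched_text_index": 0}
  ]
}
""")

shared = find_shared_sentence_matches(example)
print(shared)
```

To scan a whole shard, the same function could be applied to `json.loads(line)` for each line of the `.jsonl` file.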

jmhessel commented 1 year ago

Hi @fauconnier ! Thanks for pointing this out. Let me check what's going on: this example is not the intended behavior.

jmhessel commented 1 year ago

Taking some public notes... This is how we do the image text assignments:

import numpy as np

# linear_assignment is the module that provides base_solve (shown below).


def get_image_assignments(im2txt):
    '''
    Returns a list `assignments` of length N_images such that assignments[i] is the sentence index that image i was assigned to.
    '''
    # if there are more images than texts, not quite sure what to do...
    # base_solve minimizes cost, so the similarities are negated here.
    im_idxs_s, txt_idxs_s, sol = linear_assignment.base_solve(-im2txt)
    im2txt_idxs = {im_idxs_s[k]: txt_idxs_s[k] for k in range(len(im_idxs_s))}
    if im2txt.shape[0] > im2txt.shape[1]:
        # there are more images than sentences. we don't want to discard images. so, for unassigned images, we will put them with their corresponding max.
        for imidx in range(len(im2txt)):
            if imidx not in im2txt_idxs:
                im2txt_idxs[imidx] = int(np.argmax(im2txt[imidx]))

    return [im2txt_idxs[idx] for idx in range(len(im2txt_idxs))]

where the base solve function is:

import numpy as np
# Assumed import: lapjv from the `lapjv` package, as in the linked bipartite_utils.py.
from lapjv import lapjv


def base_solve(W, max_dummy_cost_value=1000):
    '''
    Gives hungarian solve for a non-square matrix. It's roughly from:

    https://github.com/jmhessel/multi-retrieval/blob/master/bipartite_utils.py

    NOTE: this ** MINIMIZES COST **. So, if you're passing in sims, make sure to negate them!

    returns i_s, j_s, cost such that:
    for i, j in zip(i_s, j_s)

    are the (i, j) row column entries selected.

    cost is sum( cost[i, j] for i, j in zip(i_s, j_s) )

    '''
    if np.sum(np.abs(W)) > max_dummy_cost_value:
        print('Warning, the values in your matrix may be too big, please raise max_dummy_cost_value')

    # Pad the shorter axis with dummy entries so the matrix becomes square.
    orig_shape = W.shape
    if orig_shape[0] != orig_shape[1]:
        if orig_shape[0] > orig_shape[1]:
            pad_idxs = [[0, 0], [0, W.shape[0]-W.shape[1]]]
            col_pad = True
        else:
            pad_idxs = [[0, W.shape[1]-W.shape[0]], [0, 0]]
            col_pad = False
        W = np.pad(W, pad_idxs, 'constant', constant_values=max_dummy_cost_value)

    sol, _, cost = lapjv(W)

    i_s = np.arange(len(sol))
    j_s = sol[i_s]

    sort_idxs = np.argsort(-W[i_s, j_s])
    i_s, j_s = map(lambda x: x[sort_idxs], [i_s, j_s])

    # Drop the pairs that landed on a dummy (padded) row or column.
    if orig_shape[0] != orig_shape[1]:
        if col_pad:
            valid_idxs = np.where(j_s < orig_shape[1])[0]
        else:
            valid_idxs = np.where(i_s < orig_shape[0])[0]
        i_s, j_s = i_s[valid_idxs], j_s[valid_idxs]

    # indices = np.hstack([np.expand_dims(i_s, -1), np.expand_dims(j_s, -1)]).astype(np.int32)
    m_cost = 0.0
    for i, j in zip(i_s, j_s):
        m_cost += W[i, j]

    return i_s, j_s, m_cost

When I run that similarity matrix through this code, I get what I believe to be the correct assignment:

im txt sim
0 0 0.33591771125793457
1 1 0.27460938692092896
2 2 0.15680742263793945
3 3 0.237198144197464

But, this is not reflected in the matched_text_index field or the matched_sim field. I'll need to take a closer look at this shortly.
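
For anyone who wants to cross-check this without lapjv: a 4x4 matrix is small enough to brute-force over all 24 one-to-one assignments. The sketch below is not the code used to build the corpus. It assumes rows of `similarity_matrix` index images and columns index sentences, which is consistent with the `matched_sim` values in the example, and `best_one_to_one` is a hypothetical helper name.

```python
from itertools import permutations

# similarity_matrix copied from the document in question.
SIM = [
    [0.33591771125793457, 0.2377069592475891, 0.17204634845256805, 0.22403109073638916],
    [0.31495240330696106, 0.27460938692092896, 0.12367681413888931, 0.17759563028812408],
    [0.3045308589935303, 0.2770630717277527, 0.15680742263793945, 0.21054978668689728],
    [0.3448386490345001, 0.26175469160079956, 0.16365793347358704, 0.237198144197464],
]


def best_one_to_one(sim):
    """Brute-force the one-to-one assignment with the highest total similarity.

    Cost grows as n!, so this is only a sanity check for tiny square matrices.
    """
    n = len(sim)
    return max(
        permutations(range(n)),
        key=lambda cols: sum(sim[i][cols[i]] for i in range(n)),
    )


best = best_one_to_one(SIM)
stored = [0, 0, 1, 0]  # matched_text_index values in the released document

print("im txt sim")
for i, j in enumerate(best):
    print(i, j, SIM[i][j])
print("stored assignment matches optimum:", list(best) == stored)
```

The brute-force optimum agrees with the table above and differs from the stored `matched_text_index` values.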

jmhessel commented 1 year ago

Hi @fauconnier --- I tracked down the bug! Thanks for reporting it. We didn't notice it because it only affects a subset of documents, and the issue was hard to track down. Here's what's going on:

Thanks for helping us track this down! I can handle the updates and next steps from here --- I'll keep you posted once I've fixed everything.

jmhessel commented 1 year ago

More updates:

I added the script we used to compute assignments, which will now save the correct assignment.

https://github.com/allenai/mmc4/blob/main/scripts/compute_assignments.py

As soon as I can, I will run this over the whole database and update the mmc4 documents in-place.

fauconnier commented 1 year ago

Fantastic. Thank you @jmhessel for the quick turnaround!

jmhessel commented 1 year ago

Hi @fauconnier , I wrote up the fix and am running it over everything. v1.1 of the corpus will be out ASAP. Here's what this doc looks like in the new version :-)

{'image_info': [{'face_detections': None,
                 'image_name': '202706b24ac4.png',
                 'matched_sim': 0.33591771125793457,
                 'matched_text_index': 0,
                 'raw_url': 'https://st.hzcdn.com/static/badge_16_8@2x.png'},
                {'face_detections': None,
                 'image_name': 'a97c58871c38.png',
                 'matched_sim': 0.27460938692092896,
                 'matched_text_index': 1,
                 'raw_url': 'https://st.hzcdn.com/static/badge_34_9@2x.png'},
                {'face_detections': None,
                 'image_name': 'ce3c9aa070ce.png',
                 'matched_sim': 0.15680742263793945,
                 'matched_text_index': 2,
                 'raw_url': 'https://st.hzcdn.com/static/badge_19_9@2x.png'},
                {'face_detections': None,
                 'image_name': 'c22425c7d977.png',
                 'matched_sim': 0.237198144197464,
                 'matched_text_index': 3,
                 'raw_url': 'https://st.hzcdn.com/static/badge_20_9@2x.png'}],

aleSuglia commented 1 year ago

Hey @jmhessel, what is the ETA for this? I hope everything is running smoothly :)

Thanks so much for the prompt response on this!

jmhessel commented 1 year ago

Hi Alessandro --- hope you're well! Actually, the new versions are ready and passing all of my known checks. I was about to push the release, but then I saw https://github.com/allenai/mmc4/issues/12 , so I am checking on that now.

jmhessel commented 1 year ago

Marking this as resolved by https://github.com/allenai/mmc4/pull/13