aimagelab / meshed-memory-transformer

Meshed-Memory Transformer for Image Captioning. CVPR 2020
BSD 3-Clause "New" or "Revised" License

Generating HDF5 detections from custom dataset or bottom-up-attention TSV #49

Closed SandroJijavadze closed 3 years ago

SandroJijavadze commented 3 years ago

I have a custom dataset, and I have generated the detections TSV using https://github.com/airsplay/py-bottom-up-attention, but the model requires HDF5.

The TSV has these fields for each example:

{
   'image_id': image_id,
   'image_h': np.size(im, 0),
   'image_w': np.size(im, 1),
   'num_boxes' : len(keep_boxes),
   'boxes': base64.b64encode(cls_boxes[keep_boxes]),
   'features': base64.b64encode(pool5[keep_boxes])
}  

When examining the COCO dataset file, I see entries like the following:

>>> dts["35368_boxes"]
<HDF5 dataset "35368_boxes": shape (37, 4), type "<f4">
>>> dts["35368_features"]
<HDF5 dataset "35368_features": shape (37, 2048), type "<f4">
>>> dts["35368_cls_prob"]
<HDF5 dataset "35368_cls_prob": shape (37, 1601), type "<f4">
>>> dts["35368_boxes"][36]
array([349.57147, 154.07967, 420.0327 , 408.64462], dtype=float32)
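
(For reference, dts above is an h5py File handle; assuming the precomputed coco_detections.hdf5 file mentioned in the M2T README, it would be opened like this:)

import h5py

# open the precomputed COCO detections file in read-only mode
dts = h5py.File('coco_detections.hdf5', 'r')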

I'll try to figure out how to convert my TSV to the required HDF5 myself from the code, but a guide would be appreciated.

Thank you.

hcwei13 commented 3 years ago

> (quoting the original post above)

Have you solved this problem?

SandroJijavadze commented 3 years ago

@whongchen No, unfortunately. I am going to try to figure out the process myself this week and will give an update if I do. Please comment if you find anything useful.

eugeniotonanzi commented 3 years ago

I'm working on this too. I still haven't done it myself, but I think you just need to convert the TSV into an HDF5 file; it has nothing to do with the M2T or py-bottom-up-attention code. You read your TSV using csv or pandas, then use a library like h5py to store your data in HDF5 format under the dataset names "<image_id>_boxes", "<image_id>_features" and "<image_id>_cls_prob", containing the bounding-box corners, feature vectors and class probabilities, as specified in the M2T repo README file. I believe it would be straightforward; I don't know how much time it would take. Let me know if you manage to do it.
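
Something along these lines should work as a minimal sketch. The file names are placeholders, and the float32 dtype, the 2048-dimensional features and the field order are assumptions based on the TSV fields shown above; adjust them to whatever your extractor actually wrote (note the TSV above has no cls_prob field, so only boxes and features are written here):

import base64
import csv
import sys

import h5py
import numpy as np

FIELDNAMES = ['image_id', 'image_h', 'image_w', 'num_boxes', 'boxes', 'features']
csv.field_size_limit(sys.maxsize)  # the base64 feature columns are very long

with open('detections.tsv') as tsvfile, h5py.File('detections.hdf5', 'w') as out:
    reader = csv.DictReader(tsvfile, delimiter='\t', fieldnames=FIELDNAMES)
    for row in reader:
        n = int(row['num_boxes'])
        # decode the base64-encoded float32 buffers back into arrays
        boxes = np.frombuffer(base64.b64decode(row['boxes']),
                              dtype=np.float32).reshape(n, 4)
        features = np.frombuffer(base64.b64decode(row['features']),
                                 dtype=np.float32).reshape(n, 2048)
        out.create_dataset('%s_boxes' % row['image_id'], data=boxes)
        out.create_dataset('%s_features' % row['image_id'], data=features)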

MatteoStefanini commented 3 years ago

Hi everyone, thank you @eugeniotonanzi for your answer; that should solve the problem exactly. Once you have an hdf5 file for your custom dataset in the same format, the model should work as expected. Let us know if you have any other issues. Best, Matteo

SandroJijavadze commented 3 years ago

That solved it, closing this issue. Thank you.

hwbhwbgao commented 2 years ago

> That solved it, closing this issue. Thank you.

Have you solved this problem? Would it be convenient to release the relevant code? Thank you!

ksz-creat commented 2 years ago

> That solved it, closing this issue. Thank you.

Hi, have you solved this problem? Would it be convenient to release the relevant code? Thank you very much.

MikeMACintosh commented 2 years ago

@eugeniotonanzi thanks for your advice; I'm working on it right now, but maybe you've already implemented it?

SandroJijavadze commented 2 years ago

@hwbhwbgao @ksz-creat @MikeMACintosh I didn't see your replies. Unfortunately I can't share the whole code, but I will share the relevant bits. I modified two methods in https://github.com/peteanderson80/bottom-up-attention:

def get_detections_from_im(net, im_file, image_id, conf_thresh=0.2):
    im = cv2.imread(im_file)
    scores, boxes, attr_scores, rel_scores = im_detect(net, im)

    # Keep the original boxes, don't worry about the regression bbox outputs
    rois = net.blobs['rois'].data.copy()
    # unscale back to raw image space
    blobs, im_scales = _get_blobs(im, None)

    cls_boxes = rois[:, 1:5] / im_scales[0]
    cls_prob = net.blobs['cls_prob'].data
    pool5 = net.blobs['pool5_flat'].data

    # Keep only the best detections
    max_conf = np.zeros((rois.shape[0]))
    for cls_ind in range(1,cls_prob.shape[1]):
        cls_scores = scores[:, cls_ind]
        dets = np.hstack((cls_boxes, cls_scores[:, np.newaxis])).astype(np.float32)
        keep = np.array(nms(dets, cfg.TEST.NMS))
        max_conf[keep] = np.where(cls_scores[keep] > max_conf[keep], cls_scores[keep], max_conf[keep])

    keep_boxes = np.where(max_conf >= conf_thresh)[0]
    if len(keep_boxes) < MIN_BOXES:
        keep_boxes = np.argsort(max_conf)[::-1][:MIN_BOXES]
    elif len(keep_boxes) > MAX_BOXES:
        keep_boxes = np.argsort(max_conf)[::-1][:MAX_BOXES]
    # build the HDF5 key prefix from the digit tokens of image_id (leading zeros stripped)
    featureid = "".join([s.lstrip("0") for s in image_id.split() if s.isdigit()])
    num_boxes = len(keep_boxes)
    cls_boxes = cls_boxes[keep_boxes].reshape((num_boxes, 4))
    cls_features = pool5[keep_boxes].reshape(num_boxes, 2048)
    cls_prob = cls_prob[keep_boxes].reshape(num_boxes, 1601)

    return (featureid + "_boxes", cls_boxes), (featureid + "_features", cls_features), (featureid + "_cls_prob", cls_prob)

https://github.com/peteanderson80/bottom-up-attention/blob/master/tools/generate_tsv.py#L140

def generate_hdf5(gpu_id, prototxt, weights, image_ids, outfile):
    wanted_ids = set([int(image_id[1]) for image_id in image_ids])
    # the resume logic from generate_tsv.py is stripped here, so nothing is
    # marked as already written
    found_ids = set()

    missing = wanted_ids - found_ids
    if len(missing) == 0:
        print 'GPU {:d}: already completed {:d}'.format(gpu_id, len(image_ids))
    else:
        print 'GPU {:d}: missing {:d}/{:d}'.format(gpu_id, len(missing), len(image_ids))
    if len(missing) > 0:
        caffe.set_mode_gpu()
        caffe.set_device(gpu_id)
        net = caffe.Net(prototxt, caffe.TEST, weights=weights)
        with h5py.File(outfile, 'w') as h5pyfile:
            # writer = csv.DictWriter(tsvfile, delimiter = '\t', fieldnames = FIELDNAMES)
            _t = {'misc' : Timer()}
            count = 0
            for im_file,image_id in image_ids:
                if int(image_id) in missing:
                    _t['misc'].tic()
                    boxes, features, probabilities = get_detections_from_im(net, im_file, image_id)
                    h5pyfile.create_dataset(boxes[0], data=boxes[1])
                    h5pyfile.create_dataset(features[0], data=features[1])
                    h5pyfile.create_dataset(probabilities[0], data=probabilities[1])
                    if (count % 100) == 0:
                        print 'GPU {:d}: {:d}/{:d} {:.3f}s (projected finish: {:.2f} hours)' \
                              .format(gpu_id, count+1, len(missing), _t['misc'].average_time,
                              _t['misc'].average_time*(len(missing)-count)/3600)
                    count += 1

Also, depending on how you have arranged your data, you will need to modify the "load_image_ids" method; a hypothetical driver call is sketched below.
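
For example (all paths, model files and IDs below are placeholders for a custom dataset; generate_hdf5 consumes (im_file, image_id) pairs, which is what load_image_ids returns in generate_tsv.py):

# hypothetical driver sketch with placeholder paths
image_ids = [('/data/custom/images/000001.jpg', '1'),
             ('/data/custom/images/000002.jpg', '2')]
generate_hdf5(0, 'path/to/test.prototxt', 'path/to/resnet101_faster_rcnn_final.caffemodel',
              image_ids, 'custom_detections.hdf5')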

You can use this docker image for environment: https://hub.docker.com/r/airsplay/bottom-up-attention

hwbhwbgao commented 2 years ago

Thank you very much!

Dufresue commented 8 months ago

> (quoting @SandroJijavadze's reply above)

Thank you very much for the work you did. By the way, I am not familiar with Docker; could you please tell me how to use the Docker image you provided, and what I should modify? Looking forward to your reply!