cvlab-stonybrook / Scanpath_Prediction

Predicting Goal-directed Human Attention Using Inverse Reinforcement Learning (CVPR2020)
MIT License

Data Preparation From Scratch #3

Closed animikhaich closed 4 years ago

animikhaich commented 4 years ago

Hi,

First of all, I'd like to say: brilliant paper and repository! Thank you for your hard work!

I am trying to train on a custom dataset, which contains images and ground-truth scan paths.

However, I am not able to find any script in the repository to generate the following:

  • high-resolution belief maps
  • low-resolution belief maps
  • processed_human_scanpaths_TP_trainval.npy
  • coco_search_annos_512x320.npy

In the sample dataset, they are given as pre-computed files in .pth.tar format.

Hence, could you please guide me on how to create these belief maps from normal RGB input images?

Also, if possible, could you please walk me through the Data Preparation steps as a whole? As in, how to generate these files from raw images and annotations?

It would be very helpful.

Thanks

ghost commented 4 years ago

@ouyangzhibo Is the initial dataset in .csv format, similar to the Microwave-Clock Search Dataset?

ouyangzhibo commented 4 years ago

> @ouyangzhibo Is the initial dataset in .csv format, similar to the Microwave-Clock Search Dataset?

Yes, at least in a similar way. This time we are releasing the dataset in cooperation with the MIT/Tuebingen Saliency Benchmark, so the format will be the same as these saliency datasets.

ouyangzhibo commented 4 years ago

@animikhaich Thanks for the question!

I updated the README to include some guidance on how to create the belief maps; please check that out.

Unfortunately, we are not going to release the scripts for processing the raw dataset, as they are pretty simple and straightforward; you can generate them with a few lines of code. However, these processed files will also be available together with the raw dataset.

Hope this helps!

animikhaich commented 4 years ago

@ouyangzhibo Thank you for the quick reply. That really cleared things up! I have a final set of questions to get a better understanding, please find them below:

  1. What method did you use for the resize? Any specific interpolation method?
  2. From my understanding, in order to extract feature maps, the model needs to already be trained to segment/detect the relevant objects, right? So, for custom objects, we need to re-train the Detectron2 model?
  3. In order to extract the feature maps for each image, which layer's output should I consider?

Thanks

ouyangzhibo commented 4 years ago

@animikhaich Thanks for the questions!

  1. We use torch.nn.functional.interpolate() for the resizing (see the sketch after this list).
  2. You're right. For custom objects that are not in the COCO annotations, you will need to re-train the segmentation network. In our case, we directly use a pre-trained model without further fine-tuning it.
  3. We are not extracting the feature maps. The final outputs of the segmentation network are used as the belief maps.
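
A minimal sketch of such a resize with torch.nn.functional.interpolate(); the tensor shapes and the bilinear mode are illustrative assumptions, not the authors' confirmed settings:

```python
import torch
import torch.nn.functional as F

# Hypothetical example: resize an RGB image tensor to the 320x512 resolution
# mentioned in the paper. interpolate() expects an (N, C, H, W) tensor.
image = torch.rand(1, 3, 480, 640)        # placeholder image batch
resized = F.interpolate(
    image,
    size=(320, 512),                      # target (height, width)
    mode="bilinear",                      # assumption: bilinear interpolation
    align_corners=False,
)
print(resized.shape)                      # torch.Size([1, 3, 320, 512])
```
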
animikhaich commented 4 years ago

@ouyangzhibo, Thank you very much for clarifying everything. I am closing the issue.

ghost commented 4 years ago

> Yes, at least in a similar way. This time we are releasing the dataset in cooperation with the MIT/Tuebingen Saliency Benchmark, so the format will be the same as these saliency datasets.

@ouyangzhibo

  1. Could you post a sample of the raw data? Two images and their .csv files would be enough.

> Unfortunately, we are not going to release the scripts for processing the raw dataset

  1. You mean the scripts to convert .csv to .npy?
  2. Is there a particular reason for not releasing them? If you are not restricted from doing so, could you attach them here for some time?

ghost commented 4 years ago

@ouyangzhibo Also, you didn't mention anything about clusters.npy. What is it, what does it contain, and how is it created?

animikhaich commented 4 years ago

> We are not extracting the feature maps. The final outputs of the segmentation network are used as the belief maps.

@ouyangzhibo I have tried to follow your instructions, but I have hit a roadblock. It would be very helpful if you can guide me.

The biggest challenge that I am facing is the generation of belief maps from raw images.

As per your instructions, I have performed the following tasks:

  1. Used the Panoptic FPN ResNet50 model from Detectron2 to extract features.
  2. In the first attempt, I tried to get the direct output of the detection network, but the panoptic segmentation output was a single-channel grayscale image, shown here: [image]
  3. Since that did not produce the expected results, I followed the method explained in the research paper, specifically the highlighted part (page 6, Implementation Details): [image]
  4. Upon extracting the outputs of the FPN layers, I got feature maps at 5 different stages, with 256 channels each, as shown below: [image]
  5. Since none of these feature maps matched the expected number of channels mentioned in the paper (134 channels = 80 object classes + 54 background classes), I dug deeper and came across an article on Medium explaining the architecture of R-CNN with a ResNet50 backbone.
  6. Unfortunately, that article did not mention any layer output with 134 channels.

Hence, I am unable to generate the HR and LR belief maps from raw images. I would request that you post the supporting code for generating the maps, or guide me in more detail so that I can reproduce the results.

Thanks

ouyangzhibo commented 4 years ago

> 2. In the first attempt, I tried to get the direct output of the detection network, but the panoptic segmentation output was a single-channel grayscale image, shown here: [image]

@animikhaich In fact, you are very close at this step. This single-channel segmentation map is the result of combining all 134 belief maps. What you need to do is go just one step back from the final output of the Panoptic-FPN, and output the belief maps for every 'thing' and 'stuff' category in COCO (134 in total) without combining them.
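
For illustration, a minimal sketch of that step using Detectron2's DefaultPredictor with a COCO Panoptic-FPN config; how the instance masks are merged and weighted here is an assumption, not necessarily the authors' exact pipeline:

```python
import cv2
import torch
import torch.nn.functional as F
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

# Hypothetical sketch: build a (134, H, W) stack of per-category belief maps
# (80 COCO "thing" classes + 54 "stuff" classes) from one RGB image.
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-PanopticSegmentation/panoptic_fpn_R_50_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-PanopticSegmentation/panoptic_fpn_R_50_3x.yaml")
predictor = DefaultPredictor(cfg)

image = cv2.imread("example.jpg")                  # hypothetical input image
outputs = predictor(image)
h, w = image.shape[:2]

num_things, num_stuff = 80, 54
beliefs = torch.zeros(num_things + num_stuff, h, w)

# "Thing" maps: merge all instances of the same class into one map
# (weighting each instance mask by its detection score is an assumption).
instances = outputs["instances"].to("cpu")
for cls, mask, score in zip(instances.pred_classes,
                            instances.pred_masks,
                            instances.scores):
    beliefs[cls] = torch.maximum(beliefs[cls], mask.float() * score)

# "Stuff" maps: per-pixel class probabilities from the semantic-segmentation head.
beliefs[num_things:] = F.softmax(outputs["sem_seg"].to("cpu"), dim=0)
```

The 'stuff' probabilities come from the model's semantic-segmentation head, while the 'thing' maps are aggregated from the predicted instances; the channel ordering should be kept consistent with the category IDs used elsewhere in the repo.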

Let me know if you have other questions.

ouyangzhibo commented 4 years ago

> @ouyangzhibo Also, you didn't mention anything about clusters.npy. What is it, what does it contain, and how is it created?

@deepseek It contains the clusters of the testing fixations, used for computing the sequence score for scanpath evaluation. Please refer to this paper for more information.
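
For context, a rough sketch of how fixation clusters of this kind could be computed; mean-shift clustering is a common choice in the scanpath-similarity literature, but whether it matches how clusters.npy was actually built is an assumption:

```python
import numpy as np
from sklearn.cluster import MeanShift

# Hypothetical example: cluster all test-set fixations for one (image, task) pair
# so that scanpaths can be converted into strings of cluster IDs.
fixations = np.array([[245.5, 128.0],   # placeholder (x, y) fixation locations
                      [250.1, 130.2],
                      [400.7,  60.3],
                      [398.2,  65.9]])

clustering = MeanShift(bandwidth=40.0).fit(fixations)   # bandwidth is an assumption
cluster_ids = clustering.labels_                        # one cluster ID per fixation
centers = clustering.cluster_centers_                   # cluster centroids
print(cluster_ids, centers.shape)
```

Scanpaths can then be mapped to sequences of cluster IDs before being compared.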

animikhaich commented 4 years ago

@ouyangzhibo Thank you for that last bit of information, I have been able to separate the Panoptic Layers.

The panoptic layers are in the form of instance segmentation, so if there are two people in the frame, there are two separate segmentation maps for them.

However, according to the paper, there are a total of 134 channels (80+54), where each channel corresponds to the segmentation map of each object/background.

  1. Does this mean that all instances of a given object are to be merged together in one layer?
  2. If, for example, my custom dataset has only one class and one background, then will the total number of channels reduce to 2 (1+1)?
  3. In the paper it is mentioned that the input images are resized to 320×512, followed by Gaussian blur:
    • Do we resize only for generating the LR belief maps, or for both the LR and HR belief maps? (I understand that the Gaussian blur is applied only for LR belief-map generation.)
    • If my custom dataset has images of a different size, is it necessary to resize them down to the same size, or is it okay to resize them to a different size as well?

I know that's a lot of questions, but it would be really helpful if you could help me clear up these doubts.

Thanks

ouyangzhibo commented 4 years ago

@animikhaich Yes, you need to merge instances of the same category together in one map.

For resizing, all input images are first resized to 320x512. Then a low-res image is obtained by applying a Gaussian blur. The belief maps are computed on the low- and high-res images separately.

Although I do not see any difficulty in using a different image size for extracting the belief maps with Panoptic-FPN, you have to keep the size of the state representation (DCB in our paper) the same. You can think of the input image size (320x512) and the belief-map size (20x32) as hyper-parameters (untested), and choose the input image size according to your computational power and the size of your dataset.
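
A minimal sketch of that preprocessing, assuming OpenCV for the resize and blur; the blur sigma is an assumption, not the authors' exact setting:

```python
import cv2

# Hypothetical example: build the high- and low-resolution inputs whose
# Panoptic-FPN outputs become the HR and LR belief maps.
image = cv2.imread("example.jpg")                        # hypothetical input image
hr_image = cv2.resize(image, (512, 320))                 # cv2 uses (width, height)
lr_image = cv2.GaussianBlur(hr_image, (0, 0), sigmaX=2)  # blur sigma is an assumption

# Both hr_image and lr_image are passed through Panoptic-FPN separately, and the
# resulting 134-channel maps are then downsampled to the 20x32 DCB grid.
```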

animikhaich commented 4 years ago

@ouyangzhibo Thank you, your previous reply was helpful. I have successfully created belief maps. However, on trying to save the belief maps, I came across another issue.

So, the current folder structure for the sample dataset is:

├── DCBs
│   ├── HR
│   │   ├── bottle
│   │   ├── ...
│   │   └── tv
│   └── LR
│       ├── bottle
│       ├── ...
│       └── tv

The output belief map has 134 channels, where each channel corresponds to the merged instance or background mask of one category_id. If I directly save this belief map, it would be one .pth.tar file with all category masks in it.

Then why are there separate folders (bottle, bowl, ..., tv) for each category? And what do the belief maps in each of these category folders contain? From exploring the given data, each belief map inside each object folder has 134 channels.

I think this is one confusing part that is not mentioned in the paper either. It would be very helpful if you could clarify.

ouyangzhibo commented 4 years ago

@animikhaich This is related to our COCO-Search18 dataset. There are 18 categories in our dataset, each selected from the COCO "things" (80 in total). Hence the HR/LR folders contain 18 subfolders, one for each category in our dataset, and each category subfolder contains a number of images belonging to that category. For each image, we extract the 134 belief maps.
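
A minimal sketch of how one image's maps could be written into that layout, assuming the 134-channel tensor from the earlier step; the exact contents of the released .pth.tar files are an assumption:

```python
import os
import torch

# Hypothetical helper: save one image's 134-channel belief maps under the
# DCBs/<HR|LR>/<task-category>/ folder structure described above.
def save_belief_maps(beliefs, root, resolution, task, image_name):
    """beliefs: (134, H, W) tensor; resolution: 'HR' or 'LR'; task: e.g. 'bottle'."""
    out_dir = os.path.join(root, resolution, task)
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, image_name.replace(".jpg", ".pth.tar"))
    torch.save(beliefs, out_path)   # torch.save writes to a .pth.tar path as well
    return out_path

# e.g. save_belief_maps(beliefs, "DCBs", "HR", "bottle", "000000400966.jpg")
```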

animikhaich commented 4 years ago

@ouyangzhibo

{
     'name': '000000400966.jpg',             # image name
     'subject': 2,                          # subject id
     'task': 'microwave',                   # target name
     'condition': 'present',                # target-present or target-absent
     'bbox': [67, 114, 78, 42],             # bounding box of the target object in the image
     'X': array([245.54666667, ...]),       # x-axis of each fixation
     'Y': array([128.03047619, ...]),       # y-axis of each fixation
     'T': array([190,  63, 180, 543]),      # duration of each fixation
     'length': 4,                           # length of the scanpath (i.e., number of fixations)
     'fixOnTarget': True,                   # if the scanpath lands on the target object
     'correct': 1,                          # 1 if the subject correctly located the target; 0 otherwise
     'split': 'train'                       # split of the image {'train', 'valid', 'test'}
 }

Summary of the question: How is the above generated?

Detailed Clarifications:

  • One .pth.tar file may contain maps of multiple classes. How do we set that "task"?
  • What does "subject id" refer to?
  • There is only one entry in "bbox"; what about multiple occurrences of the same object in one image?
  • Is a new dictionary (like above) created for each class/object occurring in the same image? If so, we will have multiple such dictionaries with different "task" and the same "name". Would that be a correct assumption?
  • What is T? How do we generate it? What do you mean by the "duration" of each fixation?
  • If we assume X and Y correspond to the X, Y coordinates of the center point of each instance of the object, then for one "bbox" how can there be multiple X, Y, and T values?
  • Please explain how to generate X, Y, and T.

It would have been more convenient if the data_preprocessing scripts were provided. Re-creating the research becomes extremely difficult since a LOT of the steps are not clear and/or not defined.

ouyangzhibo commented 4 years ago

@animikhaich This is just how we format our dataset. You can check out the details of the COCO-Search18 dataset in the supplementary of our paper and on the dataset page at https://saliency.tuebingen.ai/datasets/COCO-Search18/index_new.html.
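
For illustration, a minimal sketch of how raw per-fixation annotations could be packed into that dictionary format and saved as a .npy file; the input file and its column names are hypothetical, not the authors' scripts:

```python
import numpy as np
import pandas as pd

# Hypothetical example: one row per fixation, grouped into one dict per scanpath.
raw = pd.read_csv("raw_fixations.csv")   # assumed columns: name, subject, task,
                                         # condition, bbox_x, bbox_y, bbox_w, bbox_h,
                                         # fix_x, fix_y, fix_duration, correct, split

scanpaths = []
for (name, subject, task), group in raw.groupby(["name", "subject", "task"]):
    first = group.iloc[0]
    bbox = [int(first.bbox_x), int(first.bbox_y), int(first.bbox_w), int(first.bbox_h)]
    on_target = ((group.fix_x >= bbox[0]) & (group.fix_x <= bbox[0] + bbox[2]) &
                 (group.fix_y >= bbox[1]) & (group.fix_y <= bbox[1] + bbox[3])).any()
    scanpaths.append({
        "name": name,
        "subject": int(subject),
        "task": task,
        "condition": first.condition,
        "bbox": bbox,
        "X": group.fix_x.to_numpy(),
        "Y": group.fix_y.to_numpy(),
        "T": group.fix_duration.to_numpy(),
        "length": len(group),
        "fixOnTarget": bool(on_target),
        "correct": int(first.correct),
        "split": first.split,
    })

# Load later with np.load(..., allow_pickle=True).
np.save("processed_human_scanpaths_TP_trainval.npy", scanpaths)
```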

Doch88 commented 3 years ago

Hi @animikhaich, could you share the script that you used to extract the belief maps?

I'm having problems: for me, the segmentation output of the Panoptic Segmentation FPN from Detectron2 is composed of 54 layers, which I think are the "stuff" classes, and I'm not able to find the other 80.