facebookresearch / vissl

VISSL is FAIR's library of extensible, modular and scalable components for SOTA Self-Supervised Learning with images.
https://vissl.ai
MIT License

How to get names of images from idx of indexes_all in DC_V2? #423

Closed DC95 closed 2 years ago

DC95 commented 3 years ago

❓How to get names of images from IDX of indexes_all in DC_V2?

Dear @iseessel, @QuentinDuval

Introduction -

Since deepclusterv2_loss.py uses only the loss_config, I cannot apply what is suggested in #401, which relies on other pieces of information from the main config, for example self.data_sources.

I understand deepclusterv2_loss.py does not require the rest of the config information for the loss calculation.

What have I done and understand -

  1. For evaluation of the clusters, I have dumped the assignments and all index values corresponding to the assignments (example attached: assignments_indexes).
  2. For data loading, I see the dataloader is used here, _but I don't know how to get image names from the data loader_.

Question -

  1. Can all the configurations somehow be made available so that I can use what is suggested in #401?

Cheers, DC

iseessel commented 3 years ago

Hi there @DC95 -- I do agree it would be hard to extract the image names for the clusters while they are training.

However, it is easy to do so after the fact. You can use https://github.com/facebookresearch/vissl/issues/401 to save the image_paths, e.g. after this is called: https://github.com/facebookresearch/vissl/blob/87ef9a19c193dc91add9dd56bee1b9b701693672/vissl/data/ssl_dataset.py#L313

Then you can match the index of this array to the index that you have in your indices.
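As a rough sketch of that matching step (the arrays and file names here are made-up placeholders, not VISSL output):

```python
# Hypothetical stand-ins: image_paths as dumped from the dataset (in
# dataset order), plus the index/assignment arrays saved during training.
image_paths = ["img_0.png", "img_1.png", "img_2.png", "img_3.png"]
indexes = [2, 0, 3]      # dataset indices seen during training
assignments = [1, 0, 1]  # cluster id for each entry of `indexes`

# Map each saved index back to its file name and cluster.
cluster_of = {image_paths[i]: a for i, a in zip(indexes, assignments)}
print(cluster_of)  # {'img_2.png': 1, 'img_0.png': 0, 'img_3.png': 1}
```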

Which way of loading data are you using: DiskFilelist, DiskFolder, or Torchvision?

DC95 commented 3 years ago

I have images in a folder (train and test separate) and I register them in dataset_catalog.json. So I think it's DiskFolder.

iseessel commented 3 years ago

Got it -- then the above should work -- let me know if I'm not being clear!

DC95 commented 2 years ago

Dear @iseessel,

As suggested by you, I looked into the file. But I have some questions-

  1. I have unlabelled satellite images. I have applied DC_v2, and now I have the cluster ids and corresponding image indices. But I want to get the image file names.

Can you kindly tell me once more how to proceed with the solution? Maybe I misunderstood you.

iseessel commented 2 years ago

Something like the following should work:

    def __getitem__(self, idx: int):
        """
        Get the input sample for the minibatch for a specified data index.
        For each data object (if we are loading several datasets in a minibatch),
        we get the sample: consisting of {
            - image data,
            - label (if applicable) otherwise idx
            - data_valid: 0 or 1 indicating if the data is valid image
            - data_idx : index of the data in the dataset for book-keeping and debugging
        }
        Once the sample data is available, we apply the data transform on the sample.
        The final transformed sample is returned to be added into the minibatch.
        """

        if not self._labels_init and len(self.label_sources) > 0:
            self._load_labels()
            self._labels_init = True

        from vissl.utils.io import save_file
        save_file(self.label_paths(), "/path/to/labels.json")

This file loads the images and optionally the labels if you have them.
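If a dump like this is placed inside `__getitem__`, a simple guard keeps it from rewriting the file on every sample; a toy sketch (the class and attribute names are illustrative, not VISSL code):

```python
class DatasetSketch:
    """Toy dataset that dumps its path list exactly once."""

    def __init__(self, image_paths):
        self.image_paths = image_paths
        self._paths_saved = False
        self.save_calls = 0  # counts how often the dump actually runs

    def __getitem__(self, idx):
        if not self._paths_saved:
            # In VISSL this would be a save_file(...) call instead.
            self.save_calls += 1
            self._paths_saved = True
        return self.image_paths[idx]

ds = DatasetSketch(["a.png", "b.png", "c.png"])
for i in range(3):
    _ = ds[i]
print(ds.save_calls)  # 1
```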

DC95 commented 2 years ago

Thanks, let me try :)

DC95 commented 2 years ago

Dear @iseessel

I am sorry to report but,

I have tried both save_file(self.label_paths, "/path/to/label_path.txt") (txt file attached) and save_file(self.data_paths, "/path/to/data_path.txt") (txt file attached).

With self.label_paths the result is an empty list, and with self.data_paths there are more entries than the number of training files.

Kind note: if I use self.label_paths() I get TypeError: 'list' object is not callable, which is why I removed the brackets.

data_path.txt label_path.txt

iseessel commented 2 years ago

Sorry about that -- I had a typo in my original. Can you try something like:

    def __getitem__(self, idx: int):
        """
        Get the input sample for the minibatch for a specified data index.
        For each data object (if we are loading several datasets in a minibatch),
        we get the sample: consisting of {
            - image data,
            - label (if applicable) otherwise idx
            - data_valid: 0 or 1 indicating if the data is valid image
            - data_idx : index of the data in the dataset for book-keeping and debugging
        }
        Once the sample data is available, we apply the data transform on the sample.
        The final transformed sample is returned to be added into the minibatch.
        """

        if not self._labels_init and len(self.label_sources) > 0:
            self._load_labels()
            self._labels_init = True

        from vissl.utils.io import save_file
        save_file(self.get_image_paths(), "/path/to/labels.json")

If it doesn't work out of the box, look at the __getitem__ method in disk_dataset.py, where we get the image based on the idx.

You can see that

        image_path = self.image_dataset[idx]
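For context, with a torchvision-style folder dataset, indexing returns the decoded image (so the path is gone), while the `.samples` attribute keeps the `(path, target)` pairs. A minimal stand-in to illustrate the shape (this class is a toy, not the real `ImageFolder`):

```python
class FakeImageFolder:
    """Mimics torchvision.datasets.ImageFolder's indexing behaviour."""

    def __init__(self, samples):
        self.samples = samples  # list of (path, class_index) tuples

    def __getitem__(self, idx):
        path, target = self.samples[idx]
        # Real ImageFolder decodes the file here; we stub the loading out.
        return f"<decoded image from {path}>", target

ds = FakeImageFolder([("a.png", 0), ("b.png", 1)])
print(ds[1])          # the decoded image and target -- no path left
print(ds.samples[1])  # ('b.png', 1) -- the path is recoverable here
```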
DC95 commented 2 years ago

Dear @iseessel

I followed your suggestion but -

  1. In disk_dataset.py, line 139, image_path = self.image_dataset[idx] contains information like (<PIL.Image.Image image mode=RGB size=128x128 at 0x2B69DB5866A0>, 0) for each image, but this does not contain the information I expected.

  2. One more interesting thing: I looked at the idx values passed to __getitem__() in disk_dataset.py. It stores almost twice as many idx values as the training-set size; each idx is present twice.

  3. In ssl_dataset.py, saving self.get_image_paths() as you suggested with save_file(self.get_image_paths(), "/path/to/labels.json") makes the program run indefinitely.

  4. I also observed that def get_image_paths(self): in ssl_dataset.py has the data paths, but not in the order in which they are called.

Regards, DC

iseessel commented 2 years ago

@DC95 Can you post the config you are training, your dataset catalog, and the code you are using to save the cluster assignments and indexes above?

Re your points

2 -> Can you be more specific. What do you mean by this?

3 -> You need to call this only once. You can use something like self.saved_results

4 -> Yes, that is correct. I would recommend saving the assignment indexes not in the order in which they are called, but instead keyed by inputs["data_idx"], e.g. here: https://github.com/facebookresearch/vissl/blob/master/vissl/losses/deepclusterv2_loss.py#L117
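A sketch of what saving keyed by inputs["data_idx"] means (the batch contents here are invented for illustration):

```python
# Write each assignment into a slot chosen by the dataset index, so the
# result is in dataset order regardless of the order batches arrive in.
num_samples = 6
assignments = [-1] * num_samples  # -1 marks "not yet assigned"

# Hypothetical minibatches: (inputs["data_idx"], cluster assignments).
batches = [([4, 1], [0, 2]), ([0, 5], [1, 0]), ([3, 2], [2, 1])]
for data_idx, clusters in batches:
    for i, c in zip(data_idx, clusters):
        assignments[i] = c  # position by dataset index, not call order

print(assignments)  # [1, 2, 1, 2, 0, 0]
```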

DC95 commented 2 years ago

Hi @iseessel,

I am posting the files asked by you

More explanation about point 2 from the above post: if I save the idx values for a run of only 1 epoch,

    image_path = self.image_dataset[idx]
    save_file(idx, "/p/project/deepacf/kiste/DC/juelich_2x85_128x128_15k/checkpoints_train_400ep_exp/disk_dataset_getitem_idx1.json")

then, for a training dataset of 15077 images, I get more than 15077 idx values.

dcv2_2x85_rnet_128x128_Juelich.txt deepclusterv2_loss.txt dataset_catalog.txt

Regards, DC

iseessel commented 2 years ago

Thanks for the info @DC95!

I was able to get my hands dirty and this is what I found that works -- check this commit: https://github.com/iseessel/vissl/commit/da35a50113feb16884e9a87d72e4ebb27a4bd75d

The indexes of samples.npy should match the indexes of the assignments.
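Once both files exist, joining them is a position-wise zip; a sketch with made-up stand-ins (in practice the two lists would come from np.load on samples.npy and the saved assignments):

```python
from collections import defaultdict

samples = [("img_a.png", 0), ("img_b.png", 0), ("img_c.png", 0)]  # (path, target)
assignments = [2, 0, 2]  # cluster id for the sample at the same position

# Group file names by cluster id.
clusters = defaultdict(list)
for (path, _), cluster_id in zip(samples, assignments):
    clusters[cluster_id].append(path)

print(dict(clusters))  # {2: ['img_a.png', 'img_c.png'], 0: ['img_b.png']}
```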


Now, regarding:

One more thing interesting is that I looked at the IDX values getting called in def getitem() of disk_dataset.py. It is storing IDX values almost twice the number of training size, each IDX is twice present there.

This should not be happening and I was unable to repro it with a test dataset of 100 train and test images. For this:

  1. Can you please verify the number of images in your folder: /p/project/deepacf/kiste/DC/dataset/hope2013_128x128_image/train_juelich_128x128_15k.
  2. Can you run your config with my above commit and verify whether the indexes are indeed more than expected.
  3. If 1/2 still fail, can you send the output of samples.npy.
DC95 commented 2 years ago

Dear @iseessel

Thanks for jumping into the mud. I agree that the commit works nicely, especially the line save_file(self.image_dataset.samples, "/private/home/iseessel/samples.npy"). I am attaching the output samples.npy in a zip.

samples.zip

I have one thing to ask -

I will send more analysis of this by tomorrow or Monday.

I am also attaching the assignment and indexes in zip. assignment_indexes.zip .

iseessel commented 2 years ago

Yep you are right -- they do match!

I took a look at your attached files and they all match what I would expect!

DC95 commented 2 years ago

Well, then the journey of finding image_names, indexes, and assignments comes to an end, I see. Thanks, @iseessel, very much for the constant support.

Cheers, DC

iseessel commented 2 years ago

feel free to open back up if any probs!