Closed: DC95 closed this issue 2 years ago.
Hi there @DC95 -- I do agree it would be hard to extract the image names for the clusters while they are training.
However, it is easy to do so after the fact. You can use https://github.com/facebookresearch/vissl/issues/401 to save the image_paths, e.g. after this is called: https://github.com/facebookresearch/vissl/blob/87ef9a19c193dc91add9dd56bee1b9b701693672/vissl/data/ssl_dataset.py#L313
Then you can match the index of this array to the index that you have in indices.
Which way of loading data are you using? disk_filelist, disk_folder, or torchvision?
I have images in a folder (train and test separate) and I register them in dataset_catalog.json, so I think it's disk_folder.
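For reference, my disk_folder entry in dataset_catalog.json looks roughly like this (the dataset name and paths are placeholders; the exact schema is in the VISSL docs):

```json
{
    "my_folder_dataset": {
        "train": ["/path/to/train", "<unused_label_path>"],
        "val": ["/path/to/test", "<unused_label_path>"]
    }
}
```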
Got it -- then the above should work -- let me know if I'm not being clear!
Dear @iseessel,
As you suggested, I looked into the file, but I have some questions:
What I understood is that `self._load_labels()` (line 313) is meant for loading labels (if the dataset has any).
But maybe I didn't tell you before -
Can you kindly tell me once more how to proceed with the solution? Maybe I misunderstood you.
Something like the following should work:
```python
def __getitem__(self, idx: int):
    """
    Get the input sample for the minibatch for a specified data index.
    For each data object (if we are loading several datasets in a minibatch),
    we get the sample: consisting of {
        - image data,
        - label (if applicable) otherwise idx
        - data_valid: 0 or 1 indicating if the data is valid image
        - data_idx : index of the data in the dataset for book-keeping and debugging
    }
    Once the sample data is available, we apply the data transform on the sample.
    The final transformed sample is returned to be added into the minibatch.
    """
    if not self._labels_init and len(self.label_sources) > 0:
        self._load_labels()
        self._labels_init = True

    from vissl.utils.io import save_file
    save_file(self.label_paths(), "/path/to/labels.json")
```
This method loads the images and, optionally, the labels if you have them.
Thanks, let me try :)
Dear @iseessel
I am sorry to report, but I have tried both:

```python
save_file(self.label_paths, "/path/to/label_path.txt")
```

(txt file attached)

and

```python
save_file(self.data_paths, "/path/to/data_path.txt")
```

(txt file attached)
With `self.label_paths` I am getting empty lists, and more lists than the number of training files.
Kind note: if I use `self.label_paths()` I get `TypeError: 'list' object is not callable`, which is why I removed the brackets.
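A minimal illustration of why that happens (this class and its names are made up, not from VISSL): `label_paths` is a plain list attribute, so adding parentheses tries to call the list itself.

```python
class FakeDataset:
    """Toy stand-in to show attribute vs. method access."""

    def __init__(self):
        # label_paths is a plain list attribute: access it without ()
        self.label_paths = []

    def get_image_paths(self):
        # a real method: calling it with () is correct
        return ["img_0.png", "img_1.png"]


ds = FakeDataset()
print(ds.label_paths)        # [] -- attribute access works
print(ds.get_image_paths())  # ['img_0.png', 'img_1.png']

try:
    ds.label_paths()         # calling a list raises the error above
except TypeError as err:
    print(err)               # 'list' object is not callable
```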
Sorry about that -- I had a typo in my original. Can you try something like:
```python
def __getitem__(self, idx: int):
    """
    Get the input sample for the minibatch for a specified data index.
    For each data object (if we are loading several datasets in a minibatch),
    we get the sample: consisting of {
        - image data,
        - label (if applicable) otherwise idx
        - data_valid: 0 or 1 indicating if the data is valid image
        - data_idx : index of the data in the dataset for book-keeping and debugging
    }
    Once the sample data is available, we apply the data transform on the sample.
    The final transformed sample is returned to be added into the minibatch.
    """
    if not self._labels_init and len(self.label_sources) > 0:
        self._load_labels()
        self._labels_init = True

    from vissl.utils.io import save_file
    save_file(self.get_image_paths(), "/path/to/labels.json")
```
If it doesn't work out of the box, look at the `__getitem__` method in disk_dataset.py. Here we get the image based on the idx. You can see that:

```python
image_path = self.image_dataset[idx]
```
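To make that concrete: `__getitem__` returns the loaded image, so the path is already gone by then, while the dataset's `.samples` attribute still holds the `(path, class)` pairs. A toy sketch (the class and paths here are invented stand-ins for an ImageFolder-style dataset):

```python
class FakeImageFolder:
    """Invented stand-in for a torchvision-ImageFolder-style dataset."""

    def __init__(self):
        # .samples keeps (path, class_index) pairs for every image
        self.samples = [("train/cat/img_0.png", 0), ("train/dog/img_1.png", 1)]

    def __getitem__(self, idx):
        path, target = self.samples[idx]
        # the loaded image object no longer carries its file path
        return f"<PIL image from {path}>", target


image_dataset = FakeImageFolder()
idx = 1
image, target = image_dataset[idx]    # (image, label): no path in here
path, _ = image_dataset.samples[idx]  # but .samples still has it
print(path)  # train/dog/img_1.png
```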
Dear @iseessel
I followed your suggestion, but in disk_dataset.py, line 139,

```python
image_path = self.image_dataset[idx]
```

returns something like `(<PIL.Image.Image image mode=RGB size=128x128 at 0x2B69DB5866A0>, 0)` for each image, which does not contain the path information I expected.
One more interesting thing: I looked at the idx values being passed to `__getitem__()` in disk_dataset.py. It records almost twice as many idx values as the training-set size; each idx is present twice.
In ssl_dataset.py, saving `self.get_image_paths()` as you suggested, with `save_file(self.get_image_paths(), "/path/to/labels.json")`, keeps the program running indefinitely.
What I observed is that `get_image_paths()` in ssl_dataset.py has the data-path information, but not in the order in which the samples are being called.
Regards, DC
@DC95 Can you post the config you are training with + your dataset catalog + the code you are using to save the cluster assignments + the indexes above.
Re: your points:
2 -> Can you be more specific? What do you mean by this?
3 -> You need to call this only once. You can use something like self.saved_results
4 -> Yes that is correct. I would recommend saving the assignment indexes, not in the order in which they are called, but instead based on inputs["data_idx"]. e.g. here: https://github.com/facebookresearch/vissl/blob/master/vissl/losses/deepclusterv2_loss.py#L117
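A minimal sketch of that idea (function and variable names are hypothetical, not VISSL API): accumulate the assignments in a dict keyed by `data_idx`, so the order in which batches arrive stops mattering.

```python
assignments_by_idx = {}


def record_batch(data_idx, cluster_ids):
    """Store each sample's cluster assignment under its dataset index."""
    for idx, cid in zip(data_idx, cluster_ids):
        assignments_by_idx[int(idx)] = int(cid)


# two minibatches arriving in arbitrary order
record_batch([3, 1], [7, 2])
record_batch([0, 2], [5, 7])

# sorted by dataset index, ready to join with the saved image paths
print(sorted(assignments_by_idx.items()))
# [(0, 5), (1, 2), (2, 7), (3, 7)]
```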
Hi @iseessel,
I am posting the files asked by you
#################################
changed code
##################################
More explanation about point 2 from the above post:
If I save the idx values for a run of only 1 epoch:

```python
image_path = self.image_dataset[idx]
save_file(idx, "/p/project/deepacf/kiste/DC/juelich_2x85_128x128_15k/checkpoints_train_400ep_exp/disk_dataset_getitem_idx1.json")
```

then, for a training image dataset of 15,077, I am getting more than 15,077 idx values.
dcv2_2x85_rnet_128x128_Juelich.txt deepclusterv2_loss.txt dataset_catalog.txt
Regards, DC
Thanks for the info @DC95!
I was able to get my hands dirty, and this is what I found works -- check this commit: https://github.com/iseessel/vissl/commit/da35a50113feb16884e9a87d72e4ebb27a4bd75d
The indexes of samples.npy
should match the indexes of the assignments.
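Joining the two is then a positional zip. A sketch with inlined stand-ins (in practice `samples` would come from `np.load("samples.npy", allow_pickle=True)` and `assignments` from your saved loss-side array; the paths and cluster ids here are made up):

```python
import numpy as np

# stand-ins for the two saved arrays
samples = np.array([("img_0.png", 0), ("img_1.png", 1), ("img_2.png", 0)],
                   dtype=object)
assignments = np.array([7, 2, 7])  # cluster id for each dataset index

# positions line up, so a zip gives path -> cluster
path_to_cluster = {path: int(cid)
                   for (path, _), cid in zip(samples, assignments)}
print(path_to_cluster)
# {'img_0.png': 7, 'img_1.png': 2, 'img_2.png': 7}
```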
Now, regarding:
> One more interesting thing: I looked at the idx values being passed to `__getitem__()` in disk_dataset.py. It records almost twice as many idx values as the training-set size; each idx is present twice.
This should not be happening, and I was unable to reproduce it with a test dataset of 100 train and test images. For this:
/p/project/deepacf/kiste/DC/dataset/hope2013_128x128_image/train_juelich_128x128_15k
samples.npy (attached)
Dear @iseessel
Thanks for jumping into the mud. I agree that the commit works nicely, especially the line `save_file(self.image_dataset.samples, "/private/home/iseessel/samples.npy")`.
I am attaching the output samples.npy in a zip.
I have one thing to ask -
I will send more analysis of this by tomorrow or Monday.
I am also attaching the assignments and indexes in a zip: assignment_indexes.zip
Yep you are right -- they do match!
I took a look at your attached files and they all match what I would expect!
Well, then the journey of finding image names, indexes, and assignments comes to an end, I see. Thank you very much, @iseessel, for the constant support.
Cheers, DC
Feel free to open this back up if there are any problems!
❓How to get names of images from IDX of indexes_all in DC_V2?
Dear @iseessel, @QuentinDuval
Introduction -
Since deepclusterv2_loss.py uses only the loss_config, I cannot use what is suggested in #401, which relies on other pieces of information from the main config, such as self.data_sources.
I understand that deepclusterv2_loss.py does not require the rest of the config information for the loss calculation.
What have I done and understand -
Question -
Cheers, DC