konioyxgq opened this issue 4 years ago
If your grid of dropped cells is very coarse and there are thus very few cells to drop, sampled grids can and often will be identical to each other. For finer grids with many cells there is still a theoretical chance of sampling the same grid twice, though it may be low, depending on the chosen drop probability per cell. (I guess a very low drop probability will still often lead to the same grid, even at fine grid sizes.)
So if this happens for fine grids and probabilities around 50% (and especially repeats with every batch) then that would indicate that there is something wrong with the implementation or the way that you use the operation (or set your seeds/random states).
These are the parameters I set: iaa.CoarseDropout((0.0, 0.05), size_percent=(0.02, 0.25)). Is it plausible to get the same dropout area in this situation?
size_percent is the size of the grid in which cells are marked as dropped, relative to the original image. I.e. in the worst case you can get a grid that is 2% of the original image size, with 0% of all cells marked as "dropped", which would obviously lead to the same pattern for all images (nothing dropped = no change to the images).
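To make the mechanics concrete, here is a rough NumPy sketch of how coarse dropout works in principle (this mirrors the general idea, not imgaug's exact implementation): a small mask is sampled at the reduced grid size and then upscaled to the image size, so whole rectangular regions are kept or dropped together.

```python
import numpy as np

rng = np.random.default_rng(0)
image = np.ones((64, 64))
size_percent = 0.135  # mean of (0.02, 0.25)
p_drop = 0.025        # mean of (0.0, 0.05)

grid = int(round(64 * size_percent))       # ~9x9 grid of coarse cells
keep = rng.random((grid, grid)) >= p_drop  # False = cell is dropped
scale = -(-64 // grid)                     # ceil(64 / grid)
# nearest-neighbour upscale of the coarse mask to the full image size
mask_full = np.kron(keep, np.ones((scale, scale)))[:64, :64]
dropped = image * mask_full                # dropped cells become 0
print(dropped.shape)  # (64, 64)
```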
In the mean case, the grid size will be around 13% of the image and around 2.5% of the cells will be dropped. For a 64x64 image that is a grid around 8.3 x 8.3 in size, i.e. around 532 values, resulting in a probability of sampling a sequence of only "keep this cell" (the most likely sequence) of around (1-0.025)^532 ≈ 0.00014%. So, fairly unlikely. But you would have to adjust this a bit, as you are really searching for any pair of matching sequences in a batch, and you might not diligently compare all grid cells, so tiny mismatches would still look identical to you.
If we look not at the mean case but at a less common one, e.g. a sampled drop probability of 0.75% and a size_percent of 4%, we get around 163 grid cells, with a sequence of only "keep this cell" having probability (1-0.0075)^163 = 29.3%. That's already fairly likely. If you go down further, to a grid size of say 5x5, then sampling the same grid twice becomes very likely, as there are only 25 coin flips left to do (and with a very biased coin at that).
In summary it depends on your image sizes, how many images you visually compare, whether the drop pattern is really exactly the same and how fine the grid is. For coarse grid sizes (e.g. 5x5) it is quite likely to sample multiple times the same pattern. But for very fine patterns (e.g. 40x40) it is quite unlikely, especially if there are many dropped regions (i.e. high drop probability).
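The figures above are easy to re-derive; a small helper lets you plug in your own grid sizes and per-cell drop probabilities (the all-kept grid being the single most likely pattern):

```python
def p_all_kept(n_cells, p_drop):
    """Probability that a sampled grid keeps every cell (no dropout at all)."""
    return (1.0 - p_drop) ** n_cells

# mean case from above: ~532 cells at 2.5% drop probability per cell
print(p_all_kept(532, 0.025))   # ~1.4e-06, fairly unlikely
# less common case: ~163 cells at 0.75% drop probability per cell
print(p_all_kept(163, 0.0075))  # ~0.293, already fairly likely
# coarse 5x5 grid at 2.5% drop probability per cell
print(p_all_kept(25, 0.025))    # ~0.53, more likely than not
```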
First of all, thank you very much for your answer. I'm sorry, I forgot to provide my image size: my dataset consists of 112 x 112 images. In that case, I think the probability of identical dropout regions should be very small. But among the thousands of processed images I looked at (augmented with other methods besides CoarseDropout), I still found identical results. Do you have any suggestions to avoid this?
The second image looks suspicious. I would guess that pattern is quite unlikely to be sampled twice, especially as there also seems to be the same affine transformation applied. If you compared all images between the two batches, you would probably see the same effect for all other image pairs.
Ultimately these problems stem from multiple pipelines using the same seeds (i.e. the same random number generators). Usually, if a pipeline is defined a single time and then reused many times, that shouldn't happen, as the random number generators are advanced automatically after each call. Do you maybe use any form of multicore processing, like multiprocessing, or a dataloader with multiple workers (pytorch, keras, tensorflow and probably many other frameworks offer these)? These can cause multiple python processes (aka workers) to run with copies of the main process's objects. These copies also include the random number generators, resulting in all child processes sampling the same random numbers and hence applying the same transformations to different batches of data.
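A minimal demonstration of this copying effect (Unix-only, as it uses the fork start method; NumPy's global RNG stands in here for imgaug's internal generators):

```python
import multiprocessing as mp
import numpy as np

def sample(q):
    # Each forked child inherits a copy of the parent's global RNG state,
    # so every child draws the exact same "random" numbers.
    q.put(np.random.uniform(-45, 45, size=4).tolist())

np.random.seed(0)             # parent seeds once before forking
ctx = mp.get_context("fork")  # fork start method (not available on Windows)
q = ctx.Queue()
workers = [ctx.Process(target=sample, args=(q,)) for _ in range(2)]
for w in workers:
    w.start()
rotations_a, rotations_b = q.get(), q.get()
for w in workers:
    w.join()
print(rotations_a == rotations_b)  # True: both workers drew identical angles
```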
Yes, I used multiprocessing and a dataloader with multiple workers. What can I do to avoid this phenomenon (identical imgaug results)? “These can cause multiple python processes (aka workers) to run with copies of the main process's objects. These copies would then also copy the random number generators, resulting in all child processes to sample the same random numbers and hence apply the same transformations to different batches of data.” Does this mean that if there are 4 processes, each with 4 dataloader workers, there will be 16 identical imgaug results, even though the imgaug pipeline itself is not duplicated? In other words, if I have n pictures, there will only be n/16 distinct augmentation results. Am I right? Do you have any suggestions? Is this phenomenon due to the use of np.random in imgaug?
Does this mean that if there are 4 processes, each with 4 dataloader workers, there will be 16 identical imgaug results, even though the imgaug pipeline itself is not duplicated? In other words, if I have n pictures, there will only be n/16 distinct augmentation results.
Not entirely sure if I can follow. If you have 4 child/worker processes, then you probably have one imgaug augmentation pipeline (i.e. Sequential) defined in each one of them, which results in four pipelines. Each pipeline uses a seed value for its random number generator. (Actually, each augmenter has its own RNG, but that isn't very relevant here.) If these seed values are the same between the pipelines, they will generate the same samples. E.g. if there is an Affine rotation in each pipeline, it might sample [-35deg, +5deg, +17deg, -12deg, ...] in worker1, but it will also sample the same [-35deg, +5deg, +17deg, -12deg, ...] in worker2, and the same in worker3 and worker4. Then worker1 might get images [A, B, C, D] and produce [A with -35deg rotation, B with +5deg rotation, C with +17deg rotation, D with -12deg rotation], while worker2 might get images [E, F, G, H] and produce [E with -35deg rotation, F with +5deg rotation, G with +17deg rotation, H with -12deg rotation]. Analogously for worker3 and worker4. But you want these transformations to differ. That is why you have to make sure that the seeds of the RNGs are not identical between workers.
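The effect can be sketched with plain NumPy RNGs standing in for the two workers' pipelines: identical seeds produce identical "rotation" samples, even though the workers process different images.

```python
import numpy as np

# Two per-worker pipelines that happen to share a seed behave like one:
rng_worker1 = np.random.RandomState(42)
rng_worker2 = np.random.RandomState(42)

# "Affine rotation" angles sampled by each worker for its own batch:
angles_worker1 = rng_worker1.uniform(-45, 45, size=4)  # applied to images A-D
angles_worker2 = rng_worker2.uniform(-45, 45, size=4)  # applied to images E-H

print(np.allclose(angles_worker1, angles_worker2))  # True: same rotations
```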
Some methods to change these seeds:
- Call import imgaug as ia; ia.seed(<seedval>) and afterwards define the augmentation pipeline (so don't define the pipeline before you create the workers, as that would copy the pipeline from the main process to the child processes).
- Call seq.seed_(<seedval>) on your pipeline (assuming that seq is your augmentation pipeline, i.e. seq = iaa.Sequential(...)). You can also call seq.seed_(<seedval>) for every batch, which might be easier to implement, but then you have to make sure that seedval differs between batches.

In either case you have to make sure that seedval is different between workers. Either you provide it via the main process (e.g. start with 0 and increase it each time a worker is created), or you generate it within each worker from some noise source. You could use the time in nanoseconds for that, like seq.seed_(time.time_ns()) (requires Python 3.7+).
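A sketch of the per-worker seeding idea, with NumPy RNGs standing in for imgaug pipelines. base_seed and init_worker_rng are illustrative names; with imgaug you would instead call seq.seed_(base_seed + worker_id) inside each worker after it starts.

```python
import numpy as np

base_seed = 1234  # chosen once in the main process

def init_worker_rng(worker_id):
    # Derive a distinct seed per worker so that no two workers
    # sample the same augmentation parameters.
    return np.random.RandomState(base_seed + worker_id)

rngs = [init_worker_rng(i) for i in range(4)]
samples = [tuple(rng.uniform(-45, 45, size=3)) for rng in rngs]
print(len(set(samples)))  # 4: every worker draws its own sequence
```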
I checked some pytorch material. I found that when using pytorch's dataloader with multiple worker processes to load images (num_workers = 4), pytorch assigns a different seed to each worker. "However, seeds for other libraries may be duplicated upon initializing workers (e.g., NumPy), causing each worker to return identical random numbers." I ran an experiment and validated that conclusion: under different pytorch seeds I still got identical augmentation results. But when I started four dataloader processes with torch.multiprocessing.spawn, each with its own dataloader and num_workers = 4, I did not find identical augmentations under different seeds. It seems to be because of torch.multiprocessing.spawn, but I didn't find any useful information on this. Why does the spawn method work?
torch.multiprocessing.spawn starts the child processes via the spawn method; otherwise you use fork by default. In spawn mode the child processes are more independent from each other than in fork mode (which is also why they are slower to start in spawn mode). I could imagine that spawn mode results in imgaug being imported independently four times, which again means that it imports numpy four times and requests a random seed each time, so the system is queried four times for a random value and returns four distinct values, resulting in four distinct seeds. In fork mode that import might only happen once, resulting in a single value being requested and all child processes therefore using the same seed.
I don't know the fine details of when fork and spawn perform imports and how these are (or aren't) shared between processes, so I can't say anything for sure here.
If you have a seed in each pytorch worker process, you can also use that one to seed imgaug.
Thank you very much for your answer. Now I understand a little.
I use CoarseDropout to process my dataset. Looking at the processed images, I found some where the dropout area is exactly the same. But to my understanding the augmentation should be random, so it should be practically impossible to get the same augmentation result twice. Where did I go wrong? Or is this the expected behavior of the code? Do you have any advice? Thank you very much!