Berkeley-Data / hpt

MIT License

Minimum Pretraining Code with Sen12MS #1

Open taeil opened 3 years ago

taeil commented 3 years ago

I'm trying to compute the dataset pixel mean/std and getting an error. Command:

cd src/data/
./compute-dataset-pixel-mean-std.py --data /scratch/crguest/data/sen12ms_small 

Error:

    main(parser.parse_args())
  File "./compute-dataset-pixel-mean-std.py", line 49, in main
    for data, _ in loader:
  File "/scratch/crguest/miniconda3/envs/hp120/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 819, in __next__
    return self._process_data(data)
  File "/scratch/crguest/miniconda3/envs/hp120/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 846, in _process_data
    data.reraise()
  File "/scratch/crguest/miniconda3/envs/hp120/lib/python3.7/site-packages/torch/_utils.py", line 369, in reraise
    raise self.exc_type(msg)
OSError: Caught OSError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/scratch/crguest/miniconda3/envs/hp120/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/scratch/crguest/miniconda3/envs/hp120/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/scratch/crguest/miniconda3/envs/hp120/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/scratch/crguest/miniconda3/envs/hp120/lib/python3.7/site-packages/torchvision/datasets/folder.py", line 138, in __getitem__
    sample = self.loader(path)
  File "/scratch/crguest/miniconda3/envs/hp120/lib/python3.7/site-packages/torchvision/datasets/folder.py", line 174, in default_loader
    return pil_loader(path)
  File "/scratch/crguest/miniconda3/envs/hp120/lib/python3.7/site-packages/torchvision/datasets/folder.py", line 156, in pil_loader
    img = Image.open(f)
  File "/scratch/crguest/miniconda3/envs/hp120/lib/python3.7/site-packages/PIL/Image.py", line 2818, in open
    raise IOError("cannot identify image file %r" % (filename if filename else fp))
OSError: cannot identify image file <_io.BufferedReader name='/scratch/crguest/data/sen12ms_small/p291.tif_spring/ROIs1158_spring_s2_121_p291.tif'>
(hp120) ➜  data git:(taeil) ✗ ls -al /scratch/crguest/data/sen12ms_small/p291.tif_spring
total 2492
drwxrwxr-x   2 crguest crguest    4096 Mar  5 23:10 .
drwxrwxr-x 247 crguest crguest   45056 Mar  5 23:10 ..
-rwxrwxr-x   1 crguest crguest  262788 Mar  5 23:10 ROIs1158_spring_lc_121_p291.tif
-rwxrwxr-x   1 crguest crguest  525172 Mar  5 23:10 ROIs1158_spring_s1_121_p291.tif
-rwxrwxr-x   1 crguest crguest 1706432 Mar  5 23:10 ROIs1158_spring_s2_121_p291.tif
taeil commented 3 years ago

I changed the folder path so that it no longer contains a dot, just in case, but still no luck.

suryagutta commented 3 years ago

There is no issue with normal JPEG images or TIFFs from BigEarthNet, and no issue with the SEN12MS LC images. The problem occurs only with the SEN12MS S1 and S2 images. These Sentinel images store multiple bands in a single file, whereas BigEarthNet has a separate image for each band.

We can reproduce the issue with a simple python script:

from PIL import Image  
path = "/scratch/crguest/data/sen12ms_small3/test/p124_summer/ROIs1868_summer_s2_11_p124.tif"  
page = Image.open(path)
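
To see what such a file contains without going through Pillow, one option is to read the TIFF header directly. The sketch below is a minimal, standard-library-only illustration; `read_tiff_summary` is a hypothetical helper (not part of this repository), and it assumes a classic (non-BigTIFF) file whose relevant tags are stored as SHORT or LONG values in the first IFD.

```python
import struct

# TIFF tags of interest (numbers from the TIFF 6.0 specification)
TAGS = {256: "width", 257: "height", 258: "bits_per_sample", 277: "samples_per_pixel"}
# Field types handled here: 3 = SHORT (2 bytes), 4 = LONG (4 bytes)
TYPE_FMT = {3: "H", 4: "I"}


def read_tiff_summary(path):
    """Return size, bit depth and band count from the first IFD of a classic TIFF."""
    with open(path, "rb") as f:
        header = f.read(8)
        if header[:2] == b"II":
            order = "<"  # little-endian
        elif header[:2] == b"MM":
            order = ">"  # big-endian
        else:
            raise ValueError("not a TIFF file")
        magic, ifd_offset = struct.unpack(order + "HI", header[2:8])
        if magic != 42:
            raise ValueError("not a classic TIFF (magic number %d)" % magic)

        f.seek(ifd_offset)
        (n_entries,) = struct.unpack(order + "H", f.read(2))
        entries = f.read(12 * n_entries)

        summary = {}
        for i in range(n_entries):
            raw = entries[12 * i: 12 * i + 12]
            tag, field_type, count = struct.unpack(order + "HHI", raw[:8])
            if tag not in TAGS or field_type not in TYPE_FMT:
                continue
            fmt = order + str(count) + TYPE_FMT[field_type]
            size = struct.calcsize(fmt)
            if size <= 4:
                # Small values are stored inline in the entry itself
                data = raw[8:8 + size]
            else:
                # Larger values live elsewhere in the file; the entry holds an offset
                (offset,) = struct.unpack(order + "I", raw[8:12])
                f.seek(offset)
                data = f.read(size)
            values = struct.unpack(fmt, data)
            summary[TAGS[tag]] = values[0] if count == 1 else list(values)
    return summary
```

Calling `read_tiff_summary(path)` on the S2 file above would report how many samples per pixel and how many bits per sample it holds, which helps confirm whether the band layout is one Pillow has a mode for.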
suryagutta commented 3 years ago

We are using the image reader from Pillow: https://readthedocs.org/projects/pillow/downloads/pdf/latest/ On what I believe is page 21, it says: "However, Pillow doesn't support user-defined modes; if you need to handle band combinations that are not listed above, use a sequence of Image objects." Checking on that.

suryagutta commented 3 years ago

There could be a limitation in PyTorch when many channels are stacked in a single TIFF file: https://github.com/pytorch/vision/issues/514
Comment: "I just discovered a way to do this, I am not sure it can be solution to your problem, but I'll share it in case it can be useful to others. In my case I had multi-channel Tiff images, and I wanted to classify them using CNNs in Pytorch. I honestly gave up on data augmentation using Transforms in Pytorch, and I performed data augmentation offline (let's say in my input folders I have original data as well as augmented ones). The game changer is however defining your own loader + taking advantage of Tifffile library in python. This is how I did it for my training set (val and test should be the same):"

import tifffile
from torchvision import datasets, transforms

def my_tiff_loader(filename):
    # Read every band of the TIFF into a NumPy array, bypassing Pillow
    return tifffile.imread(filename)

train_transform = transforms.Compose([transforms.ToTensor()])

train_data = datasets.ImageFolder('PATH TO TRAINSET', loader=my_tiff_loader, transform=train_transform)

Note: S1 has only 2 bands. LC has 4 bands and it works, but the LC files are small.
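
With a loader like `my_tiff_loader` returning NumPy arrays, the per-band mean/std that the original command was trying to compute could be accumulated roughly as follows. This is only a sketch and not the contents of `compute-dataset-pixel-mean-std.py`: the class name `RunningBandStats` is hypothetical, and it assumes each image arrives as an array with the band axis last unless `band_axis` says otherwise.

```python
import numpy as np


class RunningBandStats:
    """Accumulate per-band mean and std over many images without holding them in memory."""

    def __init__(self, num_bands):
        self.count = 0  # number of pixels seen per band
        self.total = np.zeros(num_bands, dtype=np.float64)
        self.total_sq = np.zeros(num_bands, dtype=np.float64)

    def update(self, image, band_axis=-1):
        # Move the band axis last, then flatten to (num_pixels, num_bands)
        arr = np.moveaxis(np.asarray(image, dtype=np.float64), band_axis, -1)
        pixels = arr.reshape(-1, arr.shape[-1])
        self.count += pixels.shape[0]
        self.total += pixels.sum(axis=0)
        self.total_sq += (pixels ** 2).sum(axis=0)

    def finalize(self):
        mean = self.total / self.count
        # Var[x] = E[x^2] - E[x]^2; clamp tiny negatives caused by rounding
        var = np.maximum(self.total_sq / self.count - mean ** 2, 0.0)
        return mean, np.sqrt(var)
```

Each image from the dataset would be passed to `update()`, with `finalize()` called once at the end. Accumulating in float64 keeps the sum-of-squares approach reasonably stable for 16-bit pixel values.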

taeil commented 3 years ago

Found custom data loader code for SEN12MS, so we are unblocked for now. Some additional reading that is helpful for everyone.

suryagutta commented 3 years ago

So, if satellite imagery has many stacked bands or custom-mode channels, we need to use a loader based on rasterio instead of Pillow.
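
A rasterio-based loader along those lines might look like the sketch below. It is not the custom SEN12MS loader mentioned above; `rasterio_loader` and `to_float_chw` are hypothetical names, and the conversion to float32 is an assumption made for convenience. `rasterio` is imported inside the function so the NumPy helper can be used on its own.

```python
import numpy as np


def to_float_chw(bands):
    """Return the band data as a float32 array of shape (bands, height, width)."""
    arr = np.asarray(bands)
    if arr.ndim == 2:
        # A single-band read comes back as (height, width); add a band axis
        arr = arr[np.newaxis, ...]
    return arr.astype(np.float32)


def rasterio_loader(path):
    """Read all bands of a GeoTIFF with rasterio instead of Pillow."""
    import rasterio  # third-party dependency, imported lazily

    with rasterio.open(path) as src:
        # src.read() with no arguments returns every band as (count, height, width)
        return to_float_chw(src.read())
```

Because the result is already channels-first, it would be converted with `torch.from_numpy` rather than `transforms.ToTensor()`, which assumes an H x W x C layout for NumPy input.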

taeil commented 3 years ago

Added an additional pipeline and data sources for SEN12MS. Good progress has been made; see the changes on the taeil branch. There is a chance that we can run pre-training tomorrow.

taeil commented 3 years ago

https://github.com/Berkeley-Data/OpenSelfSup/pull/1