CosmiQ / solaris

CosmiQ Works Geospatial Machine Learning Analysis Toolkit
https://solaris.readthedocs.io
Apache License 2.0

[ERROR]: Training different data -> IndexError: index 3 is out of bounds for axis 2 with size 3 #277

Closed williamobrein closed 5 years ago

williamobrein commented 5 years ago

Hello, first of all, thank you for developing solaris. I've been working on object detection for a long time, but I'm new to GitHub, so please forgive any mistakes!

I tried to train with my own data, but I got the following error: IndexError: index 3 is out of bounds for axis 2 with size 3

As you described in the documentation, I split the satellite image (in tif format) into tiles and split the geojson files the same way. I then created my footprint masks, also in tif format, built the training and test csv files as you specified, and edited the configuration file of the pre-trained xdxd_spacenet4 model.

Error Message

solaris_run_ml -c xdxd_spacenet4.yml

When I run this command, I get the error shown below.

(solaris) deposerver@ubuntu:/mnt/depo1tb/yz/solaris/solaris/nets/configs$ solaris_run_ml -c xdxd_spacenet4.yml
/home/deposerver/anaconda3/envs/solaris/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/deposerver/anaconda3/envs/solaris/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/deposerver/anaconda3/envs/solaris/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/deposerver/anaconda3/envs/solaris/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/deposerver/anaconda3/envs/solaris/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/deposerver/anaconda3/envs/solaris/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
/home/deposerver/.local/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Traceback (most recent call last):
  File "/home/deposerver/anaconda3/envs/solaris/bin/solaris_run_ml", line 10, in <module>
    sys.exit(main())
  File "/home/deposerver/anaconda3/envs/solaris/lib/python3.6/site-packages/solaris/bin/solaris_run_ml.py", line 34, in main
    inferer(inf_df)
  File "/home/deposerver/anaconda3/envs/solaris/lib/python3.6/site-packages/solaris/nets/infer.py", line 64, in __call__
    src_im_height, src_im_width) = inf_tiler(im_path)
  File "/home/deposerver/anaconda3/envs/solaris/lib/python3.6/site-packages/solaris/nets/datagen.py", line 294, in __call__
    subarr = self.aug(image=subarr)['image']
  File "/home/deposerver/anaconda3/envs/solaris/lib/python3.6/site-packages/albumentations/core/composition.py", line 176, in __call__
    data = t(force_apply=force_apply, **data)
  File "/home/deposerver/anaconda3/envs/solaris/lib/python3.6/site-packages/albumentations/core/transforms_interface.py", line 87, in __call__
    return self.apply_with_params(params, **kwargs)
  File "/home/deposerver/anaconda3/envs/solaris/lib/python3.6/site-packages/albumentations/core/transforms_interface.py", line 100, in apply_with_params
    res[key] = target_function(arg, **dict(params, **target_dependencies))
  File "/home/deposerver/anaconda3/envs/solaris/lib/python3.6/site-packages/solaris/nets/transform.py", line 101, in apply
    return np.delete(im_arr, self.idx, self.axis)
  File "<__array_function__ internals>", line 6, in delete
  File "/home/deposerver/.local/lib/python3.6/site-packages/numpy/lib/function_base.py", line 4382, in delete
    "size %i" % (obj, axis, N))
IndexError: index 3 is out of bounds for axis 2 with size 3

How can I solve this problem? Does anybody have any ideas? Thanks in advance.

I also have some questions. It would be very helpful if you could answer them.

  1. What should the image format be? (tif, png, jpeg, etc.)
  2. RGB or BGR? (Assuming it is tif.)
  3. Does the bit depth of the image matter? 24, 16, 8, etc.? (Assuming it is tif.)
  4. Should the mask be in the form of images? Which file format should I use exactly?
  5. Is there a utility that automatically generates the training and test csv files? I could not find one; it would be very helpful.

Environment information

nrweir commented 5 years ago

Hi @williamobrein,

Thanks for putting this in. I'll have a look and get back to you as soon as I can.

-N

nrweir commented 5 years ago

In response to your questions:

  1. Currently we support tif strongly and png weakly. GeoTIFF is strongly recommended if your labels are georegistered.
  2. If you're using a pre-trained model from solaris, use the same channel order described in the model spec. If you're training your own model, it doesn't matter.
  3. Same as (2), this only really matters if you're using a pretrained model from solaris, in which case you should normalize the same way as that model does. For example, the xdxd pretrained model z-scores the dataset and converts to float32. In most cases, values are converted to float32 before being passed into the model.
  4. Yes, masks should be images. See the tutorials on mask creation.
  5. At present we don't have that, but it's on the list of tasks to do.
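
In the meantime, building the CSV yourself is only a few lines of pandas. Here's a minimal sketch, assuming the image/label column layout described in the solaris data docs (double-check the column names against the docs for your version); the directory names are just placeholders:

import os
import pandas as pd

image_dir = "tiles/images"  # placeholder: folder of image tiles
mask_dir = "tiles/masks"    # placeholder: folder of matching footprint masks

rows = []
for fname in sorted(os.listdir(image_dir)):
    if fname.lower().endswith(".tif"):
        rows.append({
            "image": os.path.join(image_dir, fname),
            "label": os.path.join(mask_dir, fname),  # assumes the mask shares the tile's filename
        })

pd.DataFrame(rows).to_csv("train_data.csv", index=False)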
nrweir commented 5 years ago

Can you give any more information about the imagery you're using?

If I had to guess, the above error is likely due to the DropChannel augmenter being left in xdxd_spacenet.yml when input images are only 3 channels. That augmenter is included to drop a 4th channel that's present in the SpaceNet imagery used to train XD_XD's model before feeding imagery into the model, and therefore will introduce an error if your dataset doesn't contain 4 channels.
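
For context, the traceback above bottoms out in np.delete(im_arr, self.idx, self.axis) with idx=3 and axis=2, so the failure reproduces with plain numpy on any 3-band array:

import numpy as np

# A 3-channel (RGB) tile like the one described above:
im_arr = np.zeros((512, 512, 3), dtype=np.uint8)

try:
    np.delete(im_arr, 3, axis=2)  # what DropChannel effectively asks numpy to do
except IndexError as err:
    print(err)  # index 3 is out of bounds for axis 2 with size 3

# With a 4-band tile (e.g. RGB + near-IR, as in SpaceNet), the same call simply drops band 4:
im_arr_4band = np.zeros((512, 512, 4), dtype=np.uint8)
print(np.delete(im_arr_4band, 3, axis=2).shape)  # (512, 512, 3)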

williamobrein commented 5 years ago

Thank you very much for the quick return.

  1. My data is in GeoTIFF format. So I guess it's okay.
  2. I'm experimenting with solaris's pre-trained models. I use XD_XD because it is more successful than the others. I couldn't find any information about the channel order at the link you specified. Is that the training input information you're talking about? I'm sorry if I missed it. I think you mentioned that you trained the images as BGR in an issue I read before (#212).
  3. So I don't need to do anything if the pre-trained models preprocess the data themselves; I just prepare the training and test sets as specified in the documentation and edit the configuration file.
  4. I have reviewed the documentation in detail. However, the mask creation steps only create the mask array. I used the following code to export my mask as a tif. I guess there's nothing wrong here? Is tif an acceptable format?
from PIL import Image

im = Image.fromarray(fp_mask)
im.save("mask.tif")
williamobrein commented 5 years ago

My dataset is 3-channel (RGB). Does the SpaceNet dataset have RGBA channels? So the 4th channel is alpha?

If the model's input accepts 3 channels, I think I need to remove that preprocessing step. Am I right? Can you help with this, or is there another way you suggest I follow?

Below is the information of a tile after the gdalinfo command.

(solaris) deposerver@ubuntu:/mnt/depo1tb/yz/solaris/$ gdalinfo ../tile.tif 
Driver: GTiff/GeoTIFF
Files: ../tile.tif
Size is 512, 512
Coordinate System is:
GEOGCS["WGS 84",
    DATUM["WGS_1984",
        SPHEROID["WGS 84",6378137,298.257223563,
            AUTHORITY["EPSG","7030"]],
        AUTHORITY["EPSG","6326"]],
    PRIMEM["Greenwich",0],
    UNIT["degree",0.0174532925199433],
    AUTHORITY["EPSG","4326"]]
Origin = (27.119808000000138,38.371342000000169)
Pixel Size = (0.000003000000000,-0.000003000000000)
Metadata:
  AREA_OR_POINT=Area
Image Structure Metadata:
  COMPRESSION=LZW
  INTERLEAVE=PIXEL
Corner Coordinates:
Upper Left  (  27.1198080,  38.3713420) ( 27d 7'11.31"E, 38d22'16.83"N)
Lower Left  (  27.1198080,  38.3698060) ( 27d 7'11.31"E, 38d22'11.30"N)
Upper Right (  27.1213440,  38.3713420) ( 27d 7'16.84"E, 38d22'16.83"N)
Lower Right (  27.1213440,  38.3698060) ( 27d 7'16.84"E, 38d22'11.30"N)
Center      (  27.1205760,  38.3705740) ( 27d 7'14.07"E, 38d22'14.07"N)
Band 1 Block=512x5 Type=Byte, ColorInterp=Red
Band 2 Block=512x5 Type=Byte, ColorInterp=Green
Band 3 Block=512x5 Type=Byte, ColorInterp=Blue
nrweir commented 5 years ago

Hi @williamobrein,

My data is in GeoTIFF format. So I guess it's okay.

Yes! It should be!

I'm experimenting with solaris's pre-trained models. I use XD_XD because it is more successful than the others. I couldn't find any information about the channel order at the link you specified. Is that the training input information you're talking about? I'm sorry if I missed it. I think you mentioned that you trained the images as BGR in an issue I read before (#212).

My apologies for the confusion - I thought it was delineated there, but looking again I see I'm incorrect. Yes, it's BGR. There's a SwapChannels augmentation in the dev branch (not yet merged into master) that you could use to convert your images to the correct channel order in-line if you want.
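
If you'd rather re-save your tiles in BGR order up front instead of waiting for that merge, a minimal rasterio sketch along these lines should work (assuming 3-band GeoTIFF tiles; the file names are placeholders):

import rasterio

with rasterio.open("tile_rgb.tif") as src:
    profile = src.profile
    data = src.read()      # shape (bands, rows, cols), in R, G, B order

with rasterio.open("tile_bgr.tif", "w", **profile) as dst:
    dst.write(data[::-1])  # reverse the band axis -> B, G, R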

So I don't need to do anything if the pre-trained models preprocess the data themselves; I just prepare the training and test sets as specified in the documentation and edit the configuration file.

That's correct - the only thing you'll need to do is calculate the mean and standard deviation for each channel (then divide by the bit depth, because that's how albumentations works, which is what solaris uses) and replace those values in the config file. You don't need to do the pre-processing yourself - it's done in-line.

I have reviewed the documentation in detail. However, the mask creation steps only create the mask array. I used the following code to export my mask as a tif. I guess there's nothing wrong here? Is tif an acceptable format?

Yes, TIF works! That should've been specified there, sorry it's unclear.
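
One caveat: PIL writes a plain TIFF and drops the georeferencing. If you want the mask to stay georegistered with its tile, a sketch like this with rasterio, copying the profile from the source tile (file names are placeholders), is one option:

import rasterio

with rasterio.open("tile.tif") as src:         # the image tile the mask was made from
    profile = src.profile.copy()

profile.update(count=1, dtype=fp_mask.dtype.name)  # single-band mask

with rasterio.open("mask.tif", "w", **profile) as dst:
    dst.write(fp_mask, 1)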

My dataset is 3-channel (RGB). Does the SpaceNet dataset have RGBA channels? So the 4th channel is alpha?

Actually the 4th channel in the SpaceNet Atlanta dataset is Near-IR, which gets dropped.

If the model's input accepts 3 channels, I think I need to remove that preprocessing step. Am I right? Can you help with this, or is there another way you suggest I follow?

Yes, you need to remove the DropChannel augmenter from the pipeline - just delete those lines from the config file you're using. Make sure to remove it from the training_augmentations, validation_augmentations, and inference_augmentations sections of the config file. If you're using the version of Solaris available in the dev branch, you can also add a SwapChannels augmenter to swap the R and B channels in-line rather than having to re-make all of your image files. So, something like this should work:

training_augmentation:
  augmentations:
    SwapChannels:
      first_idx: 0
      second_idx: 2
      p: 1.0
    HorizontalFlip:
      p: 0.5
    RandomRotate90:
      p: 0.5
    RandomCrop:
      height: 512
      width: 512
      p: 1.0
    Normalize:  # you'll need to fix these values for your own dataset
      mean:
        - 0.006479
        - 0.009328
        - 0.01123
      std:
        - 0.004986
        - 0.004964
        - 0.004950
      max_pixel_value: 65535.0
      p: 1.0
  p: 1.0
  shuffle: true
nrweir commented 5 years ago

If this addresses these issues let me know (or just close the issue). Thanks!

williamobrein commented 5 years ago

Sorry for the late reply; I had other work to do. Thank you for the answers. I can manually convert to BGR for now, but I would appreciate it if SwapChannels were merged into the master branch as soon as possible!

All of my previous questions have been resolved, thanks for your help. But I have some new questions.

I'll calculate the mean and standard deviation for each channel, divide by the bit depth, and then start training. After that I'll let you know and close the issue. Don't worry about it!

  1. Using the command below, I obtained the mean and standard deviation. As you said, after calculating them, I divided by the bit depth.
gdalinfo -mm -stats data_bgr.tif

My values for just one channel are as follows.

    STATISTICS_MEAN=110.12675671955
    STATISTICS_STDDEV=46.520330868899

When I divide by the bit depth (my data is 24-bit, so is the bit depth 2^24?), I get this result.

 Mean / bit depth (24 bit) = 4.58861486331
 Std / bit depth (24 bit) = 1.93834711954

Is there a specific format for writing these values, or should I write the result directly? Your values seem to follow a particular format. Can you explain if I'm doing something wrong? How exactly does this calculation work? Can you give me an example?

  2. Can we give the model input data in ecw format (1:36 ecw compression ratio)? It's about file size. As I mentioned above, my data is actually ecw and I'm converting it to GeoTIFF. One ecw is only about 50 MB, but the converted GeoTIFF's file size is very large, and the conversion process takes a long time. (If I'm not mistaken, SpaceNet's pixel size is 0.0000027 arc; the pixel size of my data is 0.000003 arc.)

  3. Could COG (Cloud Optimized GeoTIFF) be an alternative format? Its file size is relatively smaller.

  4. Should I change the resolution and quality? Which setting gives the best result? For the model input we divide everything into 512x512 tiles, yes, but should we do any processing beforehand? Any suggestions? Do you think this affects the model's success?

  5. Should I resample and change the quality (pixel size)? Does this affect performance? And how should I resample (nearest neighbor, etc.)?

  6. Did you include non-building tiles in the training? I thought you didn't.

nrweir commented 5 years ago

When I divide by the bit depth (my data is 24-bit, so is the bit depth 2^24?), I get this result. Mean / bit depth (24 bit) = 4.58861486331, Std / bit depth (24 bit) = 1.93834711954. Is there a specific format for writing these values, or should I write the result directly? Your values seem to follow a particular format. Can you explain if I'm doing something wrong? How exactly does this calculation work? Can you give me an example?

Apologies, I wasn't super clear there:
Step 1. Calculate the mean and std as you did.
Step 2. Divide those values by the maximum value for your bit depth (so 2^24 = 16777216).

So, in your case for the mean, 110.12675671955/16777216 = 0.00000656406
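
As a concrete sketch of that calculation across a whole folder of tiles (so you don't have to merge them into a single file), something like the following should work; the glob pattern is a placeholder, and max_pixel_value should be the maximum for your bit depth as above:

import glob

import numpy as np
import rasterio

max_pixel_value = 16777216.0                    # 2^24 for 24-bit data, 65535 for 16-bit, etc.
tile_paths = glob.glob("tiles/images/*.tif")    # placeholder pattern

n_pixels = 0
channel_sum = None
channel_sq_sum = None

for path in tile_paths:
    with rasterio.open(path) as src:
        arr = src.read().astype(np.float64)     # shape (bands, rows, cols)
    if channel_sum is None:
        channel_sum = np.zeros(arr.shape[0])
        channel_sq_sum = np.zeros(arr.shape[0])
    n_pixels += arr.shape[1] * arr.shape[2]
    channel_sum += arr.sum(axis=(1, 2))
    channel_sq_sum += (arr ** 2).sum(axis=(1, 2))

mean = channel_sum / n_pixels
std = np.sqrt(channel_sq_sum / n_pixels - mean ** 2)

print("mean:", mean / max_pixel_value)  # values to paste into the Normalize block
print("std: ", std / max_pixel_value)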

Do your images use the full 24-bit range? That is, are there pixel values approaching 16777216? If not, you could truncate to 16-bit (maximum pixel value 65535) or 8-bit (max value 255) and re-save the images, which would make them substantially smaller files. 24-bit images are indeed huge.

Can we give the model input data in ecw format (1:36 ecw compression ratio)? It's about file size. As I mentioned above, my data is actually ecw and I'm converting it to GeoTIFF. One ecw is only about 50 MB, but the converted GeoTIFF's file size is very large, and the conversion process takes a long time. (If I'm not mistaken, SpaceNet's pixel size is 0.0000027 arc; the pixel size of my data is 0.000003 arc.)

We'd love to accommodate more image types as part of solaris, but I don't know if the maintainers will have time to add functionality for ecw support anytime soon. If this is something you're interested in adding yourself, you're welcome to do so - see the contributing guidelines for details. If you personally won't have time to do so, I encourage you to create a separate issue delineating what you would like to have done, both for our tracking purposes and so another contributor could potentially take up the task.

Could COG (Cloud Optimized GeoTIFF) be an alternative format? Its file size is relatively smaller.

Yes, this is something we're exploring (see #163). Development time has limited our ability to implement it so far, however...another area where a community contributor would be welcome to step in.

Should I change the resolution and quality? Which setting gives the best result? For the model input we divide everything into 512x512 tiles, yes, but should we do any processing beforehand? Any suggestions? Do you think this affects the model's success?

Resolution, re-scaling, and how they impact model performance remain to some degree an open question in the field. If you're trying to directly use the pre-trained weights from XD_XD's model without any fine-tuning, you will want the pixel size to be very similar to the SpaceNet Atlanta data (0.5 m/px). If you're fine-tuning the model weights using your own dataset, or re-training completely, it shouldn't matter as much.

Should I resample and change the quality (pixel size)? Does this affect performance? And how should I resample (nearest neighbor, etc.)?

I don't have a great answer for this beyond what I said in response to the last question. My personal opinion: since re-sampling will almost always result in some loss of information from the source data, it should be avoided when possible; however, I'm not aware of any studies that have directly examined the validity of my assumption. If you do resample, I recommend bilinear or bicubic resampling. For example, in the SpaceNet Atlanta dataset, all of the different collects were re-sampled to 0.5 m/px using bilinear resampling to ensure consistency within the dataset.
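
If you do end up resampling, a quick sketch with the GDAL Python bindings (the target pixel size and file names are placeholders; bilinear per the recommendation above):

from osgeo import gdal

# Resample to a hypothetical target pixel size, in the units of the image's CRS (degrees here).
gdal.Warp(
    "tile_resampled.tif",
    "tile.tif",
    xRes=0.0000027,
    yRes=0.0000027,
    resampleAlg="bilinear",
)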

Did you include non-building tiles in the training? I thought you didn't.

Yes we did - we included the full SpaceNet Atlanta training set, which includes non-building tiles. Particularly if your testing dataset is likely to include non-building tiles, this can be valuable.

williamobrein commented 5 years ago

Apologies, I wasn't super clear there:
Step 1. Calculate the mean and std as you did.
Step 2. Divide those values by the maximum value for your bit depth (so 2^24 = 16777216).

So, in your case for the mean, 110.12675671955/16777216 = 0.00000656406

Thanks, that's what I did. Nice to confirm. Why do we give the maximum pixel value, standard deviation, and mean value? I've never seen anything like that in traditional object detection models. Can you explain? Or do you have a source to suggest?

Isn't the SpaceNet Atlanta data RGB? How can you use 16 bits? I think it should be 24 bits (R[8]G[8]B[8] = 24). What's the difference?

There are several tif files in the SpaceNet Atlanta dataset. How did you compute their mean and standard deviation? Did you merge them into one file, or do you have a different method? I have a lot of tif files, and I don't know how to calculate the mean and standard deviation of each file separately and then combine them. Merging them all into one file takes a long time and requires a lot of processing power.

Do your images use the full 24-bit range? That is, are there pixel values approaching 16777216? If not, you could truncate to 16-bit (maximum pixel value 65535) or 8-bit (max value 255) and re-save the images, which would make them substantially smaller files. 24-bit images are indeed huge.

Yes you are right! 24-bit images are too large. I will consider your suggestions on this issue. I'm gonna check my images and make edits.

We'd love to accommodate more image types as part of solaris, but I don't know if the maintainers will have time to add functionality for ecw support anytime soon. If this is something you're interested in adding yourself, you're welcome to do so - see the contributing guidelines for details. If you personally won't have time to do so, I encourage you to create a separate issue delineating what you would like to have done, both for our tracking purposes and so another contributor could potentially take up the task.

I understand you very well. I don't have time to implement this anytime soon, but I can open an issue like you said. That will be useful for anyone who wants to add it.

Yes, this is something we're exploring (see #163). Development time has limited our ability to implement it so far, however...another area where a community contributor would be welcome to step in.

I understand. I'm going to do some research on this and I'll let you know if I find anything.

Resolution, re-scaling, and how they impact model performance remain to some degree an open question in the field. If you're trying to directly use the pre-trained weights from XD_XD's model without any fine-tuning, you will want the pixel size to be very similar to the SpaceNet Atlanta data (0.5 m/px). If you're fine-tuning the model weights using your own dataset, or re-training completely, it shouldn't matter as much.

Thank you, that's what I thought; I just wanted to get your opinion. If I end up training completely from scratch, I might try it.

I don't have a great answer for this beyond what I said in response to the last question. My personal opinion: since re-sampling will almost always result in some loss of information from the source data, it should be avoided when possible; however, I'm not aware of any studies that have directly examined the validity of my assumption. If you do resample, I recommend bilinear or bicubic resampling. For example, in the SpaceNet Atlanta dataset, all of the different collects were re-sampled to 0.5 m/px using bilinear resampling to ensure consistency within the dataset.

Actually, I didn't resample because I thought there would be data loss, but I wanted to ask in case you know of an academic resource. My data is consistent in this respect, so I don't need any resampling at this time.

Yes we did - we included the full SpaceNet Atlanta training set, which includes non-building tiles. Particularly if your testing dataset is likely to include non-building tiles, this can be valuable.

What is the purpose of including non-building images? While the model learns from images of buildings, doesn't it also learn about the rest of the scene? I'm confused here. Is the logic different from object recognition?

nrweir commented 5 years ago

Thanks, that's what I did. Nice to confirm. Why do we give the maximum pixel value, standard deviation, and mean value? I've never seen anything like that in traditional object detection models. Can you explain? Or do you have a source to suggest?

A fairly common practice for many computer vision models is to either z-score pixel intensities or normalize them to a 0-1 range. In this case, we're using the albumentations library to run z-scoring. albumentations.Normalize expects the mean and standard deviations to be scaled per the image's max pixel intensity - that's just the way the library is set up.

Normalization is important to achieve consistent performance across images from different sensors/collects. For example, SpaceNet Atlanta's pixel intensities are mostly between 0 and 1200; if you provided images that ranged in pixel values from 100 to 200, the model would likely have no idea how to generate valid predictions. We show this in the 4th notebook in the Solaris FOSS4G tutorial.
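
To make the arithmetic concrete, here's a small numpy sketch of that convention as I understand albumentations' Normalize, using the example values from the config earlier in this thread:

import numpy as np

max_pixel_value = 65535.0                          # as in the config above
mean = np.array([0.006479, 0.009328, 0.01123])     # per-channel mean, already divided by max_pixel_value
std = np.array([0.004986, 0.004964, 0.004950])     # per-channel std, already divided by max_pixel_value

img = np.random.randint(0, 1200, size=(512, 512, 3)).astype(np.float32)  # stand-in tile

# z-scoring: subtract the (unscaled) mean and divide by the (unscaled) std
normalized = (img - mean * max_pixel_value) / (std * max_pixel_value)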

Isn't the SpaceNet Atlanta data RGB? How can you use 16 bits? I think it should be 24 bits (R[8]G[8]B[8] = 24). What's the difference?

Apologies, I was providing per-channel bit depth. Every channel (R, G, B, and near-IR) is encoded as 16-bit values. Looks like I misinterpreted your description of your image - I took 24 bit to mean 24 bits per channel.

What is the purpose of including non-building images? While the model learns from images of buildings, doesn't it also learn about the rest of the scene? I'm confused here. Is the logic different from object recognition?

Though I haven't explored this in great detail personally, my expectation is that during training, the model learns the distribution of number of building pixels per image to some degree. As U-Nets utilize both whole-image information (at the middle layers) as well as fine-grained information (in the beginning and end), I could envision a model learning that it should never predict zero building pixels if the training set it's provided never has zero building pixels. Generally, best practices recommend matching your training and testing datasets' distributions to one another, and if your test set includes building-free images, we believe that training should too. Segmentation models do indeed work differently from object detection models, which generally provide proposals and classification and then use NMS to filter out bad predictions. I could envision these two handling variation between training and testing set distributions differently.

williamobrein commented 5 years ago

A fairly common practice for many computer vision models is to either z-score pixel intensities or normalize them to a 0-1 range. In this case, we're using the albumentations library to run z-scoring. albumentations.Normalize expects the mean and standard deviations to be scaled per the image's max pixel intensity - that's just the way the library is set up.

I understand. I need to study Albumentations. Thanks for the lead.

Normalization is important to achieve consistent performance across images from different sensors/collects. For example, SpaceNet Atlanta's pixel intensities are mostly between 0 and 1200; if you provided images that ranged in pixel values from 100 to 200, the model would likely have no idea how to generate valid predictions. We show this in the 4th notebook in the Solaris FOSS4G tutorial.

I know that normalization is important in image detection, but I don't know about z-scoring. I think I need to figure that out. I'll check the notebooks. Thank you!

Apologies, I was providing per-channel bit depth. Every channel (R, G, B, and near-IR) is encoded as 16-bit values. Looks like I misinterpreted your description of your image - I took 24 bit to mean 24 bits per channel.

Now everything is clear. I'm going to review my image and make it 16 bits per channel and start training like that. Thank you!

Though I haven't explored this in great detail personally, my expectation is that during training, the model learns the distribution of number of building pixels per image to some degree. As U-Nets utilize both whole-image information (at the middle layers) as well as fine-grained information (in the beginning and end), I could envision a model learning that it should never predict zero building pixels if the training set it's provided never has zero building pixels. Generally, best practices recommend matching your training and testing datasets' distributions to one another, and if your test set includes building-free images, we believe that training should too. Segmentation models do indeed work differently from object detection models, which generally provide proposals and classification and then use NMS to filter out bad predictions. I could envision these two handling variation between training and testing set distributions differently.

I assumed that the images containing buildings would already include other areas, I mean areas without buildings; I guess that's how traditional object recognition works, though it depends on the model. But what you say makes sense. Of course, there will be parts of my test set that do not include buildings. I'm going to adjust my data based on what we've discussed!

I think we can close this. Thanks again for everything!

williamobrein commented 5 years ago

Can we run the solaris preprocessing on the GPU? It takes too long on the CPU.

nrweir commented 5 years ago

Do you mean pre-processing in terms of tiling or image augmentation before it's fed into the model?

Either way, the present answer is no, but an enterprising user would be welcome to make a PR.

If you want to encourage that, I'd recommend creating a new issue here for it - I'm going to close this one since we've moved fairly far afield from the original question.