Hi @williamobrein,
Thanks for putting this in. I'll have a look and get back to you as soon as I can.
-N
Can you give any more information about the imagery you're using?
If I had to guess, the above error is likely due to the DropChannel augmenter being left in xdxd_spacenet.yml when input images are only 3 channels. That augmenter is included to drop a 4th channel that's present in the SpaceNet imagery used to train XD_XD's model before feeding imagery into the model, and will therefore introduce an error if your dataset doesn't contain 4 channels.
In response to your questions:
- Currently we support tif strongly and png weakly. GeoTIFF is strongly recommended if your labels are georegistered.
- If you're using a pre-trained model from solaris, use the same channel order described in the model spec. If you're training your own model, it doesn't matter.
- Same as (2), this only really matters if you're using a pretrained model from solaris, in which case you should normalize the same way as that model does. For example, the xdxd pretrained model z-scores the dataset and converts to float32. In most cases, values are converted to float32 before being passed into the model.
- Yes, masks should be images. See the tutorials on mask creation.
- At present we don't have that, but it's on the list of tasks to do.
Thank you very much for the quick reply.
from PIL import Image

im = Image.fromarray(fp_mask)
im.save("mask.tif")
Can you give any more information about the imagery you're using?
If I had to guess, the above error is likely due to the DropChannel augmenter being left in xdxd_spacenet.yml when input images are only 3 channels. That augmenter is included to drop a 4th channel that's present in the SpaceNet imagery used to train XD_XD's model before feeding imagery into the model, and will therefore introduce an error if your dataset doesn't contain 4 channels.
My dataset is 3-channel (RGB). Does the SpaceNet dataset have RGBA channels? So the 4th channel is alpha?
If the input of the model accepts 3 channels, I think I need to remove your preprocessing process. Am I right? Can you help with this? Or is there any other way you suggest I should follow?
Below is the information for one tile from the gdalinfo command.
(solaris) deposerver@ubuntu:/mnt/depo1tb/yz/solaris/$ gdalinfo ../tile.tif
Driver: GTiff/GeoTIFF
Files: ../tile.tif
Size is 512, 512
Coordinate System is:
GEOGCS["WGS 84",
DATUM["WGS_1984",
SPHEROID["WGS 84",6378137,298.257223563,
AUTHORITY["EPSG","7030"]],
AUTHORITY["EPSG","6326"]],
PRIMEM["Greenwich",0],
UNIT["degree",0.0174532925199433],
AUTHORITY["EPSG","4326"]]
Origin = (27.119808000000138,38.371342000000169)
Pixel Size = (0.000003000000000,-0.000003000000000)
Metadata:
AREA_OR_POINT=Area
Image Structure Metadata:
COMPRESSION=LZW
INTERLEAVE=PIXEL
Corner Coordinates:
Upper Left ( 27.1198080, 38.3713420) ( 27d 7'11.31"E, 38d22'16.83"N)
Lower Left ( 27.1198080, 38.3698060) ( 27d 7'11.31"E, 38d22'11.30"N)
Upper Right ( 27.1213440, 38.3713420) ( 27d 7'16.84"E, 38d22'16.83"N)
Lower Right ( 27.1213440, 38.3698060) ( 27d 7'16.84"E, 38d22'11.30"N)
Center ( 27.1205760, 38.3705740) ( 27d 7'14.07"E, 38d22'14.07"N)
Band 1 Block=512x5 Type=Byte, ColorInterp=Red
Band 2 Block=512x5 Type=Byte, ColorInterp=Green
Band 3 Block=512x5 Type=Byte, ColorInterp=Blue
Hi @williamobrein,
My data is in GeoTIFF format. So I guess it's okay.
Yes! It should be!
I'm experimenting with Solaris's pre-trained models. I use XD_XD because it is more successful than the others. I couldn't find any information about the channel order at the link you specified. Is that the training input information you're talking about? I'm sorry if I missed it. I think you mentioned that you trained the images as BGR in an issue I read before (#212).
My apologies for the confusion - I thought it was delineated there, but looking again I see I'm incorrect. Yes, it's BGR. There's a SwapChannels augmentation in the dev branch (not yet merged into master) that you could use to convert your images to the correct channel order in-line if you want.
I don't need to do anything if the pre-trained models are preprocessing the data. Just prepare the training and test sets as you specified in the document and edit the configuration file.
That's correct - the only thing you'll need to do is calculate the mean and standard deviation for each channel (then divide by the bit depth, because that's how albumentations works, which is what solaris uses) and replace those values in the config file. You don't need to do the pre-processing yourself - it's done in-line.
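For concreteness, a rough sketch of that calculation for a single tile (assuming rasterio; max_val is a placeholder - use the maximum value for your data's bit depth):

import numpy as np
import rasterio

max_val = 65535.0  # placeholder: max value for 16-bit imagery

with rasterio.open("tile.tif") as src:
    arr = src.read().astype(np.float64)  # shape: (bands, rows, cols)

# per-channel statistics, scaled the way the config's Normalize block expects
channel_mean = arr.mean(axis=(1, 2)) / max_val
channel_std = arr.std(axis=(1, 2)) / max_val
print(channel_mean, channel_std)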
I have reviewed the documents in detail. However, only the mask is created in the mask creation steps. I used the following code to export my mask as a tif. I guess there's nothing wrong here? Is it okay if the format is tif?
Yes, TIF works! That should've been specified there, sorry it's unclear.
My dataset is 3-channel(RGB). Does the Spacenet dataset have RGBA channels? So the 4th channel is Alpha?
Actually the 4th channel in the SpaceNet Atlanta dataset is Near-IR, which gets dropped.
If the input of the model accepts 3 channels, I think I need to remove your preprocessing process. Am I right? Can you help with this? Or is there any other way you suggest I should follow?
Yes, you need to remove the DropChannel augmenter from the pipeline - just delete those lines from the config file you're using. Make sure to do it in the training_augmentations, validation_augmentations, and inference_augmentations pieces of the config file. If you're using the version of Solaris available in the dev branch, you can also add a SwapChannels augmenter to swap the R and B channels in-line rather than having to re-make all of your image files. So, something like this should work:
training_augmentation:
  augmentations:
    SwapChannels:
      first_idx: 0
      second_idx: 2
      p: 1.0
    HorizontalFlip:
      p: 0.5
    RandomRotate90:
      p: 0.5
    RandomCrop:
      height: 512
      width: 512
      p: 1.0
    Normalize:  # you'll need to fix these values for your own dataset
      mean:
        - 0.006479
        - 0.009328
        - 0.01123
      std:
        - 0.004986
        - 0.004964
        - 0.004950
      max_pixel_value: 65535.0
      p: 1.0
  p: 1.0
  shuffle: true
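Once the config is edited, training follows the usual solaris flow, roughly like this (the config filename is a placeholder for your edited copy):

import solaris as sol

config = sol.utils.config.parse("xdxd_spacenet4.yml")  # your edited config
trainer = sol.nets.train.Trainer(config)
trainer.train()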
If this addresses these issues let me know (or just close the issue). Thanks!
Sorry for the late reply - I had other work to do. Thank you for the answers. I can manually convert my images to BGR for now, but I would appreciate it if SwapChannels gets merged into the master branch as soon as possible!
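For the manual conversion, something like this rough sketch should work (assuming rasterio and RGB GeoTIFF tiles; file names are placeholders):

import rasterio

with rasterio.open("tile_rgb.tif") as src:
    profile = src.profile
    data = src.read()  # (bands, rows, cols), band order R, G, B

data = data[[2, 1, 0], :, :]  # reorder to B, G, R

with rasterio.open("tile_bgr.tif", "w", **profile) as dst:
    dst.write(data)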
I've solved all my previous questions. Thanks for your help. But I have new questions.
I'll calculate the mean and standard deviation for each channel, divide by the bit depth, and start the training. After that I'll let you know and close the issue. Don't worry about it!
gdalinfo -mm -stats data_bgr.tif
My values are as follows just for one channel.
STATISTICS_MEAN=110.12675671955
STATISTICS_STDDEV=46.520330868899
When I divide by the bit depth (my data is 24-bit, so the bit depth is 2^24?), I get this result:
Mean / Bit depth(24 bit) = 4.58861486331
Std / Bit depth(24 bit) = 1.93834711954
Is there a format for these values or should I write the result directly? Your values seem to be written in a specific format - can you explain if I'm wrong? How exactly does this calculation work? Can you give me an example?
Can we give the model input data in ECW format (1:36 ECW compression ratio)? It's about size. As I mentioned above, my data is actually ECW and I'm converting it into GeoTIFFs. One ECW is approximately 50 MB, so the converted files are very large, and the conversion process takes a long time. (If I'm not mistaken, SpaceNet's pixel size is 0.0000027 arc; the pixel size of my data is 0.000003 arc.)
Can COGTIFF (Cloud Optimized GeoTIFF) format be an alternative? The file size ends up relatively smaller than the others.
Should I change the resolution and quality? Which setup gives the best result? For the model input we divide everything into 512x512 tiles, yes, but should we do any data processing before that? Any suggestions? Do you think this affects our results?
Should I resample and change the quality (pixel size)? Does this affect performance? And how should I resample (nearest neighbor, etc.)?
Did you include non-building tiles in the training? I thought you didn't.
When I divide by the bit depth (my data is 24-bit, so the bit depth is 2^24?), I get this result:
Mean / Bit depth (24 bit) = 4.58861486331
Std / Bit depth (24 bit) = 1.93834711954
Is there a format for these values or should I write the result directly? Your values seem to be written in a specific format - can you explain if I'm wrong? How exactly does this calculation work? Can you give me an example?
Apologies, I wasn't super clear there:
Step 1. Calculate the mean and std as you did.
Step 2. Divide those values by the maximum value for your bit depth (so 2^24 = 16777216).
So, in your case for the mean, 110.12675671955/16777216 = 0.00000656406
Do your images use the full 24-bit range? That is, are there pixel values approaching 16777216? If not, you could truncate to 16-bit (maximum pixel value 65535) or 8-bit (max value 255) and re-save the images, which would make them substantially smaller files. 24-bit images are indeed huge.
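If truncation makes sense for your data, something like this with GDAL's command-line tools could do it (the -scale bounds are placeholders - replace them with your data's actual value range):

gdal_translate -ot Byte -scale 0 16777215 0 255 tile_24bit.tif tile_8bit.tif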
Can we give the model input data in ECW format (1:36 ECW compression ratio)? It's about size. As I mentioned above, my data is actually ECW and I'm converting it into GeoTIFFs. One ECW is approximately 50 MB, so the converted files are very large, and the conversion process takes a long time. (If I'm not mistaken, SpaceNet's pixel size is 0.0000027 arc; the pixel size of my data is 0.000003 arc.)
We'd love to accommodate more image types as part of solaris, but I don't know if the maintainers will have time to add functionality for ECW support anytime soon. If this is something you're interested in adding yourself, you're welcome to do so - see the contributing guidelines for details. If you personally won't have time to do so, I encourage you to create a separate issue delineating what you would like to have done, both for our tracking purposes and so another contributor could potentially take up the task.
Can COGTIFF (Cloud Optimized GeoTIFF) format be an alternative? The file size ends up relatively smaller than the others.
Yes, this is something we're exploring (see #163). Development time has limited our ability to implement it so far, however...another area where a community contributor would be welcome to step in.
Should I change the resolution and quality? Which setup gives the best result? For the model input we divide everything into 512x512 tiles, yes, but should we do any data processing before that? Any suggestions? Do you think this affects our results?
Resolution, re-scaling, and how they impact model performance remain to some degree an open question in the field. If you're trying to directly use the pre-trained weights from XD_XD's model without any fine-tuning, you will want the pixel size to be very similar to the SpaceNet Atlanta data (0.5 m/px). If you're fine-tuning the model weights using your own dataset, or re-training completely, it shouldn't matter as much.
Should I resample and change the quality (pixel size)? Does this affect performance? And how should I resample (nearest neighbor, etc.)?
I don't have a great answer for this beyond what I said in response to the last question. My personal opinion: since re-sampling will almost always result in some loss of information from the source data, it should be avoided when possible; however, I'm not aware of any studies that have directly examined the validity of my assumption. If you do resample, I recommend bilinear or bicubic resampling. For example, in the SpaceNet Atlanta dataset, all of the different collects were re-sampled to 0.5 m/px using bilinear resampling to ensure consistency within the dataset.
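For reference, that kind of resampling with GDAL would look roughly like this (the target resolution is a placeholder; -tr is in the units of the CRS, so degrees per pixel for EPSG:4326):

gdalwarp -tr 0.0000027 0.0000027 -r bilinear tile.tif tile_resampled.tif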
Did you include non-building tiles in the training? I thought you didn't.
Yes we did - we included the full SpaceNet Atlanta training set, which includes non-building tiles. Particularly if your testing dataset is likely to include non-building tiles, this can be valuable.
Apologies, I wasn't super clear there:
Step 1. Calculate the mean and std as you did.
Step 2. Divide those values by the maximum value for your bit depth (so 2^24 = 16777216).
So, in your case for the mean, 110.12675671955/16777216 = 0.00000656406
Thanks, that's what I did. Nice to confirm. Why do we give the maximum pixel value, standard deviation, and mean value? I've never seen anything like that in traditional object detection models. Can you explain? Or do you have a source to suggest?
Spacenet Atlanta data is not RGB? How can you use 16 bits? I think you need to use 24 bits. (R[8]G[8]B[8] = 24) What's the difference?
There are several tif files in the SpaceNet Atlanta dataset. How did you find their mean value and standard deviation? Did you merge them into one piece, or do you have a different method? I have a lot of tif files, and I don't know how to calculate the standard deviation and mean of each one separately and then combine them. Merging them all into one file takes a long time and requires a lot of processing power.
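(One way to avoid merging everything into a single file - just a sketch, and not necessarily how the SpaceNet statistics were produced - is to accumulate running sums per channel and derive the global mean/std from those, assuming the tiles are readable with rasterio:)

import glob
import numpy as np
import rasterio

n = 0      # total pixel count per channel
s = None   # running sum of pixel values per channel
ss = None  # running sum of squared pixel values per channel

for path in glob.glob("tiles/*.tif"):  # placeholder tile directory
    with rasterio.open(path) as src:
        arr = src.read().astype(np.float64)  # (bands, rows, cols)
    if s is None:
        s = np.zeros(arr.shape[0])
        ss = np.zeros(arr.shape[0])
    n += arr.shape[1] * arr.shape[2]
    s += arr.sum(axis=(1, 2))
    ss += (arr ** 2).sum(axis=(1, 2))

mean = s / n
std = np.sqrt(ss / n - mean ** 2)  # global per-channel stats across all tiles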
Do your images use the full 24-bit range? That is, are there pixel values approaching 16777216? If not, you could truncate to 16-bit (maximum pixel value 65535) or 8-bit (max value 255) and re-save the images, which would make them substantially smaller files. 24-bit images are indeed huge.
Yes you are right! 24-bit images are too large. I will consider your suggestions on this issue. I'm gonna check my images and make edits.
We'd love to accommodate more image types as part of solaris, but I don't know if the maintainers will have time to add functionality for ECW support anytime soon. If this is something you're interested in adding yourself, you're welcome to do so - see the contributing guidelines for details. If you personally won't have time to do so, I encourage you to create a separate issue delineating what you would like to have done, both for our tracking purposes and so another contributor could potentially take up the task.
I understand you very well. I don't have time to improve it anytime soon, but I can open an issue like you said. That would be useful for anyone who wants to add it.
Yes, this is something we're exploring (see #163). Development time has limited our ability to implement it so far, however...another area where a community contributor would be welcome to step in.
I understand, I'm going to do some research on this. I'll let you know if I find anything.
Resolution, re-scaling, and how they impact model performance remain to some degree an open question in the field. If you're trying to directly use the pre-trained weights from XD_XD's model without any fine-tuning, you will want the pixel size to be very similar to the SpaceNet Atlanta data (0.5 m/px). If you're fine-tuning the model weights using your own dataset, or re-training completely, it shouldn't matter as much.
Thank you, that's what I thought. I just wanted to get your opinion. If I end up training completely from scratch, I might try it.
I don't have a great answer for this beyond what I said in response to the last question. My personal opinion: since re-sampling will almost always result in some loss of information from the source data, it should be avoided when possible; however, I'm not aware of any studies that have directly examined the validity of my assumption. If you do resample, I recommend bilinear or bicubic resampling. For example, in the SpaceNet Atlanta dataset, all of the different collects were re-sampled to 0.5 m/px using bilinear resampling to ensure consistency within the dataset.
Actually, I didn't resample because I thought there would be data loss, but I wanted to ask in case you knew of an academic resource. My data is consistent on this, so I don't need any resampling at this time.
Yes we did - we included the full SpaceNet Atlanta training set, which includes non-building tiles. Particularly if your testing dataset is likely to include non-building tiles, this can be valuable.
What is the purpose of including non-building images? While the model learns from images of buildings, doesn't it also learn the rest of the areas? I'm confused here. Is the logic different from object recognition?
Thanks, that's what I did. Nice to confirm. Why do we give the maximum pixel value, standard deviation, and mean value? I've never seen anything like that in traditional object detection models. Can you explain? Or do you have a source to suggest?
A fairly common practice for many computer vision models is to either z-score pixel intensities or normalize them to a 0-1 range. In this case, we're using the albumentations library to run z-scoring. albumentations.Normalize expects the mean and standard deviations to be scaled per the image's max pixel intensity - that's just the way the library is set up.
Normalization is important to achieve consistent performance across images from different sensors/collects. For example, SpaceNet Atlanta's pixel intensities are mostly between 0 and 1200; if you provided images that ranged in pixel values from 100 to 200, the model would likely have no idea how to generate valid predictions. We show this in the 4th notebook in the Solaris FOSS4G tutorial.
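For illustration, that step corresponds roughly to the following in albumentations (the values are just the ones from the config example earlier in this thread, not the XD_XD statistics):

import albumentations as A

# mean/std are per-channel values already divided by max_pixel_value
normalize = A.Normalize(
    mean=(0.006479, 0.009328, 0.01123),
    std=(0.004986, 0.004964, 0.004950),
    max_pixel_value=65535.0,
    p=1.0,
)
# applies (img - mean * max_pixel_value) / (std * max_pixel_value)
# e.g. zscored = normalize(image=img)["image"]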
Spacenet Atlanta data is not RGB? How can you use 16 bits? I think you need to use 24 bits. (R[8]G[8]B[8] = 24) What's the difference?
Apologies, I was providing the per-channel bit depth. Every channel (R, G, B, and near-IR) is encoded as a 16-bit value. It looks like I misinterpreted your description of your image - I took 24-bit to mean 24 bits per channel.
What is the purpose of including non-building images? While the model learns from images of buildings, doesn't it also learn the rest of the areas? I'm confused here. Is the logic different from object recognition?
Though I haven't explored this in great detail personally, my expectation is that during training, the model learns the distribution of number of building pixels per image to some degree. As U-Nets utilize both whole-image information (at the middle layers) as well as fine-grained information (in the beginning and end), I could envision a model learning that it should never predict zero building pixels if the training set it's provided never has zero building pixels. Generally, best practices recommend matching your training and testing datasets' distributions to one another, and if your test set includes building-free images, we believe that training should too. Segmentation models do indeed work differently from object detection models, which generally provide proposals and classification and then use NMS to filter out bad predictions. I could envision these two handling variation between training and testing set distributions differently.
A fairly common practice for many computer vision models is to either z-score pixel intensities or normalize them to a 0-1 range. In this case, we're using the albumentations library to run z-scoring. albumentations.Normalize expects the mean and standard deviations to be scaled per the image's max pixel intensity - that's just the way the library is set up.
I understand. I need to study Albumentations. Thanks for the lead.
Normalization is important to achieve consistent performance across images from different sensors/collects. For example, SpaceNet Atlanta's pixel intensities are mostly between 0 and 1200; if you provided images that ranged in pixel values from 100 to 200, the model would likely have no idea how to generate valid predictions. We show this in the 4th notebook in the Solaris FOSS4G tutorial.
I know that normalization is important in image detection, but I don't know about z-scoring. I think I need to figure that out. I'll check the notebooks. Thank you!
Apologies, I was providing the per-channel bit depth. Every channel (R, G, B, and near-IR) is encoded as a 16-bit value. It looks like I misinterpreted your description of your image - I took 24-bit to mean 24 bits per channel.
Now everything is clear. I'm going to review my images, make them 16 bits per channel, and start training like that. Thank you!
Though I haven't explored this in great detail personally, my expectation is that during training, the model learns the distribution of number of building pixels per image to some degree. As U-Nets utilize both whole-image information (at the middle layers) as well as fine-grained information (in the beginning and end), I could envision a model learning that it should never predict zero building pixels if the training set it's provided never has zero building pixels. Generally, best practices recommend matching your training and testing datasets' distributions to one another, and if your test set includes building-free images, we believe that training should too. Segmentation models do indeed work differently from object detection models, which generally provide proposals and classification and then use NMS to filter out bad predictions. I could envision these two handling variation between training and testing set distributions differently.
I thought the images containing buildings already include other areas - I mean areas without buildings - and I guess that's how traditional object recognition works, though it can change depending on the model. But what you say makes sense. Of course, there will be parts of my test set that do not include buildings. I'm going to edit my data based on what we discussed!
I think we can close this. Thanks again for everything!
Can we run solaris preprocessing on the GPU? Because it takes too long on the CPU.
Do you mean pre-processing in terms of tiling or image augmentation before it's fed into the model?
Either way, the present answer is no but an enterprising user would be welcome to make a PR.
If you want to encourage that, I'd recommend creating a new issue here for it - I'm going to close this one since we've moved fairly far afield from the original question.
Hello, first of all thank you for developing solaris. I've been working on object detection for a long time, but I'm new to GitHub, so I'm sorry for any mistakes! I tried to train with my own data and received an error:
IndexError: index 3 is out of bounds for axis 2 with size 3
As you mentioned in the documentation, I divided the satellite image (in tif format) into tiles, then divided the geojson files in the same way. I did the mask creation process and created my mask (footprint mask) in tif format as well. Then I created the training and test csv files as you specified, and I edited the configuration file of the pre-trained model xdxd_spacenet4.
Error Message
When I run this command, I get the error above.
How can I solve this problem? Anybody have any ideas? Thanks in advance.
What should I do?
I have some questions. It would be very helpful if you could help.
Environment information
solaris version: 0.1.3