5th place solution

Input Preprocessing and Input Size: Since I used ImageNet-pretrained models, which require three-channel color input, each input image in this competition was duplicated twice along the channel dimension to form a color image. No other preprocessing was performed. The input image size for model training was fixed to the original size of the images, 256x1600. During the early stage of model development I briefly tried cropping to a smaller size, e.g. 224x768 for training while keeping the original size for testing, but the public leaderboard result with cropped input turned out to be slightly worse. Another reason not to use a cropped version for training was that it would give no benefit in inference speed.
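A minimal sketch of the channel duplication described above, assuming the images are loaded as single-channel arrays with OpenCV (the function name `to_three_channels` is mine, not from the write-up):

```python
import cv2
import numpy as np

def to_three_channels(path: str) -> np.ndarray:
    """Load a grayscale steel image and replicate it into 3 channels
    so it matches the input expected by ImageNet-pretrained encoders."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)        # (256, 1600)
    return np.repeat(gray[..., None], 3, axis=-1)        # (256, 1600, 3)
```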
Image Augmentation: Image augmentation is one of the key factors for successful training of deep neural nets. This was particularly true for this competition because of the label imbalance: there are many more negative samples than positive ones. When using random input sampling to form training batches, the model tended to predict positive samples with less confidence after the first few dozen epochs, which means smaller thresholds (rather than 0.5) would be needed to segment the positive regions, and these thresholds would be extra hyperparameters I wanted to avoid. This problem gradually disappears once the model is trained long enough, but the risk of overfitting (due to overtraining) increases, which can be mitigated with heavy augmentation. The augmentations I used include crop and resize back, contrast and brightness, gamma correction, blurring and sharpening, scale and shift (no rotation or shearing), and horizontal and vertical mirroring. The crop-and-resize-back augmentation (the randomcrop function in any of the train.py files) first crops a random portion horizontally and vertically, for example 90% horizontally and 80% vertically, and then resizes back to the original size to introduce some distortion. The random shift augmentation is a little unusual: instead of a small shift (typically at most 20%), I used a 100% random shift and made the border filling circular (cv2.BORDER_WRAP). I expected this circular shift to simulate the manufacturing pipeline. At test time I used horizontal flipping, vertical flipping, or both as augmentation, giving four variants in total including the original version.
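A rough sketch of the two less common augmentations described above (circular shift via cv2.BORDER_WRAP and crop-and-resize-back); the function names and exact ranges are my own illustration, not the author's code:

```python
import random
import cv2
import numpy as np

def circular_shift(img: np.ndarray) -> np.ndarray:
    """Shift by up to 100% of width/height with circular (wrap-around) borders,
    imitating a different capture moment on the moving conveyor."""
    h, w = img.shape[:2]
    dx, dy = random.randint(0, w - 1), random.randint(0, h - 1)
    m = np.float32([[1, 0, dx], [0, 1, dy]])
    return cv2.warpAffine(img, m, (w, h), borderMode=cv2.BORDER_WRAP)

def crop_resize_back(img: np.ndarray, min_frac: float = 0.8) -> np.ndarray:
    """Crop a random horizontal/vertical portion, then resize back to the
    original size to introduce a mild distortion."""
    h, w = img.shape[:2]
    ch = int(h * random.uniform(min_frac, 1.0))
    cw = int(w * random.uniform(min_frac, 1.0))
    y0, x0 = random.randint(0, h - ch), random.randint(0, w - cw)
    crop = img[y0:y0 + ch, x0:x0 + cw]
    return cv2.resize(crop, (w, h), interpolation=cv2.INTER_LINEAR)
```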
Model Types: I trained eight Unet models in total and averaged their results. All of them were based on qubvel's github: https://github.com/qubvel/segmentation_models.pytorch. The encoders used were inceptionresnetv2, efficientnet-b4, se_resnext50, and se_resnext101, all pretrained on ImageNet, with two models trained per encoder type. Due to the nature of the evaluation metric, the benefit of correctly detecting a negative sample is larger than that of detecting and segmenting a positive sample, so removing false positives is the key to a high score. To achieve this, I used the max over all predicted pixel probabilities as a classification-like score, thresholded the averaged (across eight models) max probabilities, and ruled out false positives. I noticed that many participants trained dedicated classifiers to perform this false-positive removal; I did not, because of the extra cost of training the classifiers and the extra time to run them during testing. Note that to further improve performance, one could use the top K (K>1) pixel probabilities instead of a single max probability for a more reliable estimate, but I did not go that far during the competition.
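A small sketch of the max-probability false-positive filter described above, assuming `probs` is the sigmoid output of the segmentation ensemble; the variable names and the example threshold are illustrative only:

```python
import torch

def filter_false_positives(probs: torch.Tensor, cls_threshold: float = 0.6) -> torch.Tensor:
    """probs: (B, C, H, W) averaged pixel probabilities from the ensemble.
    Use the per-class max pixel probability as a pseudo classification score;
    zero out any class whose max probability falls below the threshold."""
    cls_score = probs.amax(dim=(2, 3))                        # (B, C)
    keep = (cls_score > cls_threshold).float()[..., None, None]
    return probs * keep
```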
Training: Every model was first trained for 40-50 epochs using the conventional BCE loss. After the model weights reached a good state, I reduced the learning rate and switched to the Lovasz hinge loss (https://github.com/bermanmaxim/LovaszSoftmax) because of its robustness to imbalanced datasets. The models were trained until convergence, and then further tuned for 10 epochs using pseudo-labeling on the public test data, which also helped improve performance a little.
Testing and Postprocessing: Early in the competition I noticed that my cross-validation results were much better than the public leaderboard. Later, some participants posted in the Kaggle forum that there are duplicate or near-duplicate images in the training set, possibly generated by the same manufacturing pipelines, and that the labels are very likely correlated between duplicates. On the one hand, this breaks the i.i.d. assumption when applying machine learning models, resulting in over-optimistic validation estimates due to label leakage; on the other hand, we could take advantage of the duplicates by jointly predicting clusters of duplicated images instead of predicting single images independently. However, the fact that cross-validation performance on the training set was much better than on the public test set indicated that the duplicates in the training and public test sets may not perfectly overlap, which implied new manufacturing pipelines in the public test set. This difference between the train and public test sets could hardly be a coincidence; it looked intentional, mimicking a real-world scenario that required the models to generalize to new data rather than exploit label leakage. So I asked myself whether the private test set would be yet another new set. My guess was yes, because if the public and private test sets were very much alike, participants could still exploit label leakage by probing the public leaderboard. So some method to take advantage of the duplicates within the private set was necessary. For this, I used efficientnet-b4's bottleneck features (because of their lower dimensionality), taken from the last output of the encoder, as features for each image. I then grouped the duplicated images into clusters by running k-means on the features of the test data. In each cluster, I averaged the max pixel probabilities mentioned earlier. For each test image, the averaged max probabilities of its cluster were compared with pre-defined thresholds ([0.6, 0.6, 0.5, 0.5] for the four classes); the image was declared negative if below these thresholds. Otherwise, the total number of positive pixels (binarized with the threshold 0.5) was compared with pre-defined thresholds ([547, 715, 1145, 2959], corresponding to the 2% quantile thresholds found on the train data) to further remove false positives. Only the test images that survived these two rounds of false-positive removal went to the run-length encoding step to generate the final positive masks.
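A hedged sketch of the duplicate-aware filtering described above. I am assuming `features` are pooled encoder outputs per test image, `max_probs` are the per-class max pixel probabilities, and `pixel_counts` are the per-class counts of pixels above 0.5; the number of clusters and helper names are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

CLS_THR = np.array([0.6, 0.6, 0.5, 0.5])      # per-class "classification" thresholds
AREA_THR = np.array([547, 715, 1145, 2959])   # 2% quantile pixel-count thresholds

def duplicate_aware_filter(features, max_probs, pixel_counts, n_clusters=200):
    """features: (N, D) encoder bottleneck features for the N test images.
    max_probs: (N, 4) per-class max pixel probabilities from the ensemble.
    pixel_counts: (N, 4) number of pixels above 0.5 per class.
    Returns a boolean (N, 4) array of predictions kept as positive."""
    labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(features)
    cluster_probs = np.zeros_like(max_probs)
    for c in range(n_clusters):
        idx = labels == c
        cluster_probs[idx] = max_probs[idx].mean(axis=0)  # share evidence within a cluster
    keep = (cluster_probs > CLS_THR) & (pixel_counts > AREA_THR)
    return keep
```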
4th place solution

Hi everyone, and congratulations to all the participants!
We all entered this competition a little less than a month ago, right after the end of the APTOS competition. We were solving it separately and merged only about a day before the merger deadline, because of the mounting feeling of despair. It was the first competition where a plain 'fit_predict' could not get into the bronze zone. Besides, heavy encoders were virtually no more successful than a simple resnet34. First, here are the things that we tried and that did not work at all:
I tried:
- Adding hard negative mining by resampling the dataset every epoch, with weights chosen as the inverse of the per-image Dice score. Absolutely no difference;
- Adding ArcFace to embeddings. It should help to distinguish classes better. One more nope;
- Training a multi-stage network (inspired by pose detectors). It should help mimic bad markings. Performed worse than single-stage;
- Adding label smoothing. Nope again.

@bloodaxe tried:
- Adding mixup and Poisson blending to increase the number of images with defects;
- Training on double-sized crops: 256x1600 -> crop(256x512) -> resize(512x1024). Nope;
- Adding the result of anisotropic segmentation to the input of the neural network. No gain;
- Training HRNetV2 at full resolution. 22 hours with 8xV100 and no better than ResNet34.

Now, things that actually worked:
- Training a two-headed NN for segmentation and classification, combining the heads at inference time with soft gating (mask.sigmoid() * classifier.sigmoid()) - see the sketch below;
- Focal loss / BCE + Focal loss;
- Training with grayscale instead of gray-RGB;
- FP16 with Catalyst and Apex.

I trained only single-fold models and @bloodaxe trained 5-fold CV.
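A minimal sketch of the two-headed segmentation + classification idea with soft gating at inference time; the architecture details (single-tensor encoder output, pooling classifier) are my own simplification, only the gating formula comes from the write-up:

```python
import torch
import torch.nn as nn

class TwoHeadedNet(nn.Module):
    """Shared encoder with a segmentation decoder and an image-level classifier head."""
    def __init__(self, encoder: nn.Module, decoder: nn.Module, enc_channels: int, n_classes: int = 4):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(enc_channels, n_classes)
        )

    def forward(self, x):
        feats = self.encoder(x)
        mask_logits = self.decoder(feats)          # (B, n_classes, H, W)
        cls_logits = self.classifier(feats)        # (B, n_classes)
        return mask_logits, cls_logits

def gated_masks(mask_logits: torch.Tensor, cls_logits: torch.Tensor) -> torch.Tensor:
    # Soft gating: per-class mask probabilities scaled by the classifier probability.
    return mask_logits.sigmoid() * cls_logits.sigmoid()[..., None, None]
```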
Our individual solutions were no better than the bottom of the silver zone on the public LB, but then we teamed up… and got a bit higher in the silver zone!
Our best (and final) ensemble consisted of 9 models with densenet201, efficientnet-b5, resnet34, and se-resnext50 encoders, some with FPN decoders and some with Unet. We added 3-flip TTA, averaged the logits of the models, and applied soft gating. We binarized masks with a 0.55 threshold and zeroed out masks smaller than 256 pixels. Our total runtime for private+public is ~30 min.
Hardware: We used servers from FastGPU.net with 8xV100 and 4xV100, which greatly reduced our experiment cycle length.
Some speculation about our rise on private LB:
- We did not overfit to the public LB;
- We chose our models by the Dice score without empty masks, relying on our classifiers and soft gating for empty masks;
- We had different seeds/folds/models. Our private scores before the merge are not that great (0.89-0.90), but they are much stronger when combined.

Some moral by @bloodaxe:
- Keep going. Even if it seems that everything is lost and there is no hope, you can still push a little bit forward and learn something useful along the way;
- Team up! Exchanging ideas is really beneficial!
3rd Place Solution Summary

- Basic Model: Unet, Feature Pyramid Network (FPN)
- Encoder: efficientnet-b3, efficientnet-b4, efficientnet-b5, se-resnext50
- Loss: Focal Loss
- Optimizer: Adam, init lr = 0.0005
- Learning Rate Scheduler: ReduceLROnPlateau (factor=0.5, patience=3, cooldown=3, min_lr=1e-8)
- Image Size: 256x800 for training, 256x1600 for inference
- Image Augmentation: horizontal flip, vertical flip
- Sampler: Weighted Sampler
- Ensemble Model: I simply average the output probabilities of the 9 models listed below to obtain the final mask probability, without TTA
- FPN + efficientnet-b5 + concatenation of feature maps
- FPN + efficientnet-b4
- Unet + efficientnet-b4, with pseudo-labeled data added to the training data
- Unet + efficientnet-b4, trained with heavy image augmentation
- Unet + efficientnet-b4 + SCSE layer
- Unet + efficientnet-b4 + SCSE layer, with pseudo-labeled data added to the training data
- Unet + efficientnet-b4 + Mish layer
- Unet + efficientnet-b3
- Unet + se-resnext50

Thresholds:
- Label Thresholds: 0.7, 0.7, 0.6, 0.6
- Pixel Thresholds: 0.4, 0.4, 0.4, 0.4
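A sketch of how label and pixel thresholds like the ones above are typically applied to the averaged probabilities; this is my interpretation of the write-up, not the author's code, and the array shapes are assumptions:

```python
import numpy as np

LABEL_THR = np.array([0.7, 0.7, 0.6, 0.6])   # per-class image-level thresholds
PIXEL_THR = np.array([0.4, 0.4, 0.4, 0.4])   # per-class pixel-level thresholds

def ensemble_masks(prob_maps):
    """prob_maps: list of (4, H, W) probability maps, one per model in the ensemble.
    Average them, drop a class entirely if its peak probability is below the
    label threshold, otherwise binarize with the pixel threshold."""
    avg = np.mean(prob_maps, axis=0)                       # (4, H, W)
    keep = avg.max(axis=(1, 2)) > LABEL_THR                # (4,)
    masks = avg > PIXEL_THR[:, None, None]                 # (4, H, W) boolean
    return masks & keep[:, None, None]
```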
The key points that I think helped me win a gold:
- use the code pipeline https://github.com/PavelOstyakov/pipeline
- make sure there is diversity among the models in the ensemble
- using a one-stage model pipeline kept me from training a classification model separately and from tuning too many hyper-parameters. Simple is best.
7th place solution

Our team member xuyuan and I had already participated in a segmentation challenge (TGS Salt) and we thought this one would be a little easier, so we joined this competition really late, 4-5 weeks ago. Those few weeks were not enough to try out many things. We didn't even have time to train a classifier properly, so we stuck to a pure segmentation ensemble.
Our final ensemble contained 3 efficientnet unet models:
- 2 folds of efficientnet-b7 Unet, trained on 256x384 with mixup and label smoothing. The best public score was 0.91288. We don't know the private score because only the csv file was submitted. In the final ensemble, these b7 models were fine-tuned at the full resolution.
- 1 fold of efficientnet-b4 Unet, trained on 256x384 and fine-tuned at the full resolution.
- 2 folds of efficientnet-b0 Unet, trained on 256x384 and fine-tuned at the full resolution. The best one got a 0.91284 public and 0.89719 private score.

For the ensemble we tried many thresholds. For the final submission we used [0.8, 0.99, 0.75, 0.75], which produced 87, 0, 602, 111 images per class. Another submission, which used [0.8, 0.99, 0.78, 0.75], produced 87, 0, 577, 111 images per class. That one would have had a private score of 0.90819 and a public score of 0.92016, which would have been enough for second place. So predicting class 3 less often seems to be a trick.
The minimum area was 100 and the pixel threshold 0.3.
What worked for us so far:
- AdamW with the Noam scheduler
- GitLab pipeline to manage training experiments
- mixup and label smoothing
- fine-tuning at the full resolution
- random sampler
- cross-entropy loss

What didn't work:
- SGD
- SWA (it worked really well for the salt competition, but for this competition it didn't work at all)
- pseudo labeling (trained on the last 2-3 days)
- training a classifier
- balanced sampler
10th place solution

Overall approach
1) Separate models for different defects (since the number of classes is small, this is feasible)
2) Not predicting defect2
3) Training on full size images
4) 3-fold validation
Encoder: SE-ResNeXt-50 / EfficientNet-b3, pretrained on ImageNet. The first model still seems like a good tradeoff between size/speed and accuracy, while EfficientNet is relatively new and I wanted to try using it for segmentation tasks. Ironically, its implementation in PyTorch is not very efficient (at least compared to the one in TensorFlow, see this github issue). This limited the usage of “upscaled“ versions of it, and I settled on b3, which was close to seresnext50, although the batch size that I was able to fit into GPU RAM was smaller (3 vs 4), which potentially decreased batchnorm performance.
Model: Experimented with Unet, FPN and PSPNet from the great Segmentation Models by @pavel92. Unet and PSPNet performed noticeably worse with default settings and were discarded from the experiments at the early stages of the competition.
I compared FPN with several variations that I considered reasonable, listed below.
- FPNA: vanilla FPN.
- FPNB: take not the 4 uppermost layers of the encoder, but the 4 lowermost. The intuition was that if a defect is a local artifact, then strong semantic information is unnecessary, and using higher-resolution layers instead should allow small defects to be located better. In my experiments it outperformed FPNA at 128x800 resolution, but underperformed at full resolution – perhaps the receptive field was insufficient.
- FPNC: take all (5) levels of the encoder, which might help to handle different scale levels properly. Unfortunately the increase in GPU RAM usage and training time didn't translate into an improvement in performance in my tests.
- FPND: instead of summing the outputs of encoder layers, concatenate them (reducing the number of channels in the decoder layers' output to keep the final number of channels the same) - this is similar to hypercolumns, which worked well in the TGS competition. This version seemed comparable with FPNA, and outperformed it in defect4 detection, perhaps because that defect had significantly larger scale variation than the other 3.
I experimented a bit with deep supervision (with an aux output for each defect class presence), and even though I initially got some improvement with this scheme, I later managed to match that result without DS and decided to drop it to avoid additional complexity.
I also tried to train classifiers for rejecting empty images before segmentation to avoid false positives, but couldn't get better accuracy than I already had with the segmentation models. In the end the only classifier that I used was for defect1 (taken from Heng CherKeng's starter guide) – it seemed to have better precision than my defect1 models.
Augmentations: I did a “grid search” for optimal augmentations, training the same small model with a single augmentation for a fixed number of epochs and validating without augmentations. Surprisingly, the “base augmentation” (no augmentations) was always better, even compared to horizontal flip. This was not something I had encountered before, and my main guess was a lack of model capacity to handle the variation introduced by the augmentations. The proper way of testing this would have been to rerun the experiment with a bigger model, but I didn't have the time/resources to do it properly, so I assumed that horizontal flip doesn't destroy important information and shouldn't lead to overgeneralization, and that other “generic” augmentations should be treated carefully.
I also tried a few “custom“ augmentations: “SteelShift” (cyclically shift the image, handling the steel sheet border separately) to simulate capturing the steel sheet at a different time on a moving conveyor belt, and “BorderLoc” (encode the distance to the steel sheet border in a separate channel; black areas and images with no black areas are handled separately). The motivation for adding these 2 custom augmentations, targeted at images with visible steel sheet borders, was the fact that even though there was correlation between CV and public LB, CV was always better, and the only significant difference between the 2 sets was the fraction of images where the steel sheet border was visible.
The impact of these 2 “custom” augmentations was somewhat hard to assess. I included some models with “SteelShift” in the final ensemble to increase the diversity of predictions, and didn't include “BorderLoc”, because it would require either applying it to all images (and discarding models trained without it) or rewriting the inference code to apply different augmentations to different models in the ensemble.
I ended up with the following augmentation “modes”:
a) only horizontal flip
b) RandAugment - up to 2 augmentations chosen randomly from the list below (albumentations names):
HorizontalFlip, VerticalFlip, ISONoise, IAAAdditiveGaussianNoise, CoarseDropout, RandomBrightness, RandomGamma, IAASharpen, Blur, MotionBlur, RandomContrast

Loss: 0.75 BCE + 0.25 LovaszHinge. Lovasz worked significantly better than Dice. Perhaps a good idea would have been a separate fine-tuning stage where only Lovasz was used, but I never got around to testing that. I also wanted to try replacing BCE with Focal, but didn't have time and was also worried about proper coefficient balancing – Focal is usually significantly smaller than CE, and the magnitude might change during training.
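A minimal sketch of the weighted BCE + Lovasz hinge combination above, assuming the `lovasz_hinge` implementation from the bermanmaxim/LovaszSoftmax repository linked earlier is available as a local module; the 0.75/0.25 weights come from the write-up, everything else is illustrative:

```python
import torch
import torch.nn.functional as F
from lovasz_losses import lovasz_hinge  # from bermanmaxim/LovaszSoftmax (assumed local copy)

def bce_lovasz_loss(logits: torch.Tensor, targets: torch.Tensor,
                    bce_weight: float = 0.75, lovasz_weight: float = 0.25) -> torch.Tensor:
    """logits, targets: (B, C, H, W); targets are binary masks, one channel per defect class."""
    bce = F.binary_cross_entropy_with_logits(logits, targets)
    # lovasz_hinge expects (B, H, W) logits/labels, so apply it per class and average.
    lovasz = torch.stack([
        lovasz_hinge(logits[:, c], targets[:, c]) for c in range(logits.shape[1])
    ]).mean()
    return bce_weight * bce + lovasz_weight * lovasz
```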
Post-processing
- binarize averaged predictions using a threshold determined on the validation set
- zero out masks with a number of predicted pixels less than a threshold, also determined on the validation set

I also thought about filling holes and removing small components in the predictions, but after looking at predictions on the public test set I decided that it would not change the score much, and might backfire in an unexpected way on unseen data.
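A sketch of the two post-processing steps above; the threshold values are placeholders to be tuned on validation, not the author's:

```python
import numpy as np

def postprocess(prob: np.ndarray, pixel_thr: float = 0.5, min_area: int = 800) -> np.ndarray:
    """prob: (H, W) averaged probability map for one defect class.
    Binarize, then zero out the whole mask if it is too small to be trusted."""
    mask = (prob > pixel_thr).astype(np.uint8)
    if mask.sum() < min_area:
        mask[:] = 0
    return mask
```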
Pseudo-labeling: After many teams started to score more than 0.920 on the public LB, it became evident that at least some of them were using semi-supervised learning. It didn't seem extremely beneficial, because the public test set in this competition wasn't very big compared to the train set, but it should at least help reduce the domain mismatch between the train and public test sets, so I decided to give it a go after plateauing on the public LB. First I tried to include all predicted test negative samples (this was easier, since I didn't need the masks) in the train set (keeping folds unchanged and not including pseudo-labeled images in the validation set, to be able to track improvement) and fine-tune the models I had on this data. It didn't help much, so I moved on to including all predicted test set images the same way in the training set and training from scratch. This got a single model to the ensemble score on the public LB, but ensembling such models didn't provide any improvement – which made sense, since most likely the models were just “remembering” masks for all the samples from the public test set, and there was no variation between predictions on the public test set images. Finally, I decided to do it the “proper” way and included only confident samples in the training set. I measured confidence similarly to the approach of the TGS competition winners - count the number of pixels with confidence < 0.2 or > 0.8, and consider images with more than N confident pixels reliable. I noticed that no matter how certain/good the predictions are, there is always an uncertain region along the predicted defect boundary (which makes sense, since the ground truth annotation was fairly arbitrary in terms of boundary). To avoid this uncertain area affecting the “prediction confidence” estimate, I applied morphological operations to “grow” the confident regions – this “closed” the uncertain areas along the boundaries while still keeping the really uncertain defect predictions. This was done differently for each defect type, since they had different areas and boundary uncertainties. This approach helped me move past 0.917 on the public LB to 0.919, with most of the gain attributable to improved predictions for defect3. I also applied pseudo-labeling to defect1 and defect4, but was unable to get an improvement for defect1 (which makes sense, since my predictions for it were far from perfect), and got unclear results for defect4 (not really sure why; maybe due to the more skewed defect/non-defect distribution of the data).
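A rough sketch of the confidence-based selection described above; the 0.2/0.8 confidence bands come from the write-up, while the dilation kernel size and the `min_confident_frac` criterion are placeholders of mine:

```python
import cv2
import numpy as np

def is_reliable(prob: np.ndarray, kernel_size: int = 15,
                min_confident_frac: float = 0.98) -> bool:
    """prob: (H, W) predicted probability map for one defect class on a test image.
    A pixel is 'confident' if its probability is < 0.2 or > 0.8. Confident regions
    are grown morphologically so that the thin uncertain band along predicted
    defect boundaries does not disqualify otherwise confident predictions."""
    confident = ((prob < 0.2) | (prob > 0.8)).astype(np.uint8)
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    grown = cv2.dilate(confident, kernel)
    return grown.mean() >= min_confident_frac
```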
Closing thoughts: Thanks to the organizers of this competition and all participants - it was a really fun and intense experience! This is the second competition where I have used the DVC and albumentations libraries, and the first where I used Catalyst, Segmentation Models and Weights & Biases - I'm really enjoying these; if you haven't tried them yet, I strongly encourage you to.
The metric used the Dice loss; training used cross-entropy loss (a combination of nll_loss and log_softmax): crossentropyloss = nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
https://www.kaggle.com/bigironsphere/loss-function-library-keras-pytorch
https://blog.csdn.net/geter_CS/article/details/84857220?depth_1-utm_source=distribute.pc_relevant.none-task&utm_source=distribute.pc_relevant.none-task
Understanding the BCE (binary cross-entropy) loss function: https://zhuanlan.zhihu.com/p/35709485
12th place solution, with a video: https://www.youtube.com/watch?v=24GzxCrcupk
Pseudo label: https://zhuanlan.zhihu.com/p/34899693
TTA (test-time augmentation): at test time (not during training), flip the data vertically and horizontally and adjust contrast; the predictions must then be flipped back the same way (e.g. [:, :, :, ::-1]) before averaging the results. See mlcomp.contrib.transform.tta.
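A minimal flip-TTA sketch in PyTorch, my own illustration of the idea above rather than the mlcomp implementation:

```python
import torch

@torch.no_grad()
def flip_tta(model: torch.nn.Module, images: torch.Tensor) -> torch.Tensor:
    """images: (B, C, H, W). Average predictions over the original image and its
    horizontal flip, flipping the prediction back so the masks stay aligned."""
    pred = model(images).sigmoid()
    pred_hflip = model(torch.flip(images, dims=[3])).sigmoid()
    pred_hflip = torch.flip(pred_hflip, dims=[3])   # undo the flip on the mask
    return (pred + pred_hflip) / 2
```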
Q&A: why efficientNet? why b3? how to deal with the data imbalance?
why 3-fold validation?
Focal loss: https://zhuanlan.zhihu.com/p/49981234 ; [Daily] About Focal Loss (with implementation code): https://zhuanlan.zhihu.com/p/75542467 . Focal loss addresses classification problems and is built on top of cross-entropy.
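A short sketch of the binary focal loss built on top of cross-entropy, as referenced above; the gamma/alpha defaults are the commonly used values, not taken from these notes:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               gamma: float = 2.0, alpha: float = 0.25) -> torch.Tensor:
    """Binary focal loss: down-weights easy examples via the (1 - p_t)^gamma factor."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                       # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```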
(2) The four basic computer vision tasks (classification, localization, detection, segmentation): https://zhuanlan.zhihu.com/p/31727402 https://www.zhihu.com/question/36500536
Evolution of the U-Net family: https://zhuanlan.zhihu.com/p/44958351
Upsampling and transposed convolution in U-Net: https://zhuanlan.zhihu.com/p/48501100
Final reproduction plan: https://www.kaggle.com/c/severstal-steel-defect-detection/discussion/114410
Segmentation + classification. This is a visual segmentation task: given images of steel plate surfaces, distinguish four defect types and segment the location and region corresponding to each defect type (semantic segmentation). The evaluation metric is the Dice coefficient.
Evaluation uses the mean Dice coefficient, which measures the similarity between X and Y; the closer the Dice value is to 1, the higher the similarity. The final result is the mean of the Dice coefficients for each <ImageId, ClassId> pair in the test set. When both the prediction and the ground truth are empty, the Dice coefficient is defined to be 1.
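For reference, the Dice coefficient between a predicted pixel set X and a ground-truth set Y, as defined on the competition's evaluation page:

```latex
\mathrm{Dice}(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|}
```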
EncodedPixels uses run-length encoding on the pixel values: each pair is a start position plus a run length. '1 3 10 5' implies pixels 1,2,3,10,11,12,13,14. The pixels are numbered from top to bottom, then left to right: 1 is pixel (1,1), 2 is pixel (2,1), etc.
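A small decoder for this run-length format (1-indexed, column-major ordering as described above); written from the description here rather than taken from any particular public kernel:

```python
import numpy as np

def rle_decode(rle: str, height: int = 256, width: int = 1600) -> np.ndarray:
    """Decode 'start length start length ...' into a binary mask of shape (height, width)."""
    mask = np.zeros(height * width, dtype=np.uint8)
    nums = list(map(int, rle.split()))
    for start, length in zip(nums[0::2], nums[1::2]):
        mask[start - 1:start - 1 + length] = 1
    # pixels run top-to-bottom first, so reshape in Fortran (column-major) order
    return mask.reshape((height, width), order="F")
```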
a Kernels-only competition
Four defect types: ClassId = [1, 2, 3, 4]
Difficulties:
Roughly 6000 images contain zero defects, 6100 contain one defect, 200 contain two defects, and none contain three.
1st Place Solution https://www.kaggle.com/c/severstal-steel-defect-detection/discussion/114254

Classification
Classification is an important part of this competition. Even though classifiers can only slightly improve your score above 0.915 on the public LB, they can work as a preliminary screening step by filtering out around half of the images with no defects. This enabled us to ensemble more models in the segmentation part. We trained our classifiers with a random crop of 224x1568 and do inference on the full size; this random crop gives a slight improvement in accuracy.
- Augmentations: RandomCrop, Hflip, Vflip, RandomBrightnessContrast (from albumentations) and a customized defect blackout. Since this is a semantic segmentation task, we know exactly where the defects are. As a result, these defect components can be randomly blacked out, and the label for the image changes from 1 to 0 if all defects are blacked out. This augmentation indeed works on local CV and the public LB. Here are some graphs of the training process of a ResNet34 classifier.
- Batch size: 8 for efficientnet-b1, 16 for resnet34 (both accumulate gradients over 32 samples)
- Optimizer: SGD
- Model Ensemble: 3 x efficientnet-b1 + 1 x resnet34
- TTA: None, Hflip, Vflip
- Threshold: 0.6, 0.6, 0.6, 0.6
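A rough sketch of the "defect blackout" augmentation described above, simplified to black out an entire defect class (rather than individual components) using the ground-truth masks; the function name and blackout probability are my own assumptions:

```python
import numpy as np

def defect_blackout(image: np.ndarray, masks: np.ndarray, labels: np.ndarray,
                    p: float = 0.5):
    """image: (H, W, 3); masks: (4, H, W) binary defect masks; labels: (4,) image-level labels.
    Randomly black out whole defect classes; if every defect of a class is removed,
    its image-level label flips from 1 to 0."""
    image, labels = image.copy(), labels.copy()
    for c in range(masks.shape[0]):
        if labels[c] == 1 and np.random.rand() < p:
            image[masks[c] > 0] = 0       # black out all pixels of this defect class
            labels[c] = 0                 # the defect is gone from the image
    return image, labels
```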
Segmentation
We have to admit that we used models from @lightforever; these models improved our score from 0.907 on the private LB to our current score.
- Train data: 256x512 crop images
- Augmentations: Hflip, Vflip, RandomBrightnessContrast (from albumentations)
- Batch size: 12 or 24 (both accumulate gradients over 24 samples)
- Optimizer: Rectified Adam
- Models: Unet (efficientnet-b3), FPN (efficientnet-b3) from @pavel92's segmentation_models.pytorch
- Loss: BCE (with pos_weight = (2.0, 2.0, 1.0, 1.5)); 0.75 BCE + 0.25 Dice (with pos_weight = (2.0, 2.0, 1.0, 1.5))
- Model Ensemble: 1 x Unet (BCE loss) + 3 x FPN (first trained with BCE loss, then finetuned with BCE-Dice loss) + 2 x FPN (BCE loss) + 3 x Unet from the mlcomp+catalyst inference
- TTA: None, Hflip, Vflip
- Label Thresholds: 0.7, 0.7, 0.6, 0.6
- Pixel Thresholds: 0.55, 0.55, 0.55, 0.55
- Postprocessing: remove the whole mask if total pixels < threshold (600, 600, 900, 2000), and remove small components with size < 150
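A sketch of the "remove small components" post-processing step above, using OpenCV connected components; the 150-pixel size comes from the write-up, the helper name is mine:

```python
import cv2
import numpy as np

def remove_small_components(mask: np.ndarray, min_size: int = 150) -> np.ndarray:
    """mask: (H, W) binary mask. Drop connected components smaller than min_size pixels."""
    num, labels, stats, _ = cv2.connectedComponentsWithStats(mask.astype(np.uint8), connectivity=8)
    cleaned = np.zeros_like(mask, dtype=np.uint8)
    for i in range(1, num):                                  # label 0 is the background
        if stats[i, cv2.CC_STAT_AREA] >= min_size:
            cleaned[labels == i] = 1
    return cleaned
```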
Pseudo Label
We did 2 rounds of pseudo labels in this competition. The first round was generated from a submission with 0.916 public LB - maybe that was too early? The second round was done several days before the end of the competition, generated from a submission with 0.91985 public LB. With pseudo labels and public models, we finally improved from 0.91985 to 0.92124 on the public LB and from 0.90663 to 0.90883 on the private LB. The pseudo labels are chosen only if the classifiers and segmentation networks make the same decision - we got this idea from Heng. An image is chosen only if the probabilities from the classifiers are all over 0.95 or below 0.05 and it gets the same result from the segmentation part. According to this rule, 1135 images were chosen and added to the train set.
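A sketch of the agreement rule for selecting pseudo-labeled images, as I read it from the paragraph above; the variable names and exact comparison are my interpretation:

```python
import numpy as np

def select_pseudo_labels(cls_probs: np.ndarray, seg_positive: np.ndarray) -> np.ndarray:
    """cls_probs: (N, 4) classifier probabilities per test image and class.
    seg_positive: (N, 4) boolean, whether the segmentation ensemble predicts a defect.
    Keep an image only if every class probability is confidently high (>0.95) or
    low (<0.05) and the classifier's verdict agrees with the segmentation output."""
    confident = np.all((cls_probs > 0.95) | (cls_probs < 0.05), axis=1)
    agrees = np.all((cls_probs > 0.5) == seg_positive, axis=1)
    return confident & agrees   # (N,) boolean mask of images to add to the train set
```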
Predictions on the public LB:
Defect 1: 97 (128)
Defect 2: 2 (43)
Defect 3: 611 (741)
Defect 4: 110 (120)
Sum Pos: 820 (1032)