In general, a large minibatch size (up to the total number of samples) gives fast convergence but may get stuck in local minima, and it requires a huge amount of memory. On the contrary, a small minibatch size needs lots of iterations to converge, but it is more robust and you can control the memory usage.
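As a rough illustration of the memory/iteration tradeoff (the numbers here are made up, not taken from DeepDanbooru):

```python
import math

num_samples = 1_000_000                      # hypothetical dataset size
for minibatch_size in (16, 64, 256):
    steps_per_epoch = math.ceil(num_samples / minibatch_size)
    print(f"minibatch {minibatch_size:>3}: {steps_per_epoch} optimizer steps per epoch")
# Larger minibatches -> fewer (smoother) gradient steps per epoch, more GPU memory per step.
# Smaller minibatches -> more (noisier) steps per epoch, less memory per step.
```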
For training data, I think you don't have to filter the data as long as it is correctly labeled. Varied input data makes the model more stable.
Well, thanks for the answer. I'm asking because I read this information in this post: https://stats.stackexchange.com/a/153535
For training data, I think you don't have to filter the data as long as it is correctly labeled. Varied input data makes the model more stable.
Even if there is text in the picture? Will this confuse the network? There are several versions of some images among the samples, one without text and another with text, and in some cases the text lands on parts of the body or the head, which can give false data about the geometry and features of a particular character.
I mean, the neural network might start thinking that these "hieroglyphs" or "English characters" are a feature of a particular tag, right?
Although the percentage of such images is not that high, it can still blur the accuracy of the neural network at certain points, can't it?
By the way, how does the neural network react to such images? They are not only multi-frame, but also have an unusual aspect ratio and resolution.
For example: https://chan.sankakucomplex.com/post/show/19176632
An unusual image ratio (too wide or too tall) may be a problem because all input images are resized and padded to 299x299 while preserving their aspect ratio. So if the image is too long, its actual information content in the resized image ends up being smaller.
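A minimal sketch of that kind of preprocessing (assuming Pillow and NumPy; the 299x299 size and the "edge" padding that comes up later in this thread are taken from the discussion, not from the DeepDanbooru source):

```python
import numpy as np
from PIL import Image

def resize_and_pad(path, size=299):
    img = Image.open(path).convert("RGB")
    scale = size / max(img.width, img.height)            # fit the long side to `size`
    new_w, new_h = round(img.width * scale), round(img.height * scale)
    img = img.resize((new_w, new_h), Image.LANCZOS)
    arr = np.asarray(img)
    pad_w, pad_h = size - new_w, size - new_h
    # pad the short side up to `size`; "edge" mode repeats the border pixels
    return np.pad(arr,
                  ((pad_h // 2, pad_h - pad_h // 2),
                   (pad_w // 2, pad_w - pad_w // 2),
                   (0, 0)),
                  mode="edge")                            # shape (size, size, 3)
```

For a very wide or tall image, the subject ends up occupying only a thin strip of the 299x299 input, which is exactly the information loss described above.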
I think hieroglyphs or English characters may not be a problem (as long as they are correctly tagged, of course), because those features are extracted by the network and estimated independently. Even those "noisy" inputs make the network more robust.
all input images are resized and padded to 299x299
Speaking of which, what does this parameter affect? If you increase it, will the accuracy increase? I can say with confidence that increasing the resolution will increase the memory and performance requirements, but I still wonder what effects decreasing or increasing this parameter can cause.
Even that "noisy" inputs makes the network more robust.
Well, I will try to train the network with minimal interference on my part; I will only remove monochrome and black-and-white images, and images with a suboptimal aspect ratio.
I still can't start training because of data loading from sankakucomplex, as their security system causes a lot of problems...
By the way, when choosing a model, what is the difference between v1, v2, and the experimental v3?
Speaking of which, what does this parameter affect? If you increase it, will the accuracy increase?
That is exactly what I am testing internally now. The v3 model will use 512x512 resolution.
By the way, when choosing a model, what is the difference between v1, v2, and the experimental v3?
v1 is the first DeepDanbooru model, slightly deeper than the original resnet-152 imagenet model. (https://github.com/microsoft/CNTK/blob/master/Examples/Image/Classification/ResNet/Python/resnet_models.py) v2 is a deeper model than v1, but it is not fully trained/tested yet because TensorFlow throws a CUDA error when training. v3 is slightly deeper than v1 and differs in its output channels. It was created for 512x512 resolution.
You can change the input size for any model version, but a large input size means you can't train with a consumer graphics card.
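As a back-of-the-envelope check on why: activation memory in a convolutional network grows roughly with the input area, so at the same batch size a 512x512 input needs roughly (512/299)^2 times the memory of a 299x299 input (a rough estimate that ignores architecture details):

```python
ratio = (512 / 299) ** 2
print(f"~{ratio:.1f}x more activation memory per image")   # ~2.9x
```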
v1 & v3 diff
here is the tags diff
Also it takes longer to get a result: I got around 50-60 seconds per image for v1 and 95-130 seconds per image for v3.
@kichangkim how are precision & recall for v3 in comparison to v1?
@rachmadaniHaryono I think they can't be compared correctly because the dataset has changed, but here are the last training logs: v1:
Epoch[29] Loss=1416.928884, P=0.773589, R=0.502963, F1=0.609590, Speed = 47.5 samples/s, 60.00 %, ETA = 2019-12-31 15:28:03
Epoch[29] Loss=1343.514524, P=0.779304, R=0.518631, F1=0.622791, Speed = 47.5 samples/s, 60.00 %, ETA = 2019-12-31 15:14:55
Epoch[29] Loss=1406.559717, P=0.777394, R=0.508826, F1=0.615071, Speed = 47.2 samples/s, 60.00 %, ETA = 2019-12-31 16:41:41
v3:
Epoch[30] Loss=540.683345, P=0.788256, R=0.545070, F1=0.644485, Speed = 22.9 samples/s, 61.25 %, ETA = 2020-02-25 03:23:44
Epoch[30] Loss=536.273903, P=0.782580, R=0.550326, F1=0.646218, Speed = 23.1 samples/s, 61.25 %, ETA = 2020-02-25 00:30:51
Epoch[30] Loss=563.256741, P=0.784784, R=0.536157, F1=0.637072, Speed = 23.0 samples/s, 61.25 %, ETA = 2020-02-25 01:56:43
P = precision, R = recall, F1 = F1 score, computed on the training dataset. DeepDanbooru doesn't have a validation set.
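For reference, F1 here is the harmonic mean of precision and recall, which matches the logged values; for example, the first v1 line:

```python
P, R = 0.773589, 0.502963        # first v1 log line above
F1 = 2 * P * R / (P + R)
print(f"{F1:.6f}")               # 0.609590, matching the logged F1
```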
actual v1 to v3 diff
v3 compatible v1 tags
changelog
@kichangkim
An unusual image ratio (too wide or too tall) may be a problem because all input images are resized and padded to 299x299 while preserving their aspect ratio. So if the image is too long, its actual information content in the resized image ends up being smaller.
Based on the danbooru wiki for long image:
An image that is either wide or tall:
that is, at least 1024px long on one side,
and whose long side is at least four times longer than its short side.
Maybe that can be used as the basis of a long-image specification.
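A minimal sketch of that rule as a filter (the thresholds are just the wiki numbers quoted above):

```python
def is_long_image(width: int, height: int) -> bool:
    # danbooru wiki rule: at least 1024px on one side, long side >= 4x the short side
    long_side, short_side = max(width, height), min(width, height)
    return long_side >= 1024 and long_side >= 4 * short_side

print(is_long_image(300, 1500))   # True: tall strip with a 5:1 ratio
print(is_long_image(1920, 1080))  # False: long side is less than 4x the short side
```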
Does there exist a parent tag which relies only on its children tags?
Maybe skip text-related tags, or don't tag information which is not contained in the image itself?
It is mostly miss rather than hit, especially with unknown languages.
(possibly) valid:
[1] namesake is only effective if the character is known, which means it would have to include more characters than currently exist in the model
[2] artist is not included in the model, so no relation can be checked
[3] series is not included in the model, so it is not effective
[4] debatable as it may still be effective
[5] model can't recognize pokemon
Is it better to skip wide & tall images?
Separate the image into smaller sub-images that have reasonable overlaps. That way it can detect regions of the images without changing aspect ratios or downgrading resolutions.
Does there exist a parent tag which relies only on its children tags?
In that case a hierarchical tagging system is in order... but if it is not hierarchical and is instead a Directed Acyclic Graph (DAG) then a knowledge graph representation could be useful? I would like to find a solution that can do this well.
Separate the image into smaller sub-images that have reasonable overlaps. That way it can detect regions of the images without changing aspect ratios or downgrading resolutions.
I still can't imagine how to do that. If someone makes an implementation of it, please notify me.
parent-children tagging system
I just thought of something about removing parent tags.
It is possible that even if a parent tag relies only on its children tag(s), it still has to be calculated, because at least one of the children tags may have a low image count and get filtered out.
Maybe instead of removing those tags, just merge them into a single tag, e.g. 'text'. This way the model can recognize text but doesn't have to guess which language it is.
But I doubt this will work with name and username tags.
Another idea is to just merge those tag groups (text, name, username) into a single tag, e.g. text.
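A minimal sketch of that kind of merge step (the tag names are illustrative, not an actual list from the dataset):

```python
# Hypothetical merge map: language-specific text tags collapse into a single 'text' tag.
MERGE_MAP = {
    "english_text": "text",
    "japanese_text": "text",
    "korean_text": "text",
}

def merge_tags(tags):
    merged = []
    for tag in tags:
        tag = MERGE_MAP.get(tag, tag)   # map to the merged tag, or keep as-is
        if tag not in merged:           # drop duplicates created by the merge
            merged.append(tag)
    return merged

print(merge_tags(["1girl", "english_text", "japanese_text"]))
# ['1girl', 'text']
```

As noted above, this works less well for name and username tags, since merging them throws away the character/artist information.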
long image
I checked my image library and found that a long image with the full body tag is still recognizable even if it is downsized. But if the model is only trained with that, there is a possibility that long images will bias toward the full body tag.
edit:
parent children tag
AFAIK there is no program yet to parse danbooru to get the data. I may (or may not) create a simple script to do that.
long image statistic
@KichangKim can you give statistics for long images in the dataset, like actual width/height and tag counts?
Long images are handled as just "small objects with large empty space" as long as they have clean backgrounds, because they will be padded with "edge" mode (edge pixels are duplicated for padding). So it may not be a critical problem, I think.
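A tiny demonstration of what "edge" padding does (using NumPy's np.pad, which is one way to get this behavior):

```python
import numpy as np

row = np.array([[1, 2, 3]])
print(np.pad(row, ((0, 0), (2, 2)), mode="edge"))
# [[1 1 1 2 3 3 3]]  -> border values are repeated outward instead of filling with zeros
```

So a long image with a clean background just grows more of that same background, rather than gaining a hard black border.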
Pre-filtering tags (merging confusing tags into a single one and so on) may be helpful, but it needs additional knowledge about the tags themselves and makes the system more complex.
I still can't imagine how to do that. If someone makes an implementation of it, please notify me.
Can't you ask the author? As far as I know, he implemented this feature on his website: http://kanotype.iptime.org:8003/deepdanbooru
@Libidine The web demo implements evaluation-time cropping, but it is not part of deepdanbooru itself currently.
But you can easily implement it yourself using numpy's subarrays. The main idea is to crop the input image into multiple small regions, evaluate all of them, and then take the max score. Some tags are affected by cropping (e.g. number-related tags, lower/upper tags, frame-related tags and so on), so you should ignore or control those.
Of course, it needs more computation time depending on the number of subregions.
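A minimal sketch of that idea (assuming a `predict` callable that takes an HxWx3 crop and returns a 1D array of per-tag scores; the crop size, stride, and `predict` itself are placeholders, not the DeepDanbooru API):

```python
import numpy as np

def crop_offsets(full, crop, stride):
    # start offsets that cover the whole length, including the far edge
    if full <= crop:
        return [0]
    offsets = list(range(0, full - crop, stride))
    offsets.append(full - crop)
    return offsets

def tag_scores_with_crops(image, predict, crop=299, stride=150):
    """image: HxWx3 array at least `crop` pixels on each side (pad it first otherwise)."""
    h, w = image.shape[:2]
    scores = [predict(image[y:y + crop, x:x + crop])
              for y in crop_offsets(h, crop, stride)
              for x in crop_offsets(w, crop, stride)]
    # A tag gets its max score over all sub-regions: it counts if any crop shows it.
    # Count- and position-dependent tags (1girl, lower body, frame tags, ...) are
    # unreliable here and should be ignored or handled separately, as noted above.
    return np.max(scores, axis=0)
```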
Wait, I thought @DonaldTsang proposed a new method instead of the current one.
From my understanding, the image will be resized to the proposed size, e.g. 299x299 or 512x512, and the rest will be padded with "edge" mode (copied from the response above; I still don't quite understand the edge mode yet).
That is different from this part:
That way it can detect regions of the images without changing aspect ratios or downgrading resolutions.
@rachmadaniHaryono I think they can't be compared correctly because the dataset has changed, but here are the last training logs: v1:
Epoch[29] Loss=1416.928884, P=0.773589, R=0.502963, F1=0.609590, Speed = 47.5 samples/s, 60.00 %, ETA = 2019-12-31 15:28:03
Epoch[29] Loss=1343.514524, P=0.779304, R=0.518631, F1=0.622791, Speed = 47.5 samples/s, 60.00 %, ETA = 2019-12-31 15:14:55
Epoch[29] Loss=1406.559717, P=0.777394, R=0.508826, F1=0.615071, Speed = 47.2 samples/s, 60.00 %, ETA = 2019-12-31 16:41:41
v3:
Epoch[30] Loss=540.683345, P=0.788256, R=0.545070, F1=0.644485, Speed = 22.9 samples/s, 61.25 %, ETA = 2020-02-25 03:23:44
Epoch[30] Loss=536.273903, P=0.782580, R=0.550326, F1=0.646218, Speed = 23.1 samples/s, 61.25 %, ETA = 2020-02-25 00:30:51
Epoch[30] Loss=563.256741, P=0.784784, R=0.536157, F1=0.637072, Speed = 23.0 samples/s, 61.25 %, ETA = 2020-02-25 01:56:43
P = precision, R = recall, F1 = F1 score, computed on the training dataset. DeepDanbooru doesn't have a validation set.
According to the log, v3 is much better than v1. What are the hyper-parameter settings for v3, such as the learning rate (or scheduler) and batch size? I found the learning rate for v2 is 0.001 in the default project and not changed. By the way, what do you think v3 benefits from most: model architecture, input size, data filtering, or hyper-parameters?
Hi, I have new questions: I read that increasing the batch size leads to improved learning accuracy, is this true?
What exactly is the minibatch_size parameter responsible for? Is this the classic batch size, or something else?
Now, as for the training material, should I filter it? I mean, I downloaded a huge number of images from booru sites. It contains not only illustrations, but also line art, comics, doujinshi, materials with dialogue and text, covers, and so on.
What exactly do I need to remove from the training material? At this point, I'm removing all line art and black-and-white images.