In general, a large minibatch size (up to the total number of samples) gives fast convergence but may get stuck in local minima, and it requires a huge amount of memory. On the contrary, a small minibatch size needs lots of iterations to converge, but it is more robust and you can control the memory usage.
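As a rough illustration of the memory/iteration tradeoff (the numbers here are made up, not taken from DeepDanbooru):

```python
import math

num_samples = 1_000_000                      # hypothetical dataset size
for minibatch_size in (16, 64, 256):
    steps_per_epoch = math.ceil(num_samples / minibatch_size)
    print(f"minibatch {minibatch_size:>3}: {steps_per_epoch} optimizer steps per epoch")
# Larger minibatches -> fewer (smoother) gradient steps per epoch, more GPU memory per step.
# Smaller minibatches -> more (noisier) steps per epoch, less memory per step.
```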
For training data, I think you don't have to filter the data as long as it is correctly labeled. Varied input data makes the model more stable.
Well, thanks for the answer. I'm asking because I read this information in this post: https://stats.stackexchange.com/a/153535
For training data, I think you don't have to filter the data as long as it is correctly labeled. Varied input data makes the model more stable.
Even if there is text in the picture? Will this confuse the network? There are several versions of some images among the samples, one without text and another with text, and in some cases the text lands on parts of the body or the head, which can give false data about the geometry and features of a particular character.
I mean, the neural network might start thinking that these "hieroglyphs" or "English characters" are a feature of a particular tag, right?
Although the percentage of such images is not that high, it can still blur the accuracy of the neural network at certain points, can't it?
By the way, how does the neural network react to such images? They are not only multi-frame, but also have an unusual aspect ratio and resolution.
For example: https://chan.sankakucomplex.com/post/show/19176632
An unusual image ratio (too wide or too tall) may be a problem because all input images are resized and padded to 299x299 while preserving their aspect ratio. So if the image is too long, its actual information content in the resized image ends up being smaller.
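A minimal sketch of that kind of preprocessing (assuming Pillow and NumPy; the 299x299 size and the "edge" padding that comes up later in this thread are taken from the discussion, not from the DeepDanbooru source):

```python
import numpy as np
from PIL import Image

def resize_and_pad(path, size=299):
    img = Image.open(path).convert("RGB")
    scale = size / max(img.width, img.height)            # fit the long side to `size`
    new_w, new_h = round(img.width * scale), round(img.height * scale)
    img = img.resize((new_w, new_h), Image.LANCZOS)
    arr = np.asarray(img)
    pad_w, pad_h = size - new_w, size - new_h
    # pad the short side up to `size`; "edge" mode repeats the border pixels
    return np.pad(arr,
                  ((pad_h // 2, pad_h - pad_h // 2),
                   (pad_w // 2, pad_w - pad_w // 2),
                   (0, 0)),
                  mode="edge")                            # shape (size, size, 3)
```

For a very wide or tall image, the subject ends up occupying only a thin strip of the 299x299 input, which is exactly the information loss described above.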
I think hieroglyphs or English characters may not be a problem (as long as they are correctly tagged, of course), because those features are extracted by the network and estimated independently. Even those "noisy" inputs make the network more robust.
all input images are resized and padded to 299x299
Speaking of which, what does this parameter affect? If you increase it, will the accuracy increase? I can say with confidence that increasing the resolution will increase the memory and performance requirements, but I still wonder what effects decreasing or increasing this parameter can cause.
Even that "noisy" inputs makes the network more robust.
Well, I will try to train the network with minimal interference on my part; I will only remove monochrome and black-and-white images, and images with a suboptimal aspect ratio.
I still can't start training because of data loading from sankakucomplex, as their security system causes a lot of problems...
By the way, when choosing a model, what is the difference between v1, v2, and the experimental v3?
Speaking of which, what does this parameter affect? If you increase it, will the accuracy increase?
That is exactly what I am testing internally now. The v3 model will use 512x512 resolution.
By the way, when choosing a model, what is the difference between v1, v2, and the experimental v3?
v1 is the first DeepDanbooru model, slightly deeper than the original resnet-152 imagenet model. (https://github.com/microsoft/CNTK/blob/master/Examples/Image/Classification/ResNet/Python/resnet_models.py) v2 is a deeper model than v1, but it is not fully trained/tested yet because TensorFlow throws a CUDA error when training. v3 is slightly deeper than v1 and differs in its output channels. It was created for 512x512 resolution.
You can change the input size for any model version, but a large input size means you can't train with a consumer graphics card.
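As a back-of-the-envelope check on why: activation memory in a convolutional network grows roughly with the input area, so at the same batch size a 512x512 input needs roughly (512/299)^2 times the memory of a 299x299 input (a rough estimate that ignores architecture details):

```python
ratio = (512 / 299) ** 2
print(f"~{ratio:.1f}x more activation memory per image")   # ~2.9x
```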
v1 & v3 diff
here is the tags diff
Also it takes longer to get a result: I got around 50-60 seconds per image for v1 and 95-130 seconds per image for v3.
@kichangkim how are precision & recall for v3 in comparison to v1?
@rachmadaniHaryono I think they can't be compared correctly because the dataset has changed, but here are the last training logs: v1:
Epoch[29] Loss=1416.928884, P=0.773589, R=0.502963, F1=0.609590, Speed = 47.5 samples/s, 60.00 %, ETA = 2019-12-31 15:28:03
Epoch[29] Loss=1343.514524, P=0.779304, R=0.518631, F1=0.622791, Speed = 47.5 samples/s, 60.00 %, ETA = 2019-12-31 15:14:55
Epoch[29] Loss=1406.559717, P=0.777394, R=0.508826, F1=0.615071, Speed = 47.2 samples/s, 60.00 %, ETA = 2019-12-31 16:41:41
v3:
Epoch[30] Loss=540.683345, P=0.788256, R=0.545070, F1=0.644485, Speed = 22.9 samples/s, 61.25 %, ETA = 2020-02-25 03:23:44
Epoch[30] Loss=536.273903, P=0.782580, R=0.550326, F1=0.646218, Speed = 23.1 samples/s, 61.25 %, ETA = 2020-02-25 00:30:51
Epoch[30] Loss=563.256741, P=0.784784, R=0.536157, F1=0.637072, Speed = 23.0 samples/s, 61.25 %, ETA = 2020-02-25 01:56:43
P = precision, R = recall, F1 = F1 score, computed on the training dataset. DeepDanbooru doesn't have a validation set.
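For reference, F1 here is the harmonic mean of precision and recall, which matches the logged values; for example, the first v1 line:

```python
P, R = 0.773589, 0.502963        # first v1 log line above
F1 = 2 * P * R / (P + R)
print(f"{F1:.6f}")               # 0.609590, matching the logged F1
```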
actual v1 to v3 diff
v3 compatible v1 tags
changelog
@kichangkim
An unusual image ratio (too wide or too tall) may be a problem because all input images are resized and padded to 299x299 while preserving their aspect ratio. So if the image is too long, its actual information content in the resized image ends up being smaller.
Based on the danbooru wiki for long image:
An image that is either wide or tall:
that is, at least 1024px long on one side,
and whose long side is at least four times longer than its short side.
Maybe that can be used as the basis of a long-image specification.
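A minimal sketch of that rule as a filter (the thresholds are just the wiki numbers quoted above):

```python
def is_long_image(width: int, height: int) -> bool:
    # danbooru wiki rule: at least 1024px on one side, long side >= 4x the short side
    long_side, short_side = max(width, height), min(width, height)
    return long_side >= 1024 and long_side >= 4 * short_side

print(is_long_image(300, 1500))   # True: tall strip with a 5:1 ratio
print(is_long_image(1920, 1080))  # False: long side is less than 4x the short side
```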
Does there exist a parent tag which relies only on its children tags?
Maybe skip text-related tags, or don't tag information which is not contained in the image itself?
It is mostly miss rather than hit, especially with unknown languages.
(possibly) valid:
[1] namesake is only effective if the character is known, which means it would have to include more characters than currently exist in the model
[2] artist is not included in the model, so no relation can be checked
[3] series is not included in the model, so it is not effective
[4] debatable as it may still be effective
[5] model can't recognize pokemon
Is it better to skip wide & tall images?
Separate the image into smaller sub-images that have reasonable overlaps. That way it can detect regions of the images without changing aspect ratios or downgrading resolutions.
Does there exist a parent tag which relies only on its children tags?
In that case a hierarchical tagging system is in order... but if it is not hierarchical and is instead a Directed Acyclic Graph (DAG) then a knowledge graph representation could be useful? I would like to find a solution that can do this well.
Separate the image into smaller sub-images that have reasonable overlaps. That way it can detect regions of the images without changing aspect ratios or downgrading resolutions.
I still can't imagine how to do that. If someone makes an implementation of it, please notify me.
parent-children tagging system
I just thought of something about removing parent tags.
It is possible that even if a parent tag relies only on its children tag(s), it still has to be calculated, because at least one of the children tags may have a low image count and get filtered out.
Maybe instead of removing those tags, just merge them into a single tag, e.g. 'text'. This way the model can recognize text but doesn't have to guess which language it is.
But I doubt this will work with name and username tags.
Another idea is to just merge those tag groups (text, name, username) into a single tag, e.g. text.
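A minimal sketch of that kind of merge step (the tag names are illustrative, not an actual list from the dataset):

```python
# Hypothetical merge map: language-specific text tags collapse into a single 'text' tag.
MERGE_MAP = {
    "english_text": "text",
    "japanese_text": "text",
    "korean_text": "text",
}

def merge_tags(tags):
    merged = []
    for tag in tags:
        tag = MERGE_MAP.get(tag, tag)   # map to the merged tag, or keep as-is
        if tag not in merged:           # drop duplicates created by the merge
            merged.append(tag)
    return merged

print(merge_tags(["1girl", "english_text", "japanese_text"]))
# ['1girl', 'text']
```

As noted above, this works less well for name and username tags, since merging them throws away the character/artist information.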
long image
I checked my image library and found that a long image with the full body tag is still recognizable even if it is downsized. But if the model is only trained with that, there is a possibility that long images will bias toward the full body tag.
edit:
parent children tag
AFAIK there is no program yet to parse danbooru to get the data. I may (or may not) create a simple script to do that.
long image statistic
@KichangKim can you give statistics for long images in the dataset, like actual width/height and tag counts?
Long images are handled as just "small objects with large empty space" as long as they have clean backgrounds, because they will be padded with "edge" mode (edge pixels are duplicated for padding). So it may not be a critical problem, I think.
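A tiny demonstration of what "edge" padding does (using NumPy's np.pad, which is one way to get this behavior):

```python
import numpy as np

row = np.array([[1, 2, 3]])
print(np.pad(row, ((0, 0), (2, 2)), mode="edge"))
# [[1 1 1 2 3 3 3]]  -> border values are repeated outward instead of filling with zeros
```

So a long image with a clean background just grows more of that same background, rather than gaining a hard black border.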
Pre-filtering tags (merging confusing tags into a single one and so on) may be helpful, but it needs additional knowledge about the tags themselves and makes the system more complex.
I still can't imagine how to do that. If someone makes an implementation of it, please notify me.
Can't you ask the author? As far as I know, he implemented this feature on his website: http://kanotype.iptime.org:8003/deepdanbooru
@Libidine The web demo implements evaluation-time cropping, but it is not part of deepdanbooru itself currently.
But you can easily implement it yourself using numpy's subarrays. The main idea is to crop the input image into multiple small regions, evaluate all of them, and then take the max score. Some tags are affected by cropping (e.g. number-related tags, lower/upper tags, frame-related tags and so on), so you should ignore or control those.
Of course, it needs more computation time depending on the number of subregions.
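A minimal sketch of that idea (assuming a `predict` callable that takes an HxWx3 crop and returns a 1D array of per-tag scores; the crop size, stride, and `predict` itself are placeholders, not the DeepDanbooru API):

```python
import numpy as np

def crop_offsets(full, crop, stride):
    # start offsets that cover the whole length, including the far edge
    if full <= crop:
        return [0]
    offsets = list(range(0, full - crop, stride))
    offsets.append(full - crop)
    return offsets

def tag_scores_with_crops(image, predict, crop=299, stride=150):
    """image: HxWx3 array at least `crop` pixels on each side (pad it first otherwise)."""
    h, w = image.shape[:2]
    scores = [predict(image[y:y + crop, x:x + crop])
              for y in crop_offsets(h, crop, stride)
              for x in crop_offsets(w, crop, stride)]
    # A tag gets its max score over all sub-regions: it counts if any crop shows it.
    # Count- and position-dependent tags (1girl, lower body, frame tags, ...) are
    # unreliable here and should be ignored or handled separately, as noted above.
    return np.max(scores, axis=0)
```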
Wait, I thought @DonaldTsang proposed a new method instead of the current one.
From my understanding, the image will be resized to the proposed size, e.g. 299x299 or 512x512, and the rest will be padded with "edge" mode (copied from the response above; I still don't quite understand the edge mode yet).
That is different from this part:
That way it can detect regions of the images without changing aspect ratios or downgrading resolutions.
@rachmadaniHaryono I think they can't be compared correctly because the dataset has changed, but here are the last training logs: v1:
Epoch[29] Loss=1416.928884, P=0.773589, R=0.502963, F1=0.609590, Speed = 47.5 samples/s, 60.00 %, ETA = 2019-12-31 15:28:03
Epoch[29] Loss=1343.514524, P=0.779304, R=0.518631, F1=0.622791, Speed = 47.5 samples/s, 60.00 %, ETA = 2019-12-31 15:14:55
Epoch[29] Loss=1406.559717, P=0.777394, R=0.508826, F1=0.615071, Speed = 47.2 samples/s, 60.00 %, ETA = 2019-12-31 16:41:41
v3:
Epoch[30] Loss=540.683345, P=0.788256, R=0.545070, F1=0.644485, Speed = 22.9 samples/s, 61.25 %, ETA = 2020-02-25 03:23:44
Epoch[30] Loss=536.273903, P=0.782580, R=0.550326, F1=0.646218, Speed = 23.1 samples/s, 61.25 %, ETA = 2020-02-25 00:30:51
Epoch[30] Loss=563.256741, P=0.784784, R=0.536157, F1=0.637072, Speed = 23.0 samples/s, 61.25 %, ETA = 2020-02-25 01:56:43
P = precision, R = recall, F1 = F1 score, computed on the training dataset. DeepDanbooru doesn't have a validation set.
According to the log, v3 is much better than v1. What are the hyper-parameter settings for v3, such as the learning rate (or scheduler) and batch size? I found the learning rate for v2 is 0.001 in the default project and not changed. By the way, what do you think v3 benefits from most: model architecture, input size, data filtering, or hyper-parameters?
Hi, I have new questions: I read that increasing the batch size leads to improved learning accuracy, is this true?
What exactly is the minibatch_size parameter responsible for? Is this the classic batch size, or something else?
Now, as for the training material, should I filter it? I mean, I downloaded a huge number of images from booru sites. It contains not only illustrations, but also line art, comics, doujinshi, materials with dialogue and text, covers, and so on.
What exactly do I need to remove from the training material? At this point, I'm removing all line art and black-and-white images.