ftnext / atmaCup10-paintings-likes

https://www.guruguru.science/competitions/16/
MIT License

Get closer to the features that Lecture 2 added indiscriminately [CV: 1.0581] [LB: 1.0319] #8

Closed ftnext closed 3 years ago

ftnext commented 3 years ago

List the candidate features, leaving out those that look unlikely to help at a glance, and implement the ones that are needed.

ftnext commented 3 years ago

OneHotEncoding

The maximum count in train is shown in parentheses ().

So the idea seems to be to add these as features even when they contain many NaNs (because LightGBM is not easily affected by them).
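
A minimal sketch of this kind of one-hot encoding, assuming pandas. The `column=value` naming follows the feature names that appear in the logs of this issue (for example `acquisition_method=transfer`). The function name, the `min_count` threshold, and the toy data are hypothetical and are not taken from preprocess.py. In this sketch, missing values simply become all-zero rows.

```python
import pandas as pd


def one_hot_encode(train, test, column, min_count=2):
    """One-hot encode `column`, keeping values seen at least `min_count` times in train.

    `min_count` is a hypothetical threshold, not a value from the lecture.
    """
    counts = train[column].value_counts()  # NaN is not counted
    kept = counts[counts >= min_count].index
    encoded = []
    for df in (train, test):
        out = pd.DataFrame(index=df.index)
        for value in kept:
            # A missing value never equals `value`, so such rows get 0 everywhere
            out[f"{column}={value}"] = (df[column] == value).astype(int)
        encoded.append(out)
    return encoded[0], encoded[1]


# Toy data (made up for illustration)
train = pd.DataFrame({"acquisition_method": ["purchase", "purchase", "transfer", None]})
test = pd.DataFrame({"acquisition_method": ["transfer", "purchase", None]})
train_ohe, test_ohe = one_hot_encode(train, test, "acquisition_method")
print(train_ohe.columns.tolist())
```

Choosing the kept values from train only means train and test always end up with the same columns.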

ftnext commented 3 years ago

dataset v1.5: OneHotEncoding is the same as in Lecture 2 [CV: 1.0652]

$ python -i preprocess.py data/datasets/ data/preprocessed/v1.5
# train and test are both 33MB

$ python training.py data/preprocessed/v1.5/ data/datasets/train.csv submissions/
train: (12026, 1066)
test: (12008, 1066)
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
Training until validation scores don't improve for 100 rounds
[500]   valid_0's rmse: 1.07444
Early stopping, best iteration is:
[412]   valid_0's rmse: 1.07278
Fold 0 RMSLE: 1.0728
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
Training until validation scores don't improve for 100 rounds
[500]   valid_0's rmse: 1.07319
Early stopping, best iteration is:
[485]   valid_0's rmse: 1.07228
Fold 1 RMSLE: 1.0723
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[356]   valid_0's rmse: 1.04132
Fold 2 RMSLE: 1.0413
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[378]   valid_0's rmse: 1.07961
Fold 3 RMSLE: 1.0796
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
Training until validation scores don't improve for 100 rounds
[500]   valid_0's rmse: 1.0647
Early stopping, best iteration is:
[430]   valid_0's rmse: 1.05937
Fold 4 RMSLE: 1.0594
--------------------------------------------------
FINISHED | Whole RMSLE: 1.0652
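
The per-fold scores and the whole score in the log above are consistent with pooling the squared errors of all out-of-fold predictions. A minimal sketch of that relationship, assuming NumPy: `rmsle` and `combine_fold_scores` are hypothetical helper names, and equally sized folds are an assumption, since the split used in training.py is not shown here.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error on raw, non-negative values."""
    diff = np.log1p(np.asarray(y_pred, dtype=float)) - np.log1p(np.asarray(y_true, dtype=float))
    return float(np.sqrt(np.mean(diff ** 2)))


def combine_fold_scores(scores, sizes):
    """Pool per-fold RMSLE values, weighting each squared score by its fold size."""
    scores = np.asarray(scores, dtype=float)
    sizes = np.asarray(sizes, dtype=float)
    return float(np.sqrt(np.sum(sizes * scores ** 2) / np.sum(sizes)))


fold_scores = [1.0728, 1.0723, 1.0413, 1.0796, 1.0594]  # copied from the log above
fold_sizes = [1, 1, 1, 1, 1]  # assumption: equally sized folds
whole = combine_fold_scores(fold_scores, fold_sizes)
print(round(whole, 4))
```

Because the pooling happens on squared errors, the whole score is not the plain average of the fold scores, although the two are close when the folds score similarly.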

20210312-135550_data_preprocessed_v1 5

features count: 1066
title__lang=__label__en 118844.63220578432
StringLength__sub_title 65556.89275246859
size_h  55616.923470860114
size_w  39223.87466659397
more_title__lang=__label__en    31559.823882102966
title__lang=__label__nl 25121.263835430145
CE__principal_maker 12894.914893032517
dating_year_early   10398.156597647816
dating_year_late    8812.289369143546
description_tfidf_2 7871.548617511988
StringLength__more_title    7238.329542626743
acquisition_date=1994-01-01T00:00:00    7200.647782564163
StringLength__title 6007.2990922778845
description_tfidf_9 5623.561638459563
StringLength__description   5138.471152223647
description_tfidf_10    4998.281978905201
CE__acquisition_method  4884.856794159859
more_title__lang=__label__nl    4643.109343677759
description_tfidf_16    4450.151097655296
description_tfidf_0 4010.74718981795
StringLength__long_title    3995.5137103907764
description_tfidf_44    3874.518201753497
description_tfidf_47    3676.9679602347314
description_tfidf_5 3655.5469564199448
description_tfidf_1 3590.7993990182877
description_tfidf_4 3535.1186010837555
description_tfidf_27    3467.5155571103096
description_tfidf_22    3411.4941940903664
description_tfidf_31    3322.390802204609
acquisition_method=transfer 3311.1518894433975
description_tfidf_6 3203.190434006974
description_tfidf_7 2941.157238088548
description_tfidf_3 2887.5754666924477
description_tfidf_49    2862.502710789442
dating_period=19    2818.2294959425926
description_tfidf_13    2796.575230151415
description_tfidf_45    2761.0347990207374
description_tfidf_21    2533.8030881285667
description_tfidf_14    2500.7471777647734
description_tfidf_48    2496.626448661089
more_title__lang=   2495.551609516144
description_tfidf_28    2490.9852796792984
description_tfidf_42    2461.801316257566
description_tfidf_46    2451.1162937805057
description_tfidf_40    2385.0348535478115
description_tfidf_39    2353.476946234703
description_tfidf_18    2307.789521291852
CE__title   2296.7285529058427
description_tfidf_35    2289.0134964175522
description_tfidf_29    2273.5260899960995

ftnext commented 3 years ago

CountEncoding
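
A minimal sketch of count encoding, assuming pandas. The `CE__` prefix follows the feature names in the logs of this issue (for example `CE__principal_maker`). Counting over train and test combined is an assumption, as are the function name and the toy data; preprocess.py may count differently.

```python
import pandas as pd


def count_encode(train, test, columns):
    """Replace each value by how often it occurs in train and test combined."""
    whole = pd.concat([train, test], ignore_index=True)
    out_train = pd.DataFrame(index=train.index)
    out_test = pd.DataFrame(index=test.index)
    for column in columns:
        counts = whole[column].value_counts()  # NaN is not counted
        # Values without a count (missing values) stay NaN after the lookup
        out_train[f"CE__{column}"] = train[column].map(counts)
        out_test[f"CE__{column}"] = test[column].map(counts)
    return out_train, out_test


# Toy data (made up for illustration)
train = pd.DataFrame({"principal_maker": ["A", "A", "B", None]})
test = pd.DataFrame({"principal_maker": ["A", "C"]})
ce_train, ce_test = count_encode(train, test, ["principal_maker"])
print(ce_train)
```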

ftnext commented 3 years ago

String length

more_title has been dropped (→ the plan is to keep it rather than drop it)
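
A minimal sketch of string-length features, assuming pandas. The `StringLength__` prefix follows the feature names in the logs of this issue (for example `StringLength__more_title`). The function name and the toy data are hypothetical, and the column list passed in is only an example.

```python
import pandas as pd


def string_length_features(df, columns):
    """Return one `StringLength__<column>` feature per text column."""
    out = pd.DataFrame(index=df.index)
    for column in columns:
        # .str.len() leaves missing values as NaN instead of raising
        out[f"StringLength__{column}"] = df[column].str.len()
    return out


# Toy data (made up for illustration)
df = pd.DataFrame({
    "title": ["The Night Watch", "Self-portrait"],
    "more_title": ["abc", None],
})
lengths = string_length_features(df, ["title", "more_title"])
print(lengths)
```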

ftnext commented 3 years ago

dataset v1.5.1: added CountEncoding and string length [CV: 1.0581] [LB: 1.0319]

$ python preprocess.py data/datasets/ data/preprocessed/v1.5.1
train: (12026, 1082)
test: (12008, 1082)

$ python training.py data/preprocessed/v1.5.1/ data/datasets/train.csv submissions/
train: (12026, 1082)
test: (12008, 1082)
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
Training until validation scores don't improve for 100 rounds
[500]   valid_0's rmse: 1.0618
Early stopping, best iteration is:
[425]   valid_0's rmse: 1.05996
Fold 0 RMSLE: 1.0600
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[279]   valid_0's rmse: 1.0677
Fold 1 RMSLE: 1.0677
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
Training until validation scores don't improve for 100 rounds
[500]   valid_0's rmse: 1.03321
Early stopping, best iteration is:
[435]   valid_0's rmse: 1.03184
Fold 2 RMSLE: 1.0318
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[299]   valid_0's rmse: 1.08273
Fold 3 RMSLE: 1.0827
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
Training until validation scores don't improve for 100 rounds
[500]   valid_0's rmse: 1.04863
Early stopping, best iteration is:
[411]   valid_0's rmse: 1.0474
Fold 4 RMSLE: 1.0474
--------------------------------------------------
FINISHED | Whole RMSLE: 1.0581
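
The warning repeated in the logs suggests that `max_depth` was set while `num_leaves` was left at its default of 31. A minimal sketch of a parameter dictionary that sets both explicitly follows. `objective`, `metric`, `learning_rate`, `max_depth`, and `num_leaves` are real LightGBM parameter names, but every value below and the helper `make_lgbm_params` are hypothetical, since the parameters used in training.py are not shown in this issue.

```python
def make_lgbm_params(max_depth=6, learning_rate=0.1):
    """Build LightGBM parameters with num_leaves set explicitly (hypothetical values)."""
    full_leaves = 2 ** max_depth  # leaves in a complete binary tree of this depth
    return {
        "objective": "regression",
        "metric": "rmse",
        "learning_rate": learning_rate,
        "max_depth": max_depth,
        # Chosen explicitly and kept below the full tree for this depth
        "num_leaves": full_leaves - 1,
    }


params = make_lgbm_params(max_depth=6)
print(params)
```

Setting the two values together may avoid the warning. Whether it changes the CV score is untested here.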

20210313-012049_data_preprocessed_v1 5 1

features count: 1082
title__lang=__label__en 168383.14408916235
size_h  59977.0106382519
StringLength__sub_title 53420.976950764656
size_w  27903.394222528674
CE__acquisition_date    26692.429003566504
more_title__lang=__label__en    11337.206842660904
dating_year_late    8089.388304844499
StringLength__more_title    7929.445962419733
CE__principal_maker 7928.032715566456
dating_year_early   7566.1707446575165
description_tfidf_10    5573.114014357328
CE__dating_period   5323.992191135883
CE__acquisition_credit_line 5248.245001330972
title__lang=__label__nl 4954.860521554947
CE__description 4911.8659618496895
description_tfidf_46    4445.551183119416
description_tfidf_9 4378.325389226899
CE__acquisition_method  4309.816353216767
description_tfidf_16    4306.426808230579
CE__principal_or_first_maker    4194.883913826197
StringLength__title 4188.884796886705
more_title__lang=__label__nl    3963.5859320759773
StringLength__description   3750.246826261282
StringLength__long_title    3702.8299085581675
description_tfidf_22    3499.176881402731
description_tfidf_0 3309.843187302351
CE__sub_title   3288.9826562441885
description_tfidf_28    3207.533474355936
CE__dating_sorting_date 3158.271821387112
description_tfidf_2 3097.139425635338
description_tfidf_5 2938.435487974435
CE__dating_year_late    2934.583312444389
acquisition_date=1994-01-01T00:00:00    2784.607858657837
long_title__lang=__label__en    2761.530075713992
description_tfidf_1 2679.9649018645287
description_tfidf_4 2570.2622108235955
CE__dating_presenting_date  2485.637671297416
description_tfidf_21    2467.6984004974365
description_tfidf_47    2467.3162631988525
description_tfidf_31    2459.160049557686
description_tfidf_6 2445.1297653466463
description_tfidf_49    2411.843475818634
description_tfidf_13    2392.2614911198616
description_tfidf_48    2352.342723816633
description_tfidf_3 2347.935442060232
description_tfidf_29    2314.170175552368
description_tfidf_14    2214.7736707031727
description_tfidf_18    2208.7245542109013
StringLength__principal_maker   2205.4734529554844
description_tfidf_41    2180.937787041068
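
A listing like the ones above (50 rows each, sorted in descending order) could be produced by averaging per-fold importances. How training.py actually builds it is not shown, so this is only a sketch, assuming NumPy and pandas and assuming that each fold yields one importance array, for example from LightGBM's `Booster.feature_importance(importance_type="gain")`. The function name and the numbers are made up.

```python
import numpy as np
import pandas as pd


def mean_importance(feature_names, fold_importances, top=50):
    """Average per-fold importance arrays and return the `top` largest, descending."""
    table = pd.DataFrame(np.vstack(fold_importances), columns=feature_names)
    return table.mean(axis=0).sort_values(ascending=False).head(top)


# Made-up importances for three features over two folds
names = ["size_h", "size_w", "StringLength__title"]
folds = [np.array([10.0, 4.0, 1.0]), np.array([14.0, 2.0, 3.0])]
ranking = mean_importance(names, folds)
for name, value in ranking.items():
    print(name, value)
```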