ftnext / real-or-not-nlp

https://www.kaggle.com/c/nlp-getting-started
MIT License

Check the number of data records #1

Closed ftnext closed 4 years ago

ftnext commented 4 years ago

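The text column contains newline characters, so the records have to be read and counted with the csv module (wc -l counts physical lines, not records, so it cannot give the right number).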

$ unzip ~/Downloads/nlp-getting-started.zip -d data/  # downloaded via the browser; the data directory already exists
$ wc -l data/*
    3264 data/sample_submission.csv
    3700 data/test.csv
    8562 data/train.csv
   15526 total
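
The gap between physical lines and CSV records can be reproduced on a tiny in-memory sample. This is only a minimal sketch, not the script used in this repository: the helper name `count_records` and the sample data are made up for illustration.

```python
import csv
import io


def count_records(fileobj):
    """Count CSV data records (header excluded), honoring quoted newlines."""
    reader = csv.reader(fileobj)
    next(reader, None)  # skip the header row
    return sum(1 for _ in reader)


# Hypothetical sample: the first record's text field contains a newline.
sample = 'id,text\n1,"first line\nsecond line"\n2,plain\n'

print(sample.count("\n"))                  # 4 physical lines, which is what wc -l reports
print(count_records(io.StringIO(sample)))  # 2 actual records
```

For a real file, something like `open("data/train.csv", newline="", encoding="utf-8")` can be passed instead of the `io.StringIO` object; `newline=""` is what the csv module documentation recommends so that newlines inside quoted fields are preserved.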
ftnext commented 4 years ago
$ python scripts/overview.py
Row count: train 7613 (70.0%), test 3263 (30.0%)
In train data: positive 3271 (43.0%), negative 4342 (57.0%)
Blank text: in train 0 (0.0%) in test 0 (0.0%)
Blank keyword: all 87 (0.8%)
  in train 61 (0.8%), in test 26 (0.8%)
Blank location: all 3638 (33.4%)
  in train 2533 (33.3%), in test 1105 (33.9%)
Unique text: all 10678 (98.2%)
  in train 7503 (98.6%) in test 3243 (99.4%)
  same text count in train and test: 68
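
scripts/overview.py itself is not shown in this thread. The following is a rough sketch of how figures like the blank counts, unique text count, and train/test overlap could be computed with the csv module. The function names and sample rows are hypothetical, and the column names (`keyword`, `location`, `text`) are assumed from the output labels above.

```python
import csv
import io


def load_rows(fileobj):
    """Read every record as a dict keyed by the header names."""
    return list(csv.DictReader(fileobj))


def blank_count(rows, column):
    """Number of rows whose value in `column` is empty or whitespace only."""
    return sum(1 for row in rows if not row[column].strip())


def summarize(train, test):
    """Blank and uniqueness counts across both splits (assumed definitions)."""
    train_texts = {row["text"] for row in train}
    test_texts = {row["text"] for row in test}
    return {
        "blank_keyword": blank_count(train + test, "keyword"),
        "blank_location": blank_count(train + test, "location"),
        "unique_text": len(train_texts | test_texts),  # union of both splits
        "shared_text": len(train_texts & test_texts),  # texts present in both
    }


# Hypothetical in-memory stand-ins for data/train.csv and data/test.csv.
train = load_rows(io.StringIO(
    'id,keyword,location,text,target\n'
    '1,,Tokyo,"fire\nnearby",1\n'
    '2,flood,,rain,0\n'
))
test = load_rows(io.StringIO(
    'id,keyword,location,text\n'
    '3,flood,,rain\n'
    '4,,,"calm day"\n'
))

print(summarize(train, test))
```

Treating "unique text: all" as the union and "same text count" as the intersection is an assumption, but it is consistent with the numbers reported above: 7503 + 3243 - 68 = 10678.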