Closed (ftnext closed this 4 years ago)
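The text column contains line breaks, so the rows have to be counted by reading the files with the csv module (wc -l counts physical lines, not records, so it cannot give the row count).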
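A minimal sketch of the difference, not the actual script used here: the helper name `count_rows` and the in-memory sample are assumptions made for illustration. It shows that a quoted field containing a line break spans two physical lines but is still one CSV record.

```python
import csv
import io


def count_rows(csv_file):
    """Count data rows (header excluded) in an already opened CSV file."""
    reader = csv.DictReader(csv_file)
    return sum(1 for _ in reader)


# Hypothetical sample: the first record has a line break inside a quoted field.
sample = 'id,text\n1,"first line\nsecond line"\n2,plain\n'

physical_lines = sample.count("\n")  # the number `wc -l` would report
logical_rows = count_rows(io.StringIO(sample))  # the number of actual records

print(physical_lines, logical_rows)  # prints: 4 2
```

For a real file, the csv documentation recommends opening it with `newline=""` so that line breaks inside quoted fields are passed to the reader untouched, for example `open("data/train.csv", newline="", encoding="utf-8")`.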
$ unzip ~/Downloads/nlp-getting-started.zip -d data/ # downloaded via the browser; the data directory already exists
$ wc -l data/*
3264 data/sample_submission.csv
3700 data/test.csv
8562 data/train.csv
15526 total
$ python scripts/overview.py
Row count: train 7613 (70.0%), test 3263 (30.0%)
In train data: positive 3271 (43.0%), negative 4342 (57.0%)
Blank text: in train 0 (0.0%) in test 0 (0.0%)
Blank keyword: all 87 (0.8%)
in train 61 (0.8%), in test 26 (0.8%)
Blank location: all 3638 (33.4%)
in train 2533 (33.3%), in test 1105 (33.9%)
Unique text: all 10678 (98.2%)
in train 7503 (98.6%) in test 3243 (99.4%)
same text count in train and test: 68
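The blank and unique counts above could be computed along these lines. This is a sketch under assumptions, not the contents of scripts/overview.py: the helper names `blank_count` and `text_overlap` are hypothetical, the rows are assumed to be dicts as produced by `csv.DictReader`, and the column names `keyword`, `location` and `text` are taken from the labels in the output. The reported figures are consistent with a set-based count, since 7503 + 3243 - 68 = 10678.

```python
def blank_count(rows, column):
    """Count rows whose value in `column` is an empty string."""
    return sum(1 for row in rows if not row[column])


def text_overlap(train_rows, test_rows):
    """Return (unique in train, unique in test, shared, unique overall)."""
    train_texts = {row["text"] for row in train_rows}
    test_texts = {row["text"] for row in test_rows}
    shared = train_texts & test_texts
    return (
        len(train_texts),
        len(test_texts),
        len(shared),
        len(train_texts | test_texts),
    )


# Tiny hypothetical rows standing in for the parsed CSV files.
train_rows = [
    {"keyword": "", "location": "Tokyo", "text": "a"},
    {"keyword": "fire", "location": "", "text": "a"},
    {"keyword": "flood", "location": "", "text": "b"},
]
test_rows = [
    {"keyword": "fire", "location": "", "text": "b"},
    {"keyword": "", "location": "Osaka", "text": "c"},
]

all_rows = train_rows + test_rows
blank_keyword = blank_count(all_rows, "keyword")
blank_location = blank_count(all_rows, "location")
overlap = text_overlap(train_rows, test_rows)

print(blank_keyword, blank_location, overlap)
```

Using sets keeps the comparison exact: two texts count as the same only if they match character for character, including any embedded line breaks.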