Closed chostyouwang closed 2 years ago
When I run preprocess.py on Colab, it stops spontaneously, so we cannot get the preprocessed data.
Could you please provide more details from your error log? A wrong data format might cause preprocessing to fail. An example of your input data would also help me diagnose the problem.
Running on Colab:
!python ./src/preprocess.py -raw_path json_data -save_path bert_data -bert_dir bert/chinese_bert -log_file logs/preprocess.log -emb_path pretrain_emb/word2vec -tokenize -truncated -add_ex_label
[2021-03-30 04:30:02,054 INFO] loading vocabulary file bert/chinese_bert/vocab.txt
[2021-03-30 04:30:02,078 INFO] Processing json_data/ali.dev.0.json
[2021-03-30 04:30:02,504 INFO] loading Word2VecKeyedVectors object from pretrain_emb/word2vec
[2021-03-30 04:30:02,548 INFO] setting ignored attribute vectors_norm to None
[2021-03-30 04:30:02,549 INFO] loaded pretrain_emb/word2vec
50
100
150
200
^c
Running on Windows (PyCharm):
[2021-03-30 12:32:15,892 INFO] loading vocabulary file bert/chinese_bert\vocab.txt
[2021-03-30 12:32:15,924 INFO] Processing json_data\ali.dev.0.json
[2021-03-30 12:32:16,201 INFO] loading Word2VecKeyedVectors object from pretrain_emb/word2vec
[2021-03-30 12:32:16,304 INFO] setting ignored attribute vectors_norm to None
[2021-03-30 12:32:16,304 INFO] loaded pretrain_emb/word2vec
[2021-03-30 12:32:16,524 INFO] Processed instances 0
[2021-03-30 12:32:16,524 INFO] Saving to bert_data\bert.pt_data\ali.dev.0.bert.pt
Traceback (most recent call last):
File "./src/preprocess.py", line 52, in
Perhaps on the Windows platform, PyCharm could not resolve the file path. Please make sure the path of the input file is correct.
Besides, Chinese characters are encoded as GBK by default on Windows, while on Linux they are encoded as UTF-8. This can cause a crash when you run with '-tokenize' or '-add_ex_label'.
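One way to avoid the GBK/UTF-8 mismatch is to pass an explicit encoding to every `open()` call instead of relying on the platform default. A minimal sketch (the helper names `load_json`/`save_json` are hypothetical, not from this repo):

```python
import json


def load_json(path):
    # encoding='utf-8' overrides the platform default (GBK/cp936 on a
    # Chinese-locale Windows), preventing UnicodeDecodeError on files
    # that were written as UTF-8 on Linux/Colab.
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def save_json(obj, path):
    # ensure_ascii=False keeps Chinese characters readable in the file
    # instead of escaping them as \uXXXX sequences.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(obj, f, ensure_ascii=False)
```

Using `os.path.join` (or `pathlib.Path`) for the `-raw_path`/`-save_path` arguments also sidesteps the `/` vs `\` separator differences visible in the two logs above.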
When we run preprocess.py, the source .json file (about 2 MB) is converted into a .pt file needed by BERT that is only about 1 KB. Why? When we re-train the pipeline model, it reports 0 examples in the file. How can we get preprocess.py to produce the correct output?