RowitZou / topic-dialog-summ

AAAI-2021 paper: Topic-Oriented Spoken Dialogue Summarization for Customer Service with Saliency-Aware Topic Modeling.
MIT License
77 stars · 9 forks

file 'preprocess.py' has a problem #2

Closed · chostyouwang closed this 2 years ago

chostyouwang commented 3 years ago

When we run 'preprocess.py', the source file (.json, about 2 MB) is converted into the .pt file needed for BERT, but the output is only about 1 KB. Why? When we then train the pipeline model, it reports 0 examples in the file. How do we get 'preprocess.py' to produce the correct output?

chostyouwang commented 3 years ago

When I run preprocess.py on Colab, it stops spontaneously, so we cannot get the preprocessed data.

RowitZou commented 3 years ago

Could you please provide more details about your error log? The wrong data format might lead to preprocess failure. An example of your input data is also helpful for me to solve your problem.

chostyouwang commented 3 years ago

> Could you please provide more details about your error log? The wrong data format might lead to preprocess failure. An example of your input data is also helpful for me to solve your problem.

Running on Colab:

```
!python ./src/preprocess.py -raw_path json_data -save_path bert_data -bert_dir bert/chinese_bert -log_file logs/preprocess.log -emb_path pretrain_emb/word2vec -tokenize -truncated -add_ex_label
[2021-03-30 04:30:02,054 INFO] loading vocabulary file bert/chinese_bert/vocab.txt
[2021-03-30 04:30:02,078 INFO] Processing json_data/ali.dev.0.json
[2021-03-30 04:30:02,504 INFO] loading Word2VecKeyedVectors object from pretrain_emb/word2vec
[2021-03-30 04:30:02,548 INFO] setting ignored attribute vectors_norm to None
[2021-03-30 04:30:02,549 INFO] loaded pretrain_emb/word2vec
50 100 150 200
^C
```

Running on Windows (PyCharm):

```
[2021-03-30 12:32:15,892 INFO] loading vocabulary file bert/chinese_bert\vocab.txt
[2021-03-30 12:32:15,924 INFO] Processing json_data\ali.dev.0.json
[2021-03-30 12:32:16,201 INFO] loading Word2VecKeyedVectors object from pretrain_emb/word2vec
[2021-03-30 12:32:16,304 INFO] setting ignored attribute vectors_norm to None
[2021-03-30 12:32:16,304 INFO] loaded pretrain_emb/word2vec
[2021-03-30 12:32:16,524 INFO] Processed instances 0
[2021-03-30 12:32:16,524 INFO] Saving to bert_data\bert.pt_data\ali.dev.0.bert.pt
Traceback (most recent call last):
  File "./src/preprocess.py", line 52, in
    data_builder.format_to_bert(args)
FileNotFoundError: [Errno 2] No such file or directory: 'bert_data\bert.pt_data\ali.dev.0.bert.pt'
```

With the same command, Windows does not actually process the file; it saves immediately, and the generated .pt file is only 1 KB. If -tokenize, -truncated, and -add_ex_label are all set to false (i.e., the three flags are omitted from the command line), a .pt file can indeed be generated, but during training the tgt_label values are missing, so training still fails.
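The FileNotFoundError above indicates the save path's parent directory ('bert_data\bert.pt_data') was never created before writing. One defensive fix, sketched below, is to create the directory right before saving. This is a minimal, self-contained illustration: the `save_dataset` helper is hypothetical (not from the repo), and `pickle` stands in for `torch.save`, which fails the same way on a missing directory.

```python
import os
import pickle

def save_dataset(dataset, save_path):
    # Hypothetical helper: create the parent directory first so the
    # save call does not raise FileNotFoundError when the directory
    # does not exist yet (common on a fresh Windows checkout).
    os.makedirs(os.path.dirname(save_path), exist_ok=True)
    with open(save_path, "wb") as f:
        pickle.dump(dataset, f)

path = os.path.join("bert_data", "ali.dev.0.bert.pt")
save_dataset([{"src": [1, 2, 3]}], path)
print(os.path.exists(path))  # → True
```

Using `os.path.join` instead of hand-built paths also avoids mixing `/` and `\` separators across platforms.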

RowitZou commented 3 years ago

Perhaps on the Windows platform, PyCharm could not recognize the file path. Please make sure the path of the input file is correct.

Besides, Chinese characters are encoded as GBK by default on the Windows platform, while on Linux they are encoded as UTF-8. This may lead to a crash when you use '-tokenize' or '-add_ex_label'.
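Following that suggestion, one way to sidestep the GBK/UTF-8 mismatch is to pass `encoding="utf-8"` explicitly wherever the JSON data is opened, instead of relying on the platform default. A minimal sketch, assuming the data is plain JSON files; the `load_json` helper and `demo.json` filename are illustrative, not from the repo:

```python
import json

def load_json(path):
    # Open with an explicit encoding so Windows does not fall back to
    # its locale default (GBK) and fail on UTF-8 Chinese characters.
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

# Round-trip check with a Chinese string.
with open("demo.json", "w", encoding="utf-8") as f:
    json.dump({"text": "你好"}, f, ensure_ascii=False)

print(load_json("demo.json")["text"])  # → 你好
```

The same `encoding="utf-8"` argument applies to every `open()` call in the preprocessing and training scripts that reads or writes text.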