I am facing the same issue. Is the parameter unsuitable for this model?
`generate_dict_using_pretrained_embedding` should be set to `false`; otherwise the dict file is generated from all words/tokens in the pre-trained embedding, which can blow up the number of model parameters. As for embedding quality: is the pre-trained embedding from a domain related to your training set? You can also try setting `num_epochs_static_embedding` to n (for example, n = 2), which keeps the word embeddings frozen for the first n epochs.
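For illustration, a sketch of the two config fragments that would change (everything else stays as in the config below; the value 2 for `num_epochs_static_embedding` is just an example):

```json
"data": {
  "generate_dict_using_json_files": true,
  "generate_dict_using_pretrained_embedding": false
},
"train": {
  "num_epochs_static_embedding": 2
}
```

This way the dict is built only from tokens that actually occur in your JSON files, while `token_pretrained_file` is still used to initialize the vectors for the tokens that made it into the dict.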
Hi,
could you please give some guidance on how to use pretrained embeddings? I am not sure whether they are configured correctly, because my results are worse than with embeddings trained from scratch by your toolkit on my relatively small corpus. Furthermore, the checkpoints are very large with pretrained embeddings (72 GB). What is the correct configuration if I want to use pretrained embeddings, e.g. fastText Wikipedia, and reuse my train/val/test JSON files?
Embeddings (300 dimensions) were downloaded from:
- https://fasttext.cc/docs/en/pretrained-vectors.html
- https://deepset.ai/german-word-embeddings
My configuration is as follows:
"task_info":{ "label_type": "multi_label", "hierarchical": false, "hierar_taxonomy": "data/a.taxonomy", "hierar_penalty": 0.000001 }, "device": "cuda", "model_name": "TextRNN", "checkpoint_dir": "checkpoint_dir_a_TextRNNLSTMFastTextWiki", "model_dir": "trained_model_a_TextRNNLSTMFastTextWiki", "data": { "train_json_files": [ "a5_stratified_train.json" ], "validate_json_files": [ "a5_stratified_val.json" ], "test_json_files": [ "a5_stratified_test.json" ], "generate_dict_using_json_files": true, "generate_dict_using_all_json_files": true, "generate_dict_using_pretrained_embedding": true, "dict_dir": "dict_a_TextRNNLSTMFastTextWiki", "num_worker": 4 }, "feature": { "feature_names": [ "token" ], "min_token_count": 2, "min_char_count": 2, "token_ngram": 0, "min_token_ngram_count": 0, "min_keyword_count": 0, "min_topic_count": 2, "max_token_dict_size": 1000000, "max_char_dict_size": 150000, "max_token_ngram_dict_size": 10000000, "max_keyword_dict_size": 100, "max_topic_dict_size": 100, "max_token_len": 512, "max_char_len": 1024, "max_char_len_per_token": 4, "token_pretrained_file": "fasttext/wiki.de.vec", "keyword_pretrained_file": "" }, "train": { "batch_size": 8, "start_epoch": 1, "num_epochs": 20, "num_epochs_static_embedding": 0, "decay_steps": 1000, "decay_rate": 1.0, "clip_gradients": 100.0, "l2_lambda": 0.0, "loss_type": "BCEWithLogitsLoss", "sampler": "fixed", "num_sampled": 5, "visible_device_list": "0", "hidden_layer_dropout": 0.5 }, "embedding": { "type": "embedding", "dimension": 300, "region_embedding_type": "word_context", "region_size": 5, "initializer": "uniform", "fan_mode": "FAN_IN", "uniform_bound": 0.25, "random_stddev": 0.01, "dropout": 0.0 }, "optimizer": { "optimizer_type": "Adam", "learning_rate": 0.008, "adadelta_decay_rate": 0.95, "adadelta_epsilon": 1e-08 }, "TextRNN": { "hidden_dimension": 64, "rnn_type": "LSTM", "num_layers": 1, "doc_embedding_type": "Attention", "attention_dimension": 16, "bidirectional": true }, "eval": { "text_file": "data/a_dev.json", "threshold": 0.5, "dir": "eval_dir/a-TextRNNLSTMFastTextWiki/", "batch_size": 1024, "is_flat": true, "top_k": 31, "model_dir": "checkpoint_dir_a_TextRNNLSTMFastTextWiki/TextRNN_best" }, "log": { "logger_file": "log_test_a_TextRNNFastText", "log_level": "warn" } }