DreamInvoker / GAIN

Source code for EMNLP 2020 paper: Double Graph Based Reasoning for Document-level Relation Extraction
MIT License

Question about text's length and use of other models #4

Closed alejandrojcastaneira closed 3 years ago

alejandrojcastaneira commented 3 years ago

First of all, I would like to thank you for your great work and paper!

I have been experimenting with the training scripts you provide, and they have worked well.

So I would like to ask two questions.

Are there any limitations on the size of the sentences/documents that the model with the BERT encoder can process, given that BERT is limited to 512 sub-word units?

And if I would like to experiment with other languages and therefore use other encoders, say bert-base-multilingual-cased or xlm-roberta-base, is it enough to create a folder for these models and download/place the files pytorch_model.bin, vocab.txt, etc., accordingly?

I also imagine that it would be necessary to create a GAIN_BERT_MUL training script that points to the folder of the new model and modifies the parameters as required.

Best regards

DreamInvoker commented 3 years ago

Thank you for your attention!

  1. On the max length of the input text: yes, the maximum number of sub-word tokens is set to 512 in this line for GAIN_BERT inputs. We did not write processing code to handle this limit, because all documents in DocRED have fewer than 512 sub-word tokens; otherwise an assertion error is raised when running the code. Thank you for pointing this out; it will make the code more scalable, and we will fix it as soon as possible. There is no limitation on the number of sentences.

  2. On training with other pre-trained language models: since different families of pre-trained language models (e.g., the RoBERTa family and the BERT family) require different input formats, if you want to use another type of PLM you should change the code in this Class to fit its input format. If you want to use bert-base-multilingual-cased, which is of the same type as BERT, then creating a folder for it and downloading/placing the files pytorch_model.bin, vocab.txt, etc., accordingly is indeed enough (see the sketch at the end of this reply).

I also imagine that it would be necessary to create a GAIN_BERT_MUL training script that points to the folder of the new model and modifies the parameters as required.

Yes, exactly.
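
For reference, here is a minimal sketch of populating such a folder with the transformers library. The PLM/bert-base-multilingual-cased path is only an assumption; adjust it to whatever the new training script expects.

# Sketch: download bert-base-multilingual-cased and save its files into a
# local PLM folder so it can be loaded like the other BERT checkpoints.
from transformers import BertModel, BertTokenizer

model_name = "bert-base-multilingual-cased"
target_dir = "PLM/bert-base-multilingual-cased"  # hypothetical location

tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

tokenizer.save_pretrained(target_dir)  # writes vocab.txt and tokenizer configs
model.save_pretrained(target_dir)      # writes config.json and the model weights
                                       # (pytorch_model.bin on older versions)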

Best

alejandrojcastaneira commented 3 years ago

Thank you very much for your quick reply!

I think it would then be easier, for our custom domain and languages, to start by experimenting with bert_multilingual_cased and testing the results.

Some of the documents we process have more than 512 sub-word tokens, so if there is a possibility to extend this limit to longer documents, that would be great!

I will keep you updated on how our experiments turn out and thank you again for the good work.

DreamInvoker commented 3 years ago

We have fixed this scalability issue in this commit

Good luck with your experiments.

alejandrojcastaneira commented 3 years ago

If I understood this commit correctly, the text will be truncated to 512 sub-word units and will no longer raise an assertion, but the rest will be discarded. Would there be a way to use all of the text, maybe using some sliding-window technique over the text and then combining the results of all the predictions? I assume this would take many modifications.

DreamInvoker commented 3 years ago

Yes, the common method is to truncate them, and it is indeed a simplification. You could otherwise try other pre-trained encoders such as Longformer or BigBird to cope with this kind of limitation. Or you could divide the long document into several pieces and use a method like SciREX to ensemble them.
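
As a rough illustration of the sliding-window idea (this is not something the current code does, and merging the per-window predictions would still need extra work):

# Toy sketch: split a long sub-word token sequence into overlapping windows
# of at most 512 tokens. How to combine the per-window predictions is left
# open and would require further changes to the model and evaluation code.
def sliding_windows(token_ids, max_len=512, stride=256):
    windows = []
    start = 0
    while start < len(token_ids):
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break
        start += stride
    return windows

chunks = sliding_windows(list(range(1300)))   # a 1300-token "document"
print([len(c) for c in chunks])               # [512, 512, 512, 512, 276]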

alejandrojcastaneira commented 3 years ago

Thank you for the quick reply!

I was able to configure the bert_multilingual_cased model; however, I noted that the models you used were both uncased. Is there any reason for this?

I also noticed a file inside DocRED called word2id, whose function is not clear to me. I assume it contains all the words in the text in lowercase form. If I want to use a cased model like bert_multilingual_cased on my own texts, I should build a word2id file for my own text, but in that case, should it be cased? Is there some intermediate processing step in GAIN where the text is lowercased?

DreamInvoker commented 3 years ago

We use the uncased version because it is a common best practice among previous work.

The word2id.json file is for the GAIN_GloVe model, which does not use BERT as the encoder. When using BERT, we use the vocabulary provided by transformers, i.e., the vocab.txt file in the directory PLM/xxx/. If you want to use the cased version of a BERT-family model, you could change the configuration in PLM/xxx/config.json.

alejandrojcastaneira commented 3 years ago

Is it correct to assume that there is no limitation on the input size in the GAIN_GloVe model?

DreamInvoker commented 3 years ago

No, we limit the input size to 512 for GAIN_GloVe in the Dataset class.

alejandrojcastaneira commented 3 years ago

I saw the limitation in the source code. If we increase this limit, would it be possible to run the model using the LSTM encoder and GloVe vectors? If that is the case, we should also provide the word2id.json and vec.npy files, as you mentioned before. I am also wondering how the vec.npy file is structured.

DreamInvoker commented 3 years ago

If you want to use a larger limit, just increasing it is enough to run the model.

vec.npy is the DocRED-specific version extracted from the GloVe pre-trained embeddings.
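
For illustration, a rough sketch of how such a word2id.json / vec.npy pair could be built for a custom vocabulary. The special-token indices and the exact layout GAIN expects should be double-checked against the preprocessing code, and the GloVe path is just an example.

# Rough sketch: build word2id.json and vec.npy for a custom vocabulary from
# GloVe vectors in text format. Verify the special-token convention and the
# expected array layout against the repo's preprocessing scripts.
import json
import numpy as np

glove_path = "glove.840B.300d.txt"       # example path to GloVe text vectors
vocab = ["the", "company", "founded"]    # your corpus vocabulary
dim = 300

word2id = {"BLANK": 0, "UNK": 1}         # assumed special-token convention
for w in vocab:
    if w not in word2id:
        word2id[w] = len(word2id)

# Random init for words missing from GloVe, then overwrite with GloVe rows.
vectors = np.random.uniform(-0.01, 0.01, (len(word2id), dim)).astype("float32")
with open(glove_path, encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        word, values = " ".join(parts[:-dim]), parts[-dim:]
        if word in word2id:
            vectors[word2id[word]] = np.asarray(values, dtype="float32")

with open("word2id.json", "w") as f:
    json.dump(word2id, f)
np.save("vec.npy", vectors)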

alejandrojcastaneira commented 3 years ago

As you suggested, I have increased the number of words that can be loaded for a single sample in the dataset class, in order to process longer documents with the LSTM encoder.

I also created my own word2id file containing the full vocabulary of my datasets and the corresponding vec.npy file with the vectors for all the words in the vocabulary. Since these vectors differ from the defaults, I increased the vocabulary size to 1,001,000 and the embedding size to 300 inside the run.sh script and config.py.

The dataset loads well and is pre-processed, but I am getting this error in the first iteration.

2020-10-30 09:54:44.632033 training from scratch with lr 0.001
2020-10-30 09:54:50.493775 begin..
Traceback (most recent call last):
  File "train.py", line 231, in <module>
    train(opt)
  File "train.py", line 138, in train
    ht_pair_distance=d['ht_pair_distance']
  File "/home/ale/anaconda3/envs/info_extraction/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/janzz11/PycharmProjects/Janzz_Parser_API/server/entity_relations/GAIN/code/models/GAIN.py", line 122, in forward
    features = GCN_layer(graph_big, {"node": features})["node"]  # [total_mention_nums, gcn_dim]
  File "/home/ale/anaconda3/envs/info_extraction/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/janzz11/PycharmProjects/Janzz_Parser_API/server/entity_relations/GAIN/code/models/GAIN.py", line 616, in forward
    hs = self.conv(g, inputs, mod_kwargs=wdict)
  File "/home/ale/anaconda3/envs/info_extraction/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ale/anaconda3/envs/info_extraction/lib/python3.6/site-packages/dgl/nn/pytorch/hetero.py", line 163, in forward
    **mod_kwargs.get(etype, {}))
  File "/home/ale/anaconda3/envs/info_extraction/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ale/anaconda3/envs/info_extraction/lib/python3.6/site-packages/dgl/nn/pytorch/conv/graphconv.py", line 155, in forward
    graph.srcdata['h'] = feat
  File "/home/ale/anaconda3/envs/info_extraction/lib/python3.6/site-packages/dgl/view.py", line 296, in __setitem__
    self._graph._set_n_repr(self._ntid, self._nodes, {key : val})
  File "/home/ale/anaconda3/envs/info_extraction/lib/python3.6/site-packages/dgl/heterograph.py", line 2437, in _set_n_repr
    ' Got %d and %d instead.' % (nfeats, num_nodes))
dgl._ffi.base.DGLError: Expect number of features to match number of nodes (len(u)). Got 133 and 137 instead.

DreamInvoker commented 3 years ago

Could you give me more debug information?

alejandrojcastaneira commented 3 years ago

Sure, here is the complete log file:

https://drive.google.com/drive/folders/1CH7u0BZKp8N_25gOoYroF6crPwWKpWK4?usp=sharing

I was able to train the GAIN_GloVe model on the DocRED data previously, and I also trained on this custom data using the BERT model.

DreamInvoker commented 3 years ago

Is ../data/prepro_data/train_GloVe.pkl created by the GAIN_GloVe model on DocRED, or GAIN_GloVe model on your custom dataset?

I observed in the log that this .pkl file was loaded:

Reading data from ../data/train_annotated.json.

load preprocessed data from ../data/prepro_data/train_GloVe.pkl.

Reading data from ../data/dev.json.

load preprocessed data from ../data/prepro_data/dev_GloVe.pkl.

alejandrojcastaneira commented 3 years ago

It's created from my custom dataset. If I delete it from /prepro_data, it is re-created every time I run the model.

DreamInvoker commented 3 years ago

You could try to debug the code and print graph.number_of_nodes() here and the mention num here. This error may be caused by an inconsistency between their values, possibly due to a wrong mention_id list created from your custom dataset.

alejandrojcastaneira commented 3 years ago

I did as you suggested and I have updated and uploaded the log file.

If it's OK with you, I could share some of the sample data that I'm using for training, together with the ner2id, rel2id.json, etc. files.

DreamInvoker commented 3 years ago

I observe that you use batchsize = 1 and correctly pass some batches (because the log prints the mention num three times). Maybe there is something wrong with some of your sample data. You could set shuffle = False here to fix the order of the batches and check whether some samples have wrong inputs, such as mention_id here.
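
Something like the following generic check could help; the 'graphs' key and the tensor shapes are hypothetical, so adapt the names to the batch dict your dataloader actually yields.

# Sketch of the suggested per-sample check, run with shuffle=False so the
# batch order is reproducible. Field names other than mention_id are
# hypothetical and must be adapted to the real batch dict.
def check_mention_graphs(loader):
    for step, d in enumerate(loader):
        mention_id = d['mention_id']             # [batch_size, max_doc_len]
        for i, graph in enumerate(d['graphs']):  # one mention graph per document
            max_id = int(mention_id[i].max())
            num_nodes = graph.number_of_nodes()
            if max_id != num_nodes - 1:
                print(f"step {step}, doc {i}: mention_id max {max_id}, "
                      f"graph has {num_nodes} nodes")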

alejandrojcastaneira commented 3 years ago

I have applied these changes now and also updated the logs. I see there are 55 entities in the annotated sample, but only 54 are represented in the mention_id tensor; maybe that is why it is raising the error. You are right, I have to check whether there is a possible inconsistency in the creation of this sample.

DreamInvoker commented 3 years ago

Yes, this is exactly the reason.

alejandrojcastaneira commented 3 years ago

I have been debugging this. In the failing sample, the document is truncated so that the second-to-last entity is cut in half and the last entity is omitted. However, this seems to happen in the dataloader part, after the sample is pre-processed; otherwise the assert mention_id.max() == graph.number_of_nodes() - 1 in the DGLREDataset class would trigger before the file is created.

alejandrojcastaneira commented 3 years ago

Well, long story short: the "UNK" token should not have index 0 in the word2id.json file, because index 0 collides with the internal padding representation of each sample, so "UNK" tokens are not counted as used words here and the number of words in the sample ends up truncated.
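
For illustration (hypothetical entries, not necessarily the exact convention of the repo's preprocessing), a layout that avoids the clash looks like this:

# Illustration only: keep index 0 for the padding/blank token and give "UNK"
# a non-zero index, so unknown words are not confused with padding when the
# number of used words in a sample is counted.
word2id = {
    "BLANK": 0,   # padding / empty positions
    "UNK": 1,     # unknown words, must not share index 0 with padding
    "the": 2,
    "company": 3,
    # ... rest of the vocabulary
}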

After changing this, it's training perfectly.

DreamInvoker commented 3 years ago

Good job!