facebookresearch / GENRE

Autoregressive Entity Retrieval

Regarding training of Entity Disambiguation model reported in GENRE paper on a new data #60

Closed dineshkh closed 3 years ago

dineshkh commented 3 years ago

Hi Nicola,

I want to train the Entity Disambiguation model reported in the GENRE paper (not mGENRE) from scratch on my own data. Can you please tell me the steps to generate the training data in the format expected by GENRE? Which scripts should I use, and in what order?

I can see these scripts: https://github.com/facebookresearch/GENRE/blob/main/scripts_genre/preprocess_fairseq.sh (what values should I provide for the arguments --source-lang, --target-lang, --srcdict, and --tgtdict, given that I am not working in the multilingual setting?)

https://github.com/facebookresearch/GENRE/blob/main/scripts_genre/train.sh (again, what values should I provide for the arguments --source-lang, --target-lang, and --task?).

nicola-decao commented 3 years ago

Hi!

  1. What you need to do is create 4 files (called train.source, dev.source, train.target, and dev.target, holding the training and validation data for source and target respectively). Each line of a source file contains a sentence (e.g., Einstein was a [START_ENT] German [END_ENT] physicist.) and the corresponding line of the target file contains its target prediction (e.g., Germany); see the sketch after this list.
  2. Then use https://github.com/facebookresearch/GENRE/blob/main/scripts_genre/preprocess_fairseq.sh to convert the files for fairseq. You do not need to provide --source-lang, --target-lang, --srcdict, or --tgtdict.
  3. Then you can run https://github.com/facebookresearch/GENRE/blob/main/scripts_genre/train.sh; it already sets --task, and there is no need to specify --source-lang and --target-lang as they are inferred from the file names.
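
A minimal sketch of step 1 (the data/ directory name and the second example pair are assumptions added for illustration): each line of *.source holds one annotated sentence and the same line of *.target holds the entity name to generate.

```bash
mkdir -p data

# One annotated sentence per line; the mention is wrapped in [START_ENT] ... [END_ENT].
cat > data/train.source << 'EOF'
Einstein was a [START_ENT] German [END_ENT] physicist.
He developed the theory of [START_ENT] relativity [END_ENT].
EOF

# The matching line holds the entity name the model should generate.
cat > data/train.target << 'EOF'
Germany
Theory of relativity
EOF

# dev.source and dev.target follow exactly the same format.
```
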
dineshkh commented 3 years ago

Thanks, Nicola, for the reply. Just one clarifying question: are there any example train.source and train.target files available that I can look at?

nicola-decao commented 3 years ago

Unfortunately, no. I no longer have access to the machines I used for these experiments.

dineshkh commented 3 years ago

Is it okay to use the fairseq_entity_disambiguation_blink model you provide for the BPE preprocessing and to binarize the dataset with the https://github.com/facebookresearch/GENRE/blob/main/scripts_genre/preprocess_fairseq.sh script?

nicola-decao commented 3 years ago

Yes, all models except the multilingual GENRE (mGENRE) use the same tokenizer.
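
For example, one could point the preprocessing script at the unpacked blink checkpoint so the new data is encoded with the same BPE vocabulary and dictionaries. The argument order below is an assumption (it is not confirmed in this thread), so check the variables at the top of preprocess_fairseq.sh before running:

```bash
# Download and unpack the pretrained fairseq checkpoint (link from the repo README).
wget https://dl.fbaipublicfiles.com/GENRE/fairseq_entity_disambiguation_blink.tar.gz
tar -xzvf fairseq_entity_disambiguation_blink.tar.gz

# Assumption: the first argument is the directory with {train,dev}.{source,target},
# the second is the model directory providing the shared BPE/dictionary files.
bash scripts_genre/preprocess_fairseq.sh data fairseq_entity_disambiguation_blink
```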

dineshkh commented 3 years ago

Thanks, Nicola, training on new entities is working fine. I have one more question: what batch size did you use while training the Entity Disambiguation model? To increase the batch size, which argument should I change (--batch-size, --required-batch-size-multiple, or --max-tokens)?

nicola-decao commented 3 years ago

I trained with a variable batch size with fairseq-train, so I cannot give an exact number. Essentially, on a cluster, the training script tries to fill as much of the GPU memory with data as possible. I believe the average batch size was around 4k.

dineshkh commented 3 years ago

What value of --max-tokens did you use when training the Entity Disambiguation model on 32 GPUs (as reported in the paper)? And what would you recommend for the number of steps and the --max-tokens value when training the model on 1k examples with 8 GPUs?

nicola-decao commented 3 years ago

I used --max-tokens 1024. If your data is just 1k examples, I would also use a very small learning rate.
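
For reference, a minimal sketch of the fairseq-train arguments relevant to this discussion, to be merged with the rest of the flags already set in scripts_genre/train.sh; the learning rate and step count are illustrative guesses for a ~1k-example dataset, not reported hyperparameters:

```bash
# data/bin is assumed to be the binarized output of preprocess_fairseq.sh.
fairseq-train data/bin \
    --max-tokens 1024 \
    --update-freq 1 \
    --lr 1e-05 \
    --max-update 2000
```

With --max-tokens, fairseq builds variable-size batches that fill each GPU up to the token limit, so the effective batch size scales with the number of GPUs and --update-freq rather than being fixed by --batch-size.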