Closed: dineshkh closed this issue 3 years ago
Hi! You need to prepare four files: `train.source`, `dev.source`, `train.target`, and `dev.target` (training and validation splits for the source and target respectively). The source file will contain one sentence per line (e.g., `Einstein was a [START_ENT] German [END_ENT] physicist.`) and the corresponding line in the target file will contain its target prediction (e.g., `Germany`). Then you apply BPE and binarize the dataset with fairseq. For `preprocess_fairseq.sh` you do not need to provide `--source-lang`, `--target-lang`, `--srcdict`, or `--tgtdict`; for `train.sh` you can keep `--task` as it is, and there is also no need to specify `--source-lang` and `--target-lang` as they are already specified by the names of the files.
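To make the layout concrete, here is a toy sketch of aligned source/target files (the second pair is hypothetical and only added to show that line *i* of the source must align with line *i* of the target; `dev.source`/`dev.target` follow the same layout):

```bash
# Toy example of the parallel file layout; the MIT pair is hypothetical.
cat > train.source << 'EOF'
Einstein was a [START_ENT] German [END_ENT] physicist.
She studied at [START_ENT] MIT [END_ENT] in Boston.
EOF

cat > train.target << 'EOF'
Germany
Massachusetts Institute of Technology
EOF
```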
Thanks Nicola for the reply. Just one clarification question: is there any example of `train.source` and `train.target` files available which I can see?
Unfortunately no. I no longer have access to the machines I used for these experiments.
Is it okay to use the `fairseq_entity_disambiguation_blink` model provided by you for BPE preprocessing and to binarize the dataset using the https://github.com/facebookresearch/GENRE/blob/main/scripts_genre/preprocess_fairseq.sh script?
Yes, all models except the multilingual GENRE use the same tokenizer.
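For reference, the binarization performed by the script boils down to a standard `fairseq-preprocess` call that reuses the dictionaries shipped with the downloaded model, which is why you do not have to pass those flags yourself. The sketch below only illustrates this; the data paths and dictionary file names are assumptions, so check `preprocess_fairseq.sh` and your model folder for the actual ones:

```bash
# Sketch of what the preprocessing step amounts to (not the exact repo script).
# Assumes BPE-encoded files at data/{train,dev}.bpe.{source,target} and that the
# downloaded model folder contains dict.source.txt / dict.target.txt.
MODEL=fairseq_entity_disambiguation_blink

fairseq-preprocess \
    --source-lang source --target-lang target \
    --trainpref data/train.bpe \
    --validpref data/dev.bpe \
    --srcdict "$MODEL/dict.source.txt" \
    --tgtdict "$MODEL/dict.target.txt" \
    --destdir data/bin \
    --workers 16
```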
Thanks Nicola, training on new entities is working fine. I have one more question: what batch size did you use while training the Entity Disambiguation model? And to increase the batch size, which argument should I change (`--batch-size`, `--required-batch-size-multiple`, or `--max-tokens`)?
I trained with a variable batch size with `fairseq-train`, so I cannot give an exact number. Basically, on a cluster the training script tries to fill all GPU memory with as much data as possible. I believe the average batch size was around 4k.
What value of `--max-tokens` did you use for training the Entity Disambiguation model on 32 GPUs (as reported in the paper)? And what would be your recommendation for the number of steps and the `--max-tokens` value when training the model on 1k examples with 8 GPUs?
I used `--max-tokens 1024`. If your data is just 1k examples, I would also use a very small learning rate.
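As a concrete starting point, a fine-tuning command could look like the sketch below. Only `--max-tokens 1024` comes from this thread; the learning rate, number of updates, checkpoint path, and remaining flags are assumptions following the usual fairseq BART fine-tuning recipe, and the repo's `train.sh` remains the authoritative reference:

```bash
# Hypothetical fine-tuning sketch for a small (~1k example) dataset.
# Flags other than --max-tokens 1024 are illustrative guesses, not the paper's values.
fairseq-train data/bin \
    --arch bart_large \
    --task translation \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --restore-file fairseq_entity_disambiguation_blink/model.pt \
    --reset-optimizer --reset-dataloader --reset-meters \
    --optimizer adam --lr 3e-06 \
    --max-tokens 1024 \
    --max-update 2000 \
    --save-dir checkpoints
```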
Hi Nicola,
I want to train the Entity Disambiguation model reported in the GENRE paper (not mGENRE) from scratch on my own data. Can you please tell me the steps to generate the training data in the format expected by GENRE? Which scripts should I use and in what order?
I can see these scripts:

- https://github.com/facebookresearch/GENRE/blob/main/scripts_genre/preprocess_fairseq.sh (what values should I provide for the arguments `--source-lang`, `--target-lang`, `--srcdict`, and `--tgtdict`, given that I am not working in the multilingual setting?)
- https://github.com/facebookresearch/GENRE/blob/main/scripts_genre/train.sh (again, what values should I provide for the arguments `--source-lang`, `--target-lang`, and `--task`?)