facebookresearch / SpanBERT

Code for using and evaluating SpanBERT.

training from scratch #45

Open muiPomeranian opened 4 years ago

muiPomeranian commented 4 years ago

Thanks for this amazing code base!

I am new to this code base, especially the part about pretraining from scratch.

  1. What kind of public dataset can I use? Am I supposed to use Wikipedia and BookCorpus, as BERT does? I found this in Google's repo: https://github.com/google-research/bert#pre-training-data

> **Pre-training data**: We will not be able to release the pre-processed datasets used in the paper. For Wikipedia, the recommended pre-processing is to download the latest dump, extract the text with WikiExtractor.py, and then apply any necessary cleanup to convert it into plain text.
>
> Unfortunately the researchers who collected the BookCorpus no longer have it available for public download. The Project Guttenberg Dataset is a somewhat smaller (200M word) collection of older books that are public domain.

Would it be sufficient (just to mock the pretraining) to use the Wikipedia dump plus the Guttenberg dataset? Would it be OK to simply run your preprocessing command on that data? Could you give me more explicit directions? (A rough sketch of what I have in mind is below.)

1-2. Did you also use the Wikipedia dump + Guttenberg?
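For concreteness, this is roughly what I had in mind for the Wikipedia side (just a sketch; the WikiExtractor.py invocation and the concatenation step are my own guesses, and I don't know what extra cleanup your preprocessing scripts expect):

```bash
# Download the latest English Wikipedia dump (any recent dump should do).
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

# Extract plain text with WikiExtractor.py (https://github.com/attardi/wikiextractor).
python WikiExtractor.py enwiki-latest-pages-articles.xml.bz2 -o extracted_wiki

# Concatenate the extracted shards into a single plain-text corpus file.
find extracted_wiki -name 'wiki_*' -print0 | xargs -0 cat > wiki_corpus.txt
```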

2. When I run this command:

```bash
python train.py /path/to/preprocessed_data \
  --total-num-update 2400000 --max-update 2400000 --save-interval 1 \
  --arch cased_bert_pair_large --task span_bert --optimizer adam \
  --lr-scheduler polynomial_decay --lr 0.0001 --min-lr 1e-09 \
  --criterion span_bert_loss --max-tokens 4096 --tokens-per-sample 512 \
  --weight-decay 0.01 --skip-invalid-size-inputs-valid-test \
  --log-format json --log-interval 2000 \
  --save-interval-updates 50000 --keep-interval-updates 50000 \
  --update-freq 1 --seed 1 --save-dir /path/to/checkpoint_dir --fp16 \
  --warmup-updates 10000 --schemes [\"pair_span\"] \
  --distributed-port 12580 --distributed-world-size 32 \
  --span-lower 1 --span-upper 10 --validate-interval 1 --clip-norm 1.0 \
  --geometric-p 0.2 --adam-eps 1e-8 --short-seq-prob 0.0 \
  --replacement-method span --clamp-attention --no-nsp \
  --pair-loss-weight 1.0 --max-pair-targets 15 \
  --pair-positional-embedding-size 200 --endpoints external
```

I got this error:

```
  File "train.py", line 381, in <module>
    distributed_main(args)
  File "/home/user/spanBertReform/pretraining/distributed_train.py", line 37, in main
    args.distributed_rank = distributed_utils.distributed_init(args)
  File "/home/user/spanBertReform/pretraining/fairseq/distributed_utils.py", line 65, in distributed_init
    rank=args.distributed_rank,
  File "/home/user/anaconda3/envs/spanBertReform/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 406, in init_process_group
    store, rank, world_size = next(rendezvous(url))
  File "/home/user/anaconda3/envs/spanBertReform/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 130, in _env_rendezvous_handler
    raise _env_error("MASTER_ADDR")
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable MASTER_ADDR expected, but not set
```

How can I resolve this?
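From the traceback, it looks like the env:// rendezvous needs MASTER_ADDR (and MASTER_PORT) to be set in the environment before torch.distributed can initialize. A minimal single-machine sketch of what I think is missing (the values here are my own assumptions):

```bash
# Address/port of the rank-0 process; on one machine, localhost is enough.
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=12580   # any free port; 12580 just mirrors --distributed-port above

# Then re-run the original train.py command unchanged.
python train.py /path/to/preprocessed_data ...  # remaining flags as in the command above
```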

3. Any recommendation for running this script without P3 instances? (My current machine only has 2x RTX 2080 Ti. :( )
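My own guess, in case it helps: since --distributed-world-size 32 mostly determines the effective batch size, on a 2-GPU box it might be possible to compensate with gradient accumulation, assuming --update-freq behaves as in stock fairseq:

```bash
# Roughly preserve the effective batch size: 32 GPUs x update-freq 1
# ~= 2 GPUs x update-freq 16 (gradient accumulation), at the cost of wall-clock time.
python train.py /path/to/preprocessed_data \
  --distributed-world-size 2 \
  --update-freq 16 \
  ...   # all other flags as in the original command
# If 11 GB cards run out of memory, lower --max-tokens and raise --update-freq further.
```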

muiPomeranian commented 4 years ago

For 3): https://github.com/NVIDIA/apex/issues/99 suggests using `python -m torch.distributed.launch train.py --arg...`, but then it complains: `error: unrecognized arguments: --local_rank=0`
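As far as I can tell, torch.distributed.launch starts one process per GPU and appends --local_rank=<i> to the script's argument list, and the fairseq-style parser in train.py does not define that option, which is exactly the "unrecognized arguments" error. A sketch of a launcher-free alternative (assuming train.py honors the env:// variables mentioned in the traceback above):

```bash
# What the launcher effectively runs for GPU i (simplified):
#   MASTER_ADDR=... MASTER_PORT=... WORLD_SIZE=2 RANK=i python train.py --local_rank=i <args>
# Since train.py rejects --local_rank, skip the launcher: set the rendezvous variables
# yourself and let the script's own --distributed-* flags drive the setup.
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=12580
python train.py /path/to/preprocessed_data \
  --distributed-world-size 2 --distributed-port 12580 \
  ...  # remaining flags as in the original command
```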

pritamqu commented 3 years ago

Any solution on this thread? I am facing the same problem...

houliangxue commented 3 years ago

I am also a newbie and facing the same problem. So is the data preprocessing suitable for a Wikipedia dump? Thanks! @muiPomeranian