facebookresearch / ELI5

Scripts and links to recreate the ELI5 dataset.

Using the pre-trained model #11

Closed · s3xton closed this issue 4 years ago

s3xton commented 5 years ago

Hey,

Sorry if this overlaps with #4, but is it necessary to generate the entire dataset to make use of the pre-trained model?

I'm building a QA system, and I'm wondering whether it's possible to take the pre-trained model plus some supporting documentation not included in the existing dataset and generate answers to input questions against that documentation in real time.

yjernite commented 5 years ago

Hello,

You do not need to download the full dataset to use the pre-trained model; however, you will need to process your data using our scripts so that it is in the right format. This includes:

Good luck!
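One quick way to confirm the processed files are in the right shape before binarizing (a generic sanity check, not an official ELI5 script; file names follow the conventions used later in this thread) is to verify that each split's source and target files are line-aligned, since fairseq pairs them line by line:

# Counts must match for each split (train/valid/test) and each side.
wc -l formatted_files/train.multitask_source_bpe \
      formatted_files/train.multitask_target_bpe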

s3xton commented 5 years ago

Hey,

Thanks for getting back to me so fast. I'm glad that it's possible; this is really cool work.

So I've tried running the command below as part of test_model_code_scripts.sh, but I get the following error:

cat testing_files/output_for_multitask_bpe.txt | python ~/fairseq/interactive.py ~/fairseq/data-bin/eli5_data \
    --path multitask_checkpoint.pt --task translation \
    --source-lang multitask_source_bpe --target-lang multitask_target_bpe \
    --beam 5 --nbest 1 --prefix-size 0 --remove-bpe \
    --max-len-a 0 --max-len-b 500 --min-len 250 \
    --buffer-size 1 --batch-size 1 --no-repeat-ngram-size 3
FileNotFoundError: [Errno 2] No such file or directory: '/root/fairseq/data-bin/eli5_data/dict.multitask_source_bpe.txt'
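For reference, the missing file is an ordinary fairseq dictionary: plain text, with one "<token> <count>" pair per line, ordered to match the model's embedding rows. A hedged illustration (the tokens and counts below are made up, not the real ELI5 vocabulary):

$ head -4 ~/fairseq/data-bin/eli5_data/dict.multitask_source_bpe.txt
the 1234567
, 1156789
of 987654
an@@ 45678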

I see that those dict files are generated as part of generating the dataset; how can I generate them without generating the whole dataset?

I tried to generate them myself using just my own simple JSON dataset files and the script below, but I'm not sure if this is the right way to go about it:

OUTPUT_PATH=formatted_files
PATH_TO_DATA=processed_data
mkdir -p "$OUTPUT_PATH"
python process_data_to_source_target.py --input "$PATH_TO_DATA" --output "$OUTPUT_PATH"

subword-nmt apply-bpe -c bpe_codes.txt < formatted_files/train.multitask_source > formatted_files/train.multitask_source_bpe
subword-nmt apply-bpe -c bpe_codes.txt < formatted_files/test.multitask_source > formatted_files/test.multitask_source_bpe
subword-nmt apply-bpe -c bpe_codes.txt < formatted_files/valid.multitask_source > formatted_files/valid.multitask_source_bpe
subword-nmt apply-bpe -c bpe_codes.txt < formatted_files/train.multitask_target > formatted_files/train.multitask_target_bpe
subword-nmt apply-bpe -c bpe_codes.txt < formatted_files/test.multitask_target > formatted_files/test.multitask_target_bpe
subword-nmt apply-bpe -c bpe_codes.txt < formatted_files/valid.multitask_target > formatted_files/valid.multitask_target_bpe

cd ~/fairseq
TEXT=/workspace/ELI5/model_code/formatted_files
python preprocess.py --source-lang multitask_source_bpe --target-lang multitask_target_bpe \
   --validpref $TEXT/valid --testpref $TEXT/test --trainpref $TEXT/train --destdir data-bin/eli5

since then I get the same size-mismatch error as in #10:

RuntimeError: Error(s) in loading state_dict for TransformerModel:
    size mismatch for encoder.embed_tokens.weight: copying a param with shape torch.Size([52712, 1024]) from checkpoint, the shape in current model is torch.Size([168, 1024]).
    size mismatch for decoder.embed_tokens.weight: copying a param with shape torch.Size([52864, 1024]) from checkpoint, the shape in current model is torch.Size([160, 1024]).
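The shapes point at the cause: the checkpoint was trained against dictionaries of roughly 52k entries, while preprocess.py built fresh, tiny dictionaries (168 and 160 entries) from the small dataset. The usual fairseq remedy is to reuse the model's released dictionaries instead of building new ones; a hedged sketch, assuming dictionary files named after the path in the earlier error (not confirmed names):

# --srcdict/--tgtdict tell preprocess.py to reuse existing
# dictionaries rather than building new ones from the data.
python preprocess.py --source-lang multitask_source_bpe --target-lang multitask_target_bpe \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --srcdict dict.multitask_source_bpe.txt --tgtdict dict.multitask_target_bpe.txt \
    --destdir data-bin/eli5_data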

Thanks for the help!

huihuifan commented 5 years ago

Hello, I will upload the dictionary file and add it to the readme.


huihuifan commented 5 years ago

Just added it in the model_code directory, and modified the readme to include the parameters to use with fairseq-py. See 64adbbdda26cdc2c9dc30d1e3d7f212dc9d7d901
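Once those dictionary files are downloaded, a minimal sketch of wiring them up is to place them where interactive.py looked in the error earlier in this thread (the file names mirror that error message and are assumptions, not confirmed):

# Put the released dictionaries where interactive.py expects them.
mkdir -p ~/fairseq/data-bin/eli5_data
cp model_code/dict.multitask_source_bpe.txt ~/fairseq/data-bin/eli5_data/
cp model_code/dict.multitask_target_bpe.txt ~/fairseq/data-bin/eli5_data/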