alontalmor / MultiQA

SQUAD conversion code for new dataset #4

Closed isaacmg closed 5 years ago

isaacmg commented 5 years ago

Hi, and thanks for the paper and the code. I was wondering if you have the code that you originally used to convert SQuAD to the MultiQA format? I have another dataset (emrQA) in SQuAD format and would like to convert it to your format for training on MultiQA. I checked preprocess.py, but I assume that is preprocessing applied once the data is already converted.

Thank you.

alontalmor commented 5 years ago

Hi, and thanks for using the project. The code can be found here: https://github.com/alontalmor/MultiQA/blob/master/datasets/SQuAD/squad.py (you can see the original URL of the SQuAD dataset inside), and you can rebuild it using:

python build_dataset.py --dataset_name SQuAD --split dev --output_file path/to/output.jsonl.gz --n_processes 10

Does this answer your question?
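For readers adapting another SQuAD-style dataset, the part of the conversion that does not depend on MultiQA is walking the nested SQuAD v1.1 layout (data, paragraphs, qas, answers). The sketch below only flattens that layout into one record per question. The output record shape and the function name `iter_squad_examples` are assumptions for illustration; they are not the MultiQA format and not the code in squad.py.

```python
import json
from typing import Dict, Iterator


def iter_squad_examples(squad: Dict) -> Iterator[Dict]:
    """Yield one flat record per question from a SQuAD v1.1-style dict."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                yield {
                    "id": qa["id"],
                    "question": qa["question"],
                    "context": context,
                    # Each answer carries its text and its character offset
                    # into the context string.
                    "answers": [
                        (a["text"], a["answer_start"]) for a in qa["answers"]
                    ],
                }


# Tiny hand-written sample in SQuAD layout (illustrative, not real emrQA data).
sample = {
    "data": [{
        "title": "demo",
        "paragraphs": [{
            "context": "The patient was given aspirin.",
            "qas": [{
                "id": "q1",
                "question": "What was the patient given?",
                "answers": [{"text": "aspirin", "answer_start": 22}],
            }],
        }],
    }]
}

records = list(iter_squad_examples(sample))
for record in records:
    print(json.dumps(record))
```

A dataset that deviates from this layout (missing ids, answers stored under a different key, and so on) would need those differences handled inside the loop before any MultiQA-specific step.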

isaacmg commented 5 years ago

Yes it does. I will try building it and let you know if I have problems.

Thank you.

isaacmg commented 5 years ago

Hi, so I was trying to train on emrQA and I ran into the following issue:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/run.py", line 21, in <module>
    run()
  File "/usr/local/lib/python3.6/dist-packages/allennlp/run.py", line 18, in run
    main(prog="allennlp")
  File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/__init__.py", line 102, in main
    args.func(args)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/train.py", line 116, in train_model_from_args
    args.cache_prefix)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/train.py", line 160, in train_model_from_file
    cache_directory, cache_prefix)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/train.py", line 218, in train_model
    cache_prefix)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/training/trainer.py", line 788, in from_params
    (instance for key, dataset in all_datasets.items()
  File "/usr/local/lib/python3.6/dist-packages/allennlp/data/vocabulary.py", line 487, in from_params
    min_pretrained_embeddings=min_pretrained_embeddings)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/data/vocabulary.py", line 398, in from_instances
    for instance in Tqdm.tqdm(instances):
  File "/usr/local/lib/python3.6/dist-packages/tqdm/_tqdm.py", line 979, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.6/dist-packages/allennlp/training/trainer.py", line 789, in <genexpr>
    for instance in dataset
  File "/usr/local/lib/python3.6/dist-packages/allennlp/data/dataset_readers/dataset_reader.py", line 49, in __iter__
    yield from instances
  File "/content/MultiQA/models/multiqa_reader.py", line 416, in _read
    single_file_path_cached = cached_path(single_file_path)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/common/file_utils.py", line 98, in cached_path
    return get_from_cache(url_or_filename, cache_dir)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/common/file_utils.py", line 194, in get_from_cache
    etag = s3_etag(url)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/common/file_utils.py", line 142, in wrapper
    return func(url, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/common/file_utils.py", line 158, in s3_etag
    return s3_object.e_tag
  File "/usr/local/lib/python3.6/dist-packages/boto3/resources/factory.py", line 339, in property_loader
    self.load()
  File "/usr/local/lib/python3.6/dist-packages/boto3/resources/factory.py", line 505, in do_action
    response = action(self, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/boto3/resources/action.py", line 83, in __call__
    response = getattr(parent.meta.client, operation_name)(**params)
  File "/usr/local/lib/python3.6/dist-packages/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/botocore/client.py", line 648, in _make_api_call
    operation_model, request_dict, request_context)
  File "/usr/local/lib/python3.6/dist-packages/botocore/client.py", line 667, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/usr/local/lib/python3.6/dist-packages/botocore/endpoint.py", line 102, in make_request
    return self._send_request(request_dict, operation_model)
  File "/usr/local/lib/python3.6/dist-packages/botocore/endpoint.py", line 132, in _send_request
    request = self.create_request(request_dict, operation_model)
  File "/usr/local/lib/python3.6/dist-packages/botocore/endpoint.py", line 116, in create_request
    operation_name=operation_model.name)
  File "/usr/local/lib/python3.6/dist-packages/botocore/hooks.py", line 356, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/botocore/hooks.py", line 228, in emit
    return self._emit(event_name, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/botocore/hooks.py", line 211, in _emit
    response = handler(**kwargs)
  File "/usr/local/lib/python3.6/dist-packages/botocore/signers.py", line 90, in handler
    return self.sign(operation_name, request)
  File "/usr/local/lib/python3.6/dist-packages/botocore/signers.py", line 157, in sign
    auth.add_auth(request)
  File "/usr/local/lib/python3.6/dist-packages/botocore/auth.py", line 425, in add_auth
    super(S3SigV4Auth, self).add_auth(request)
  File "/usr/local/lib/python3.6/dist-packages/botocore/auth.py", line 357, in add_auth
    raise NoCredentialsError
botocore.exceptions.NoCredentialsError: Unable to locate credentials

I honestly don't understand why it would even be trying to call Boto, as all files should be local.
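The traceback passes through cached_path, get_from_cache, and s3_etag, which suggests that at least one path handed to the reader started with s3:// rather than pointing at a local file. A minimal sketch for spotting such paths in a config is shown below. It is not AllenNLP's implementation; the helper names and the example bucket are hypothetical, and the config is assumed to be an already-merged dict whose data paths use keys ending in `_data_path`.

```python
from urllib.parse import urlparse


def path_kind(path: str) -> str:
    """Classify a data path by URL scheme: 's3', 'http', or 'local'."""
    scheme = urlparse(path).scheme
    if scheme == "s3":
        return "s3"
    if scheme in ("http", "https"):
        return "http"
    return "local"


def remote_paths(config: dict) -> dict:
    """Return the *_data_path entries of a config that are not local files."""
    return {
        key: value
        for key, value in config.items()
        if key.endswith("_data_path") and path_kind(value) != "local"
    }


# Hypothetical merged config: one path is local, the other points at S3.
config = {
    "train_data_path": "emrQA.jsonl.gz",
    "validation_data_path": "s3://example-bucket/dev.jsonl.gz",
}
print(remote_paths(config))
```

Any key reported here would need either credentials for the remote store or an override pointing at a local file.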

alontalmor commented 5 years ago

Hi,

Yes, if you build a custom dataset right now, it automatically tries to upload it to the S3 location of the new dataset (boto is needed for that).

I will change this default behavior.

Thanks for notifying me! Alon

alontalmor commented 5 years ago

Can you please send me the command you are running?

(If the output_file does not contain s3://, boto3 should not currently be required, so I'd like to re-run your command.)

isaacmg commented 5 years ago

This is my command:

python -m allennlp.run train models/MultiQA_BERTBase.jsonnet -s Results/emrQA/Train1000 -o "{'train_data_path': 'emrQA.jsonl.gz', 'trainer': {'cuda_device': -1, 'optimizer': {'t_total': 29000}}}" --include-package models

For context, I'm training on emrQA (https://github.com/panushri25/emrQA). There are still a few other problems I'm sorting out regarding getting it into SQuAD format (I changed a couple of things in multiqa_reader), but you can see all my current changes in my fork of the repo (https://github.com/isaacmg/MultiQA).

alontalmor commented 5 years ago

I see, so you get the boto3 error when training? With the command you sent, that should not be happening. (You are not referring to s3 in this command.)

Also, the MultiQA format is different from the SQuAD format; the exact details are here: https://github.com/alontalmor/MultiQA/blob/master/datasets/README.md

isaacmg commented 5 years ago

Yes, when running that training command. I can't figure out why exactly. Based on debugging and the message, it appears to be the line single_file_path_cached = cached_path(single_file_path) in _read in multiqa_reader.py that is throwing the error. Do you need to supply a validation path in the jsonnet? Because that could be the problem.

On the other point, I'm converting it to the MultiQA format, but I'm trying to use the SQuAD build code as a template. Unfortunately, the emrQA format has a few subtle differences from SQuAD, which I'm just now realizing, so I'm working on fixing those errors.

alontalmor commented 5 years ago

Now I see what's going on.

Yes, you need to override the validation_data_path; otherwise it will use the default, which is "s3://multiqa/data/SQuAD1-1_dev.jsonl.gz". I will change the default to be a URL and not s3, but I still think you don't want the default.
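A sketch of the training command with validation_data_path overridden alongside train_data_path is given below. The override string is built with json.dumps so that quoting stays valid. The dev file name emrQA_dev.jsonl.gz is a hypothetical placeholder for a local validation split; every other value is copied from the command posted earlier in this thread.

```python
import json
import shlex

# Overrides passed via `-o`. The validation file name is a hypothetical local
# dev split; replace it with your own file.
overrides = {
    "train_data_path": "emrQA.jsonl.gz",
    "validation_data_path": "emrQA_dev.jsonl.gz",
    "trainer": {"cuda_device": -1, "optimizer": {"t_total": 29000}},
}

command = [
    "python", "-m", "allennlp.run", "train",
    "models/MultiQA_BERTBase.jsonnet",
    "-s", "Results/emrQA/Train1000",
    "-o", json.dumps(overrides),
    "--include-package", "models",
]

# shlex.quote keeps the JSON override intact when pasted into a shell.
print(" ".join(shlex.quote(part) for part in command))
```

With both data paths pointing at local files, the reader should no longer have a reason to contact S3 for this run.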
