allenai / dont-stop-pretraining

Code associated with the Don't Stop Pretraining ACL 2020 paper

Fail to reproduce the work #32

Open muyuhuatang opened 3 years ago

muyuhuatang commented 3 years ago

Could you please check the implementation steps you provided in the README file?

I followed your instructions but found it very hard to reproduce this work. Some errors come up, such as a version inconsistency between allennlp and transformers, which then leads to an error like:

subprocess.CalledProcessError: Command 'allennlp train training_config/classifier.jsonnet --include-package dont_stop_pretraining -s model_logs\citation_intent_base' returned non-zero exit status 1.

Or did I take some wrong steps during my setup? It is really confusing and frustrating.
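
One way to see the underlying failure, rather than just the CalledProcessError raised by the wrapper script, is to re-run the exact allennlp command from the traceback and inspect its output. A minimal debugging sketch (not from the repo; the path separator is normalized to a forward slash):

# Re-run the failing command directly so the real allennlp/transformers
# traceback is printed instead of being swallowed by the wrapper script.
import subprocess

cmd = [
    "allennlp", "train", "training_config/classifier.jsonnet",
    "--include-package", "dont_stop_pretraining",
    "-s", "model_logs/citation_intent_base",
]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)
print(result.stderr)  # the underlying error appears here when the command fails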

muyuhuatang commented 3 years ago

May I ask what the allennlp version in this project is? I tried 2.2.0 and 0.9.0, but both lead to errors.

coxep commented 3 years ago

I tried using the pinned version (specified in environment.yml), and that also failed with the error shared above. Please provide a working environment.yml.
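
For anyone else debugging this, a quick way to confirm what is actually installed before launching training (just a sanity-check sketch, not part of the repo):

# Print the installed versions of the packages most often blamed for the
# mismatch errors reported in this issue.
import pkg_resources

for pkg in ("allennlp", "transformers", "torch"):
    try:
        print(pkg, pkg_resources.get_distribution(pkg).version)
    except pkg_resources.DistributionNotFound:
        print(pkg, "not installed")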

gmarcial44 commented 3 years ago

I think there might be an issue with the publicly available datasets? I get the following error:

ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
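
If the 403 comes from the AWS CLI or boto3 signing the request with unrelated local credentials, anonymous (unsigned) access is a quick way to check whether the objects are still public. A small sketch, assuming the object key follows the S3 paths and the train/dev/test .jsonl naming used by this repo:

# Check whether one of the dataset objects is still publicly readable,
# using unsigned (anonymous) S3 requests. The object key below is an
# assumption based on the data_dir URLs used for the classification tasks.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", region_name="us-west-2", config=Config(signature_version=UNSIGNED))
response = s3.head_object(
    Bucket="allennlp",
    Key="dont_stop_pretraining/data/citation_intent/train.jsonl",
)
print(response["ContentLength"], "bytes")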

wise-east commented 2 years ago

@gmarcial44 are you using the latest-allennlp branch? If so, I was able to get around this issue by replacing the environments/datasets.py file with the following:

NER_DATASETS = {
    "ncbi": {
        "data_dir": "/home/suching/scibert/data/ner/NCBI-disease/",
    },
    "sciie": {
        "data_dir": "/home/suching/scibert/data/ner/sciie/"
    },
    "jnlpba": {
        "data_dir": "/home/suching/scibert/data/ner/JNLPBA/"
    },
    "bc5cdr": {
        "data_dir": "/home/suching/scibert/data/ner/bc5cdr/"
    }
}

CLASSIFICATION_DATASETS = {
    "chemprot": {
        "data_dir": "https://s3-us-west-2.amazonaws.com/allennlp/dont_stop_pretraining/data/chemprot/",
        "dataset_size": 4169
    },
    "rct-20k": {
        "data_dir": "https://s3-us-west-2.amazonaws.com/allennlp/dont_stop_pretraining/data/rct-20k/",
        "dataset_size": 180040
    },
    "rct-sample": {
        "data_dir": "https://s3-us-west-2.amazonaws.com/allennlp/dont_stop_pretraining/data/rct-sample/",
        "dataset_size": 500
    },
    "citation_intent": {
        "data_dir": "https://s3-us-west-2.amazonaws.com/allennlp/dont_stop_pretraining/data/citation_intent/",
        "dataset_size": 1688
    },
    "sciie": {
        "data_dir": "https://s3-us-west-2.amazonaws.com/allennlp/dont_stop_pretraining/data/sciie/",
        "dataset_size": 3219
    },
    "ag": {
        "data_dir": "https://s3-us-west-2.amazonaws.com/allennlp/dont_stop_pretraining/data/ag/",
        "dataset_size": 115000
    },
    "hyperpartisan_news": {
        "data_dir": "https://s3-us-west-2.amazonaws.com/allennlp/dont_stop_pretraining/data/hyperpartisan_news/",
        "dataset_size": 500
    },
    "imdb": {
        "data_dir": "https://s3-us-west-2.amazonaws.com/allennlp/dont_stop_pretraining/data/imdb/",
        "dataset_size": 20000
    },
    "amazon": {
        "data_dir": "https://s3-us-west-2.amazonaws.com/allennlp/dont_stop_pretraining/data/amazon/",
        "dataset_size": 115251
    }
}

DATASETS = {"NER": NER_DATASETS, "CLASSIFICATION": CLASSIFICATION_DATASETS}
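
Note that in this version only the CLASSIFICATION_DATASETS entries point at the public S3 URLs; the NER entries still point at local paths under /home/suching/, so the NER data would still need to be obtained separately (e.g. from the scibert repo) and those paths updated to wherever you place it. If I remember correctly, each classification data_dir is joined with split files named train.jsonl, dev.jsonl, and test.jsonl.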