IBM / low-resource-text-classification-framework

Research framework for low-resource text classification that allows the user to experiment with classification models and active learning strategies on a large number of sentence classification datasets, and to simulate real-world scenarios. The framework is easily extensible with new classification models, active learning strategies and datasets.
Apache License 2.0

Unable to execute example in README due to missing directory. #5

Closed leosouliotis closed 3 years ago

leosouliotis commented 3 years ago

Hello,

I am trying to run the example from the README file, but I get the following error:

Traceback (most recent call last):
  File "/opt/scp/software/Miniconda3/4.7.12.1/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/scp/software/Miniconda3/4.7.12.1/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/kpvv542/Projects/training_projects/low-resource-text-classification-framework/lrtc_lib/experiment_runners/experiment_runner_imbalanced_practical.py", line 185, in <module>
    delete_workspaces=True)
  File "/home/kpvv542/Projects/training_projects/low-resource-text-classification-framework/lrtc_lib/experiment_runners/experiment_runner.py", line 88, in run
    res_dict = self.train_first_model(config=config)
  File "/home/kpvv542/Projects/training_projects/low-resource-text-classification-framework/lrtc_lib/experiment_runners/experiment_runner.py", line 126, in train_first_model
    dev_dataset_name=config.dev_dataset_name)
  File "/home/kpvv542/Projects/training_projects/low-resource-text-classification-framework/lrtc_lib/orchestrator/orchestrator_api.py", line 115, in create_workspace   
    orchestrator_state_api.create_workspace(workspace_id, dataset_name, dev_dataset_name, test_dataset_name)
  File "/home/kpvv542/Projects/training_projects/low-resource-text-classification-framework/lrtc_lib/orchestrator/core/state_api/orchestrator_state_api.py", line 60, in wrapper
    return func(*a, **k)
  File "/home/kpvv542/Projects/training_projects/low-resource-text-classification-framework/lrtc_lib/orchestrator/core/state_api/orchestrator_state_api.py", line 73, in create_workspace
    assert dataset_name in get_all_datasets(), f"Dataset {dataset_name} does not exist, existing datasets are:" \
  File "/home/kpvv542/Projects/training_projects/low-resource-text-classification-framework/lrtc_lib/data_access/loaded_datasets_info.py", line 15, in get_all_datasets 
    return sorted(os.listdir(get_datasets_base_dir()))
FileNotFoundError: [Errno 2] No such file or directory: '/home/kpvv542/Projects/training_projects/low-resource-text-classification-framework/lrtc_lib/data/data_access_dumps'

All the packages are installed, including pandas_access. Any suggestions? Or is it a dependencies issue?

arielge commented 3 years ago

Hi @leosouliotis, the directory should be there as long as the shell script lrtc_lib/download_and_prepare_datasets.sh finished successfully. Can you (re)run the script and attach the output?
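For reference, here is a quick sanity check for whether the script produced the expected output. This is just a sketch; the `data_access_dumps` path is taken from the traceback above, and `datasets_ready` is a hypothetical helper, not part of the framework:

```python
import os

def datasets_ready(repo_root: str) -> bool:
    """Return True if download_and_prepare_datasets.sh appears to have
    produced at least one dataset dump (path taken from the traceback)."""
    dumps_dir = os.path.join(repo_root, "lrtc_lib", "data", "data_access_dumps")
    return os.path.isdir(dumps_dir) and len(os.listdir(dumps_dir)) > 0

if __name__ == "__main__":
    # Run from the repository root; False means the shell script
    # did not finish (or was not run) and the experiment will fail.
    print(datasets_ready("."))
```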

leosouliotis commented 3 years ago

Thanks @arielge! The problem seems to be with the polarity dataset; any thoughts on that?

lrtc_lib/download_and_prepare_datasets.sh
/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data//raw/
** Downloading polarity files **
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed  
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0 
curl: (60) Certificate key usage inadequate for attempted operation.
More details here: http://curl.haxx.se/docs/sslcerts.html

curl performs SSL certificate verification by default, using a "bundle"
of Certificate Authority (CA) public keys (CA certs). If the default
bundle file isn't adequate, you can specify an alternate file
using the --cacert option.
If this HTTPS server uses a certificate signed by a CA represented in
the bundle, the certificate verification probably failed due to a
problem with the certificate (it might be expired, or the name might
not match the domain name in the URL).
If you'd like to turn off curl's verification of the certificate, use
the -k (or --insecure) option.
tar (child): polarity.tar.gz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
mv: cannot stat ‘rt-polaritydata’: No such file or directory        
Traceback (most recent call last):
  File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data//get_by_ids.py", line 66, in <module>
    res_df = extract_data(df, raw_dir, label=label_func)
  File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data//get_by_ids.py", line 31, in extract_data
    with open(file_path, encoding="iso-8859-1") as fl:
FileNotFoundError: [Errno 2] No such file or directory: '../raw/polarity/rt-polarity.neg'
cp: cannot stat ‘polarity’: No such file or directory
Traceback (most recent call last):
  File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data//get_by_ids.py", line 66, in <module>    
    res_df = extract_data(df, raw_dir, label=label_func)
  File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data//get_by_ids.py", line 31, in extract_data
    with open(file_path, encoding="iso-8859-1") as fl:
FileNotFoundError: [Errno 2] No such file or directory: '../raw/polarity_imbalanced_positive/rt-polarity.neg'
rm: cannot remove ‘polarity.tar.gz’: No such file or directory
** Downloading subjectivity files **
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  507k  100  507k    0     0  65.2M      0 --:--:-- --:--:-- --:--:-- 70.7M
** Downloading AG news files **
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   383    0   383    0     0    229      0 --:--:--  0:00:01 --:--:--   229
  0     0    0 11.2M    0     0  2447k      0 --:--:--  0:00:04 --:--:-- 5367k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   403    0   403    0     0    255      0 --:--:--  0:00:01 --:--:--   255
  0     0    0     0    0     0      0      0 --:--:--  0:00:02 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:02 --:--:--     0
  0     0    0 11.2M    0     0  2320k      0 --:--:--  0:00:04 --:--:-- 5158k
looking for /home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data/raw/ag_news/train.csv
converted 120000 lines from ./ag_news/train.csv
looking for /home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data/raw/ag_news/test.csv
converted 7600 lines from ./ag_news/test.csv
** Downloading wiki attack files **
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  100M  100  100M    0     0   391k      0  0:04:22  0:04:22 --:--:--  312k
Archive:  wiki_attack.zip
 extracting: ./wiki_attack/attack_annotated_comments.tsv  
 extracting: ./wiki_attack/attack_annotations.tsv  
 extracting: ./wiki_attack/attack_worker_demographics.tsv  
train size = 69526
dev size = 23160
test size = 23178
** Downloading TREC files **
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  327k  100  327k    0     0   163k      0  0:00:02  0:00:02 --:--:--  163k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 23354  100 23354    0     0  27341      0 --:--:-- --:--:-- --:--:-- 27314
** Downloading CoLA files **
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  249k  100  249k    0     0   368k      0 --:--:-- --:--:-- --:--:--  368k
Archive:  cola.zip
   creating: ./cola/cola_public/
  inflating: ./cola/cola_public/README  
   creating: ./cola/cola_public/tokenized/
  inflating: ./cola/cola_public/tokenized/in_domain_dev.tsv  
  inflating: ./cola/cola_public/tokenized/in_domain_train.tsv
  inflating: ./cola/cola_public/tokenized/out_of_domain_dev.tsv
   creating: ./cola/cola_public/raw/
  inflating: ./cola/cola_public/raw/in_domain_dev.tsv  
  inflating: ./cola/cola_public/raw/in_domain_train.tsv
  inflating: ./cola/cola_public/raw/out_of_domain_dev.tsv
** Downloading ISEAR files **
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 1075k  100 1075k    0     0   214k      0  0:00:05  0:00:05 --:--:--  277k
Archive:  isear.zip
  inflating: ./isear/ISEAR Questionnaire & Codebook.doc  
  inflating: ./isear/ISEAR SPSS Databank.zip  
  inflating: ./isear/isear.html      
  inflating: ./isear/isear_databank.zip  
Archive:  isear_databank.zip
  inflating: isear_databank.mdb

********************************************************************************************************
****** Loading the ISEAR dataset requires special dependencies.                                *********
****** On Mac/Linux, install https://github.com/mdbtools/mdbtools and `pip install pandas_access`, *****
****** and then rerun the main script.                                                             *****
********************************************************************************************************

2021-06-07 13:49:49.627248: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2021-06-07 13:49:49.627431: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2021-06-07 13:49:49.627443: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Traceback (most recent call last):
  File "<string>", line 3, in <module>
  File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data/load_dataset.py", line 21, in load
    single_dataset_loader.load_dataset(dataset_name, force_new)
  File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data_access/single_dataset_loader.py", line 38, in load_dataset
    data_processor: DataProcessorAPI = processor_factory.get_data_processor(dataset_name)
  File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data_access/processors/data_processor_factory.py", line 34, in get_data_processor    
    return PolarityProcessor(dataset_part=dataset_part)
  File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data_access/processors/process_polarity_data.py", line 16, in __init__
    super().__init__(dataset_name='polarity'+imbalanced_postfix, dataset_part=dataset_part, encoding='latin-1')
  File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data_access/processors/process_csv_data.py", line 62, in __init__
    self._process()
  File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data_access/processors/process_csv_data.py", line 99, in _process
    raise Exception(f'{self.dataset_part.name.lower()} set file for dataset "{self.dataset_name}" not found')
Exception: train set file for dataset "polarity" not found
arielge commented 3 years ago

OK, I am not sure what the cause of the curl issue is, but my guess is that if you edit the address on line 22 of the shell script from https to http (so curl http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz -o polarity.tar.gz) it will do the trick. If not, you can manually download the file from the above link and place it at lrtc_lib/data/raw/polarity.tar.gz, and the rest of the script should run smoothly.
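The two options above can be sketched as a small shell fallback (URL and target path as given in this thread; this snippet is an illustration, not part of the repo script):

```shell
# Try the HTTPS download first; fall back to plain HTTP if certificate
# verification fails, as suggested above.
URL_HTTPS="https://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz"
URL_HTTP="${URL_HTTPS/https:/http:}"   # same address over plain HTTP

curl -fsS "$URL_HTTPS" -o polarity.tar.gz \
  || curl -fsS "$URL_HTTP" -o polarity.tar.gz \
  || echo "Both downloads failed; place the file manually at lrtc_lib/data/raw/polarity.tar.gz"
```

`curl -f` makes curl return a non-zero exit code on server errors, so the `||` chain only falls through to the next option when the previous one actually failed.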

leosouliotis commented 3 years ago

Maybe it was an issue on Cornell's side, but this seems to solve the problem! Thanks!