Hi @leosouliotis, the directory should be there as long as the shell script lrtc_lib/download_and_prepare_datasets.sh finished successfully. Can you (re)run the script and attach the output?
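For example, something along these lines should rerun it and capture the full output in one go (the log file name here is only illustrative):

lrtc_lib/download_and_prepare_datasets.sh 2>&1 | tee dataset_download.log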
Thanks @arielge! The problem seems to be with the polarity dataset; any thoughts on that? Here is the output:
lrtc_lib/download_and_prepare_datasets.sh
/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data//raw/
** Downloading polarity files **
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0
curl: (60) Certificate key usage inadequate for attempted operation.
More details here: http://curl.haxx.se/docs/sslcerts.html
curl performs SSL certificate verification by default, using a "bundle"
of Certificate Authority (CA) public keys (CA certs). If the default
bundle file isn't adequate, you can specify an alternate file
using the --cacert option.
If this HTTPS server uses a certificate signed by a CA represented in
the bundle, the certificate verification probably failed due to a
problem with the certificate (it might be expired, or the name might
not match the domain name in the URL).
If you'd like to turn off curl's verification of the certificate, use
the -k (or --insecure) option.
tar (child): polarity.tar.gz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
mv: cannot stat ‘rt-polaritydata’: No such file or directory
Traceback (most recent call last):
File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data//get_by_ids.py", line 66, in <module>
res_df = extract_data(df, raw_dir, label=label_func)
File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data//get_by_ids.py", line 31, in extract_data
with open(file_path, encoding="iso-8859-1") as fl:
FileNotFoundError: [Errno 2] No such file or directory: '../raw/polarity/rt-polarity.neg'
cp: cannot stat ‘polarity’: No such file or directory
Traceback (most recent call last):
File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data//get_by_ids.py", line 66, in <module>
res_df = extract_data(df, raw_dir, label=label_func)
File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data//get_by_ids.py", line 31, in extract_data
with open(file_path, encoding="iso-8859-1") as fl:
FileNotFoundError: [Errno 2] No such file or directory: '../raw/polarity_imbalanced_positive/rt-polarity.neg'
rm: cannot remove ‘polarity.tar.gz’: No such file or directory
** Downloading subjectivity files **
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 507k 100 507k 0 0 65.2M 0 --:--:-- --:--:-- --:--:-- 70.7M
** Downloading AG news files **
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 383 0 383 0 0 229 0 --:--:-- 0:00:01 --:--:-- 229
0 0 0 11.2M 0 0 2447k 0 --:--:-- 0:00:04 --:--:-- 5367k
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 403 0 403 0 0 255 0 --:--:-- 0:00:01 --:--:-- 255
0 0 0 0 0 0 0 0 --:--:-- 0:00:02 --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:02 --:--:-- 0
0 0 0 11.2M 0 0 2320k 0 --:--:-- 0:00:04 --:--:-- 5158k
looking for /home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data/raw/ag_news/train.csv
converted 120000 lines from ./ag_news/train.csv
looking for /home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data/raw/ag_news/test.csv
converted 7600 lines from ./ag_news/test.csv
** Downloading wiki attack files **
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 100M 100 100M 0 0 391k 0 0:04:22 0:04:22 --:--:-- 312k
Archive: wiki_attack.zip
extracting: ./wiki_attack/attack_annotated_comments.tsv
extracting: ./wiki_attack/attack_annotations.tsv
extracting: ./wiki_attack/attack_worker_demographics.tsv
train size = 69526
dev size = 23160
test size = 23178
** Downloading TREC files **
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 327k 100 327k 0 0 163k 0 0:00:02 0:00:02 --:--:-- 163k
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 23354 100 23354 0 0 27341 0 --:--:-- --:--:-- --:--:-- 27314
** Downloading CoLA files **
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 249k 100 249k 0 0 368k 0 --:--:-- --:--:-- --:--:-- 368k
Archive: cola.zip
creating: ./cola/cola_public/
inflating: ./cola/cola_public/README
creating: ./cola/cola_public/tokenized/
inflating: ./cola/cola_public/tokenized/in_domain_dev.tsv
inflating: ./cola/cola_public/tokenized/in_domain_train.tsv
inflating: ./cola/cola_public/tokenized/out_of_domain_dev.tsv
creating: ./cola/cola_public/raw/
inflating: ./cola/cola_public/raw/in_domain_dev.tsv
inflating: ./cola/cola_public/raw/in_domain_train.tsv
inflating: ./cola/cola_public/raw/out_of_domain_dev.tsv
** Downloading ISEAR files **
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 1075k 100 1075k 0 0 214k 0 0:00:05 0:00:05 --:--:-- 277k
Archive: isear.zip
inflating: ./isear/ISEAR Questionnaire & Codebook.doc
inflating: ./isear/ISEAR SPSS Databank.zip
inflating: ./isear/isear.html
inflating: ./isear/isear_databank.zip
Archive: isear_databank.zip
inflating: isear_databank.mdb
********************************************************************************************************
****** Loading the ISEAR dataset requires special dependencies. *********
****** On Mac/Linux, install https://github.com/mdbtools/mdbtools and `pip install pandas_access`, *****
****** and then rerun the main script. *****
********************************************************************************************************
2021-06-07 13:49:49.627248: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2021-06-07 13:49:49.627431: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2021-06-07 13:49:49.627443: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Traceback (most recent call last):
File "<string>", line 3, in <module>
File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data/load_dataset.py", line 21, in load
single_dataset_loader.load_dataset(dataset_name, force_new)
File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data_access/single_dataset_loader.py", line 38, in load_dataset
data_processor: DataProcessorAPI = processor_factory.get_data_processor(dataset_name)
File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data_access/processors/data_processor_factory.py", line 34, in get_data_processor
return PolarityProcessor(dataset_part=dataset_part)
File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data_access/processors/process_polarity_data.py", line 16, in __init__
super().__init__(dataset_name='polarity'+imbalanced_postfix, dataset_part=dataset_part, encoding='latin-1')
File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data_access/processors/process_csv_data.py", line 62, in __init__
self._process()
File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data_access/processors/process_csv_data.py", line 99, in _process
raise Exception(f'{self.dataset_part.name.lower()} set file for dataset "{self.dataset_name}" not found')
Exception: train set file for dataset "polarity" not found
OK, I am not sure what is causing the curl issue, but my guess is that editing the address on line 22 of the shell script from https to http (so curl http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz -o polarity.tar.gz) will do the trick.
If not, you can manually download the file from the link above and place it under lrtc_lib/data/raw/polarity.tar.gz, and the rest of the script should run smoothly.
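If you go the manual route, something along these lines should put the archive where the script expects it (run from the repository root; this just mirrors the suggested curl call, with the target path mentioned above and -L added as a precaution in case of redirects):

mkdir -p lrtc_lib/data/raw
curl -L http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz -o lrtc_lib/data/raw/polarity.tar.gz

Alternatively, keeping https and adding curl's -k/--insecure flag (as the error message itself suggests) should also get past the certificate check.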
Maybe it was an issue on Cornell's side, but this solved the problem! Thanks!
Hello,
I am trying to run the example in the README file but got the following error. All the packages are installed, plus pandas_access. Any suggestions? Or is it a dependencies issue?
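In case it helps: the ISEAR step in the script output asks for mdbtools and pandas_access, which can usually be installed along these lines (the exact package manager command depends on your system):

sudo apt-get install mdbtools    # or: brew install mdbtools on macOS
pip install pandas_access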