IBM / low-resource-text-classification-framework

Research framework for low-resource text classification that allows the user to experiment with classification models and active learning strategies on a large number of sentence classification datasets, and to simulate real-world scenarios. The framework is easily extendable with new classification models, active learning strategies and datasets.
Apache License 2.0

Issues with custom dataset #6

Closed leosouliotis closed 3 years ago

leosouliotis commented 3 years ago

Hello,

I am trying to implement your AL strategy on a custom dataset. I followed all the steps (with the minimum required setup for the CSV processor), and when I try to run the load_dataset script I get the following:

Traceback (most recent call last):
  File "/home/kpvv542/.conda/envs/kpvv542/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/kpvv542/.conda/envs/kpvv542/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data/load_dataset.py", line 8, in <module>
    from lrtc_lib.data_access import single_dataset_loader
  File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data_access/single_dataset_loader.py", line 13, in <module>      
    import lrtc_lib.data_access.data_access_factory as data_access_factory
  File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data_access/data_access_factory.py", line 8, in <module>
    from lrtc_lib.data_access.processors.process_csv_data import CsvProcessor
  File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data_access/processors/process_csv_data.py", line 13, in <module>
    from lrtc_lib.data_access.processors.data_processor_api import DataProcessorAPI, METADATA_CONTEXT_KEY
  File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data_access/processors/data_processor_api.py", line 10, in <module>
    import lrtc_lib.orchestrator.orchestrator_api as orchestrator_api
  File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/orchestrator/orchestrator_api.py", line 23, in <module>
    from lrtc_lib.training_set_selector import training_set_selector_factory
  File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/training_set_selector/training_set_selector_factory.py", line 6, in <module>
    from lrtc_lib.training_set_selector.train_and_dev_sets_selectors import TrainAndDevSetsSelectorAllLabeled, TrainAndDevSetsSelectorAllLabeledPlusUnlabeledAsWeakNegative
  File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/training_set_selector/train_and_dev_sets_selectors.py", line 14, in <module>
    data_access = data_access_factory.get_data_access()
AttributeError: partially initialized module 'lrtc_lib.data_access.data_access_factory' has no attribute 'get_data_access' (most likely due to a circular import)

Any suggestions? I don't think there is anything wrong with the dataset itself.

arielge commented 3 years ago

Hi @leosouliotis, this looks like a circular import issue. I can see from the error message that on line 8 of lrtc_lib/data_access/data_access_factory.py you have an import that is not present in our code, and it may be the direct cause of the circular import. Do you need this import there (from lrtc_lib.data_access.processors.process_csv_data import CsvProcessor)?
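
For anyone who genuinely does need such an import inside the factory, one common way to break a cycle like this is to defer the import into the function that uses it. The sketch below shows the general pattern only; it is not the framework's code, and the helper name and constructor arguments are made up for illustration:

# Sketch of the general lazy-import pattern, not the committed data_access_factory.py.
# A module-level import of the processor can re-enter data_access_factory while it is
# still being initialized, which is what produces the "partially initialized module" error.

def get_csv_processor(dataset_name):
    # Importing inside the function means the import only runs after both
    # modules have finished loading, so the cycle is never triggered.
    from lrtc_lib.data_access.processors.process_csv_data import CsvProcessor
    return CsvProcessor(dataset_name=dataset_name)  # constructor arguments are illustrative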

leosouliotis commented 3 years ago

Thanks for the quick response @arielge! I did delete this entry from lrtc_lib/data_access/data_access_factory.py (it was left over from my previous experimentation, silly me), but then got this error:

(kpvv542) (lrtc_env) [kpvv542@seskscpn080 low-resource-text-classification-framework]$ python -m lrtc_lib.data.load_dataset
Traceback (most recent call last):
  File "/opt/scp/software/Miniconda3/4.7.12.1/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/scp/software/Miniconda3/4.7.12.1/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data/load_dataset.py", line 8, in <module>
    from lrtc_lib.data_access import single_dataset_loader
  File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data_access/single_dataset_loader.py", line 13, in <module>
    import lrtc_lib.data_access.data_access_factory as data_access_factory
  File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data_access/data_access_factory.py", line 8, in <module>
    from lrtc_lib.data_access.processors.data_processor_api import DataProcessorAPI
  File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data_access/processors/data_processor_api.py", line 10, in <module>
    import lrtc_lib.orchestrator.orchestrator_api as orchestrator_api
  File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/orchestrator/orchestrator_api.py", line 23, in <module>
    from lrtc_lib.training_set_selector import training_set_selector_factory
  File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/training_set_selector/training_set_selector_factory.py", line 6, in <module>
    from lrtc_lib.training_set_selector.train_and_dev_sets_selectors import TrainAndDevSetsSelectorAllLabeled, TrainAndDevSetsSelectorAllLabeledPlusUnlabeledAsWeakNegative
  File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/training_set_selector/train_and_dev_sets_selectors.py", line 14, in <module>
    data_access = data_access_factory.get_data_access()
AttributeError: module 'lrtc_lib.data_access.data_access_factory' has no attribute 'get_data_access'

It seems to be the same error, just surfacing in a slightly different way... (?)

arielge commented 3 years ago

Indeed, this looks like more of the same. Can you try restoring lrtc_lib/data_access/data_access_factory.py to the committed version? You should see only two imports there (DataAccessApi and DataAccessInMemory).
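
For reference, the restored factory is a very small module, roughly along these lines. This is a sketch rather than the exact committed file: the module paths and the no-argument constructor are assumptions, only the two class names and the get_data_access function are confirmed by the error above.

from lrtc_lib.data_access.data_access_api import DataAccessApi
from lrtc_lib.data_access.data_access_in_memory import DataAccessInMemory


def get_data_access() -> DataAccessApi:
    # Assumed: the framework's data-access implementation used here is the
    # in-memory one, so the factory simply returns an instance of it.
    return DataAccessInMemory()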

leosouliotis commented 3 years ago

Thanks for the suggestion! Now I get a different error.

(kpvv542) (lrtc_env) [kpvv542@seskscpn080 low-resource-text-classification-framework]$ python -m lrtc_lib.data.load_dataset
2021-06-30 14:05:47.231191: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2021-06-30 14:05:47.231392: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2021-06-30 14:05:47.231406: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Traceback (most recent call last):
  File "/opt/scp/software/Miniconda3/4.7.12.1/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/scp/software/Miniconda3/4.7.12.1/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data/load_dataset.py", line 31, in <module>
    load(dataset=dataset_name)
  File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data/load_dataset.py", line 21, in load
    single_dataset_loader.load_dataset(dataset_name, force_new)
  File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data_access/single_dataset_loader.py", line 38, in load_dataset
    data_processor: DataProcessorAPI = processor_factory.get_data_processor(dataset_name)
  File "/home/kpvv542/Projects/low-resource-text-classification-framework/lrtc_lib/data_access/processors/data_processor_factory.py", line 25, in get_data_processor
    return TrialTroveProcessor(dataset_part=dataset_part)
TypeError: Can't instantiate abstract class TrialTroveProcessor with abstract methods _get_dev_file_path, _get_test_file_path, _get_train_file_path

But let me explain further: I have created a processor in the following way:

from lrtc_lib.data_access.processors.dataset_part import DatasetPart
from lrtc_lib.data_access.processors.data_processor_api import DataProcessorAPI

class TrialTroveProcessor(DataProcessorAPI):

    def __init__(self, dataset_part: DatasetPart, label_col: str = 'target'):
        super().__init__(dataset_name='trialtrove', dataset_part=dataset_part)

and to data_processor_factory.py I have added the following:

from lrtc_lib.data_access.processors.process_trialtrove import TrialTroveProcessor

    if dataset_source == 'trialtrove':
        return TrialTroveProcessor(dataset_part=dataset_part)

arielge commented 3 years ago

I see. Your TrialTroveProcessor inherits from DataProcessorAPI, which has some abstract methods (see https://docs.python.org/3/library/abc.html). This means that if you inherit from it, you must implement these methods, namely _get_train_file_path, _get_dev_file_path and _get_test_file_path.
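
If you want to keep inheriting directly from DataProcessorAPI, the class would need to provide those three methods itself, for example along these lines. This is a minimal sketch: the exact signatures and expected return values are defined in data_processor_api.py, and the file paths here are just placeholders for wherever your raw CSV files live.

import os

from lrtc_lib.data_access.processors.dataset_part import DatasetPart
from lrtc_lib.data_access.processors.data_processor_api import DataProcessorAPI


class TrialTroveProcessor(DataProcessorAPI):

    def __init__(self, dataset_part: DatasetPart, label_col: str = 'target'):
        super().__init__(dataset_name='trialtrove', dataset_part=dataset_part)

    # Each abstract method should return the path to the raw data file for
    # the corresponding dataset part (placeholder paths below).
    def _get_train_file_path(self) -> str:
        return os.path.join('data', 'trialtrove', 'train.csv')

    def _get_dev_file_path(self) -> str:
        return os.path.join('data', 'trialtrove', 'dev.csv')

    def _get_test_file_path(self) -> str:
        return os.path.join('data', 'trialtrove', 'test.csv')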

leosouliotis commented 3 years ago

Thanks for your time and effort @arielge!

Changing to CsvProcessor rather than DataProcessorAPI solved the issue! Maybe it's worth pointing out the extra steps needed if someone wants to implement the full DataProcessorAPI?
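
For anyone landing on this later, the change was essentially just switching the base class, roughly like this (assuming CsvProcessor accepts the same dataset_name/dataset_part arguments; check its constructor in process_csv_data.py for the full set of options, e.g. the text and label column names):

from lrtc_lib.data_access.processors.dataset_part import DatasetPart
from lrtc_lib.data_access.processors.process_csv_data import CsvProcessor


class TrialTroveProcessor(CsvProcessor):
    # CsvProcessor already implements the file-path methods for CSV-based
    # datasets, so no abstract methods are left to override here.
    def __init__(self, dataset_part: DatasetPart):
        super().__init__(dataset_name='trialtrove', dataset_part=dataset_part)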

Feel free to close this issue, thanks again!