converting to features - Githubissues

Lysimachos commented 4 years ago

Describe the bug: I am trying to use RoBERTa for multi-label classification. I run into problems when converting to features, both in model.eval_model and in model.predict. When calling model.predict in the following code (taken from the simpletransformers example), it prints "Converting to features started. Cache is not used 0%| | 0/1 [00:00<?, ?it/s]" and then does nothing. It looks like it gets into an infinite loop or something similar.

To Reproduce

from simpletransformers.classification import MultiLabelClassificationModel
import pandas as pd

train_data = [
    ['Example sentence 1 for multilabel classification.', [1, 1, 1, 1, 0, 1]],
    ['This is another example sentence. ', [0, 1, 1, 0, 0, 0]],
]
train_df = pd.DataFrame(train_data, columns=['text', 'labels'])

eval_data = [
    ['Example eval sentence for multilabel classification.', [1, 1, 1, 1, 0, 1]],
    ['Example eval sentence belonging to class 2', [0, 1, 1, 0, 0, 0]],
]
eval_df = pd.DataFrame(eval_data, columns=['text', 'labels'])

model = MultiLabelClassificationModel(
    'roberta',
    'roberta-base',
    num_labels=6,
    args={'reprocess_input_data': True, 'overwrite_output_dir': True, 'num_train_epochs': 5},
)
print(train_df.head())

model.train_model(train_df)

result, model_outputs, wrong_predictions = model.eval_model(eval_df)
print(result)
print(model_outputs)

predictions, raw_outputs = model.predict(['This thing is entirely different from the other thing. '])
print(predictions)
print(raw_outputs)

Lysimachos commented 4 years ago

OK, it was fixed as soon as I set "use_multiprocessing": False.

What does this option do?

ThilinaRajapakse commented 4 years ago

When enabled, converting features is accelerated using multiprocessing on CPUs with multiple cores. Otherwise, feature conversion can take hours with large datasets.

Lysimachos commented 4 years ago

Thank you, ThilinaRajapakse, for your reply.

There seems to be an issue with "use_multiprocessing": True when CUDA (GPU) is enabled.

ThilinaRajapakse commented 4 years ago

What are you running the code on? It's not an issue on my machine and I don't think anyone else has run into this issue either.

Lysimachos commented 4 years ago

Yes, there seems to be an issue with my setup.

I am using the PyCharm remote interpreter on a machine with a GeForce RTX 2080 with CUDA 10.2 and an AMD Ryzen Threadripper 2950X 16-core processor.

ThilinaRajapakse commented 4 years ago

Maybe it's related to the PyCharm remote interpreter. My setup is pretty similar to yours (RTX Titan and Ryzen 2700X), unless there is an issue with the Threadripper series that I am not aware of. You could try setting "process_count": 8 to see whether it makes a difference. From what I can remember, Threadripper has two separate processors on the same chip, right?

Lysimachos commented 4 years ago

It is not the remote interpreter, because I tried running directly on the machine, which resulted in the same problem. Everything worked as it should when I used "process_count": 8. I also tried "process_count": 16 and everything worked fine.
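The two workarounds discussed so far (turning multiprocessing off, or pinning the worker count) can both be expressed through the args dictionary. The following is a minimal sketch, assuming only the option names quoted in this thread (use_multiprocessing, process_count). The helper build_args and its max_processes cap are hypothetical names for illustration and are not part of simpletransformers.

```python
import os


def build_args(use_multiprocessing=True, max_processes=8):
    """Return an args dict for a simpletransformers model.

    build_args and max_processes are hypothetical names for this sketch;
    only the dictionary keys come from the thread above.
    """
    args = {
        "reprocess_input_data": True,
        "overwrite_output_dir": True,
        "use_multiprocessing": use_multiprocessing,
    }
    if use_multiprocessing:
        # Never ask for more workers than the machine reports, and cap the
        # count explicitly instead of relying on the library default.
        available = os.cpu_count() or 1
        args["process_count"] = max(1, min(max_processes, available))
    return args


if __name__ == "__main__":
    print(build_args())                           # capped worker count
    print(build_args(use_multiprocessing=False))  # single-process conversion
```

The resulting dictionary would be passed as args=build_args() when constructing MultiLabelClassificationModel, in the same place as the args dictionary in the reproduction script.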

Thank you for your help and interest Thilina.

ThilinaRajapakse commented 4 years ago

I suspect it has something to do with the Threadripper architecture and how the processes are distributed on the cores.

You are welcome!

Lysimachos commented 4 years ago

I am going to do some digging into this. As soon as I have something new to add I will inform you.

Thanks again

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Mokers1234 commented 3 years ago

I have encountered the same infinite loop, upon train_model:

from simpletransformers.classification import ClassificationModel
import pandas as pd
test = ClassificationModel("distilbert", "distilbert-base-cased")
a = pd.DataFrame()
a['text'] = ['a', 'b', 'c', 'd', 'e']
a['labels'] = [1, 2, 3, 4, 5]
test.train_model(a)

2020-11-10 21:05:35.646764: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
I1110 21:06:19.791997 16520 classification_model.py:1073] Converting to features started. Cache is not used.
0%| | 0/5 [00:00<?, ?it/s]

It hangs at this point for a moment, before restarting with apparently two threads:

2020-11-10 21:07:34.159618: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
2020-11-10 21:07:42.104897: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[The same block of weight-initialization warnings is printed a second time.]
I1110 21:08:46.064993 8640 classification_model.py:1073] Converting to features started. Cache is not used.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 114, in _main
    prepare(preparation_data)
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 225, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
    run_name="__mp_main__")
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python36\lib\runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python36\lib\runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python36\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "c:\Users\Mark\Documents\NLP_comments2\test_download.py", line 7, in <module>
    test.train_model(a)
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python36\lib\site-packages\simpletransformers\classification\classification_model.py", line 383, in train_model
    train_dataset = self.load_and_cache_examples(train_examples, verbose=verbose)
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python36\lib\site-packages\simpletransformers\classification\classification_model.py", line 1112, in load_and_cache_examples
    args=args,
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python36\lib\site-packages\simpletransformers\classification\classification_utils.py", line 427, in convert_examples_to_features
    with Pool(process_count) as p:
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python36\lib\multiprocessing\context.py", line 119, in Pool
    context=self.get_context())
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 174, in __init__
    self._repopulate_pool()
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 239, in _repopulate_pool
    w.start()
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python36\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python36\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python36\lib\multiprocessing\popen_spawn_win32.py", line 33, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 143, in get_preparation_data
    _check_not_importing_main()
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 136, in _check_not_importing_main
    is not going to be frozen to produce an executable.''')
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:

    if __name__ == '__main__':
        freeze_support()
        ...

The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.

I1110 21:08:47.658993 8640 internal.py:138] Internal process exited
I1110 21:08:50.152998 4352 classification_model.py:1073] Converting to features started. Cache is not used.
[The same traceback and RuntimeError message are printed again for process 4352.]
I1110 21:08:51.215993 4352 internal.py:138] Internal process exited

The above output repeats endlessly. I tried setting use_multiprocessing to False, to no avail. Could it have something to do with my CUDA or torch versions?

Mokers1234 commented 3 years ago

According to the debugger, the problem is at line 1083 in classification_model.py:

features = convert_examples_to_features(
    examples,
    args.max_seq_length,
    tokenizer,
    output_mode,
    # XLNet has a CLS token at the end
    cls_token_at_end=bool(args.model_type in ["xlnet"]),
    cls_token=tokenizer.cls_token,
    cls_token_segment_id=2 if args.model_type in ["xlnet"] else 0,
    sep_token=tokenizer.sep_token,
    # RoBERTa uses an extra separator b/w pairs of sentences,
    # cf. github.com/pytorch/fairseq/commit/1684e166e3da03f5b600dbb7855cb98ddfcd0805
    sep_token_extra=bool(args.model_type in ["roberta", "camembert", "xlmroberta", "longformer"]),
    # PAD on the left for XLNet
    pad_on_left=bool(args.model_type in ["xlnet"]),
    pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
    pad_token_segment_id=4 if args.model_type in ["xlnet"] else 0,
    process_count=process_count,
    multi_label=multi_label,
    silent=args.silent or silent,
    use_multiprocessing=args.use_multiprocessing,
    sliding_window=args.sliding_window,
    flatten=not evaluate,
    stride=args.stride,
    add_prefix_space=bool(args.model_type in ["roberta", "camembert", "xlmroberta", "longformer"]),
    # avoid padding in case of single example/online inferencing to decrease execution time
    pad_to_max_length=bool(len(examples) > 1),
    args=args,
)

ThilinaRajapakse commented 3 years ago

This is likely a Windows issue. Multiprocessing and PyTorch don't play nicely with Windows.

You could try this fix.

For example:

def run():
    # Do everything here (create the model, train, evaluate, predict)
    ...

if __name__ == '__main__':
    run()

I'm not sure if that'll fix it though.
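Some background on why the guard matters: on Windows, multiprocessing starts workers with the spawn method, which re-imports the main script in every child process. Any training call at module level therefore runs again inside each worker, which is consistent with the repeated model loading and the RuntimeError in the log above. Below is a minimal standard-library sketch of the pattern. The function to_features is a toy stand-in for real tokenization and is not simpletransformers code.

```python
from multiprocessing import Pool


def to_features(text):
    # Toy stand-in for tokenization: lowercase and split on whitespace.
    return text.lower().split()


def run():
    texts = ["Example sentence one", "Another example sentence"]
    # Worker processes are created here. Under the spawn start method
    # (the default on Windows) each worker re-imports this file, so this
    # call must only be reachable through the __main__ guard below.
    with Pool(2) as pool:
        features = pool.map(to_features, texts)
    print(features)


if __name__ == "__main__":
    run()
```

With simpletransformers, the model construction and the train_model call would go inside run() in the same way, so that nothing at module level triggers feature conversion when a worker imports the script.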

BenF99 commented 3 years ago

I also seem to be stuck at "Converting to features started. Cache is not used", using a dataset containing 500,000 entries for binary classification.

Converting to features works using the CPU on my local machine, but it crashes afterwards, I assume due to the heavy demand when training.

When using Google Colab (CUDA), I get stuck at "Converting to features started. Cache is not used" with no progress after 8 hours.

marcmk6 commented 2 years ago

This is still an issue when I try to train a model with a total of ~100,000 entries. I'm pretty sure some processes crashed without warning, as I got some Linux core dump files.

IshchenkoRoman commented 2 years ago

I found a solution that works for me. As written above, the main problem is multiprocessing in inference mode. The solution is to switch off multiprocessing through the args of the ClassificationModel:

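from simpletransformers.classification import ClassificationModel

# The model type and name below are placeholders; substitute your own.
cm_object = ClassificationModel("roberta", "roberta-base")
# cm_object = torch.load("./model.pt")  # or load a previously saved model; this also works fine
cm_object.args.use_multiprocessing = False
cm_object.args.use_multiprocessing_for_evaluation = False
cm_object.args.multiprocessing_chunksize = 1
cm_object.args.dataloader_num_workers = 1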

I know this treats the symptoms rather than the root cause, but maybe it will be helpful to someone.

serdarildercaglar commented 2 years ago

> This is still an issue when I try to train a model with a total of ~100,000 entries. I'm pretty sure some processes crashed without warning, as I got some Linux core dump files.

Same issue

peilongchencc commented 2 years ago

> This is still an issue when I try to train a model with a total of ~100,000 entries. I'm pretty sure some processes crashed without warning, as I got some Linux core dump files.

I am also encountering the same issue. The following two pictures show what happens when I train the classification model with 1000 examples. [picture 1] [picture 2] Everything seems all right there, but things break when I expand the dataset to 7 million examples and 1700 categories. In the first case (picture 1), the training process is killed while converting to features. In the second case (picture 2), the training process gets stuck at some point and I get some large core.xxx files. [core files] How can I deal with this situation? Looking forward to your reply, @ThilinaRajapakse.

Melcfrn commented 1 year ago

I found a solution that works for me: "use_multiprocessing_for_evaluation": True, "multiprocessing_chunksize": 5 (choose a number that suits your setup).
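For context on what a chunk size controls: Python's multiprocessing.Pool.imap accepts a chunksize argument that groups the input into consecutive batches, and each batch is sent to a worker as a single task. Larger chunks mean fewer, bigger messages between processes. The sketch below only illustrates that grouping with a hypothetical helper, split_into_chunks. That simpletransformers forwards multiprocessing_chunksize to the pool in this way is an assumption here, not something confirmed in this thread.

```python
def split_into_chunks(items, chunksize):
    """Group items into consecutive batches of at most chunksize elements."""
    if chunksize < 1:
        raise ValueError("chunksize must be at least 1")
    return [items[i:i + chunksize] for i in range(0, len(items), chunksize)]


if __name__ == "__main__":
    examples = [f"text {i}" for i in range(12)]
    for batch in split_into_chunks(examples, 5):
        # With chunksize=5, twelve examples become batches of 5, 5 and 2.
        print(len(batch), batch)
```

A very small value keeps each task light at the cost of more inter-process traffic, which is one reason the right number can differ between datasets and machines.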

ThilinaRajapakse / simpletransformers

converting to features #225