ThilinaRajapakse / simpletransformers

Transformers for Information Retrieval, Text Classification, NER, QA, Language Modelling, Language Generation, T5, Multi-Modal, and Conversational AI
https://simpletransformers.ai/
Apache License 2.0

converting to features #225

Closed Lysimachos closed 4 years ago

Lysimachos commented 4 years ago

Describe the bug
I am trying to use RoBERTa for multi-label classification. I am facing problems when converting to features in both model.eval_model and model.predict. When using model.predict in the following code (as given by simpletransformers), it prints: "Converting to features started. Cache is not used. 0%| | 0/1 [00:00<?, ?it/s]" and then does nothing. It looks like it gets into an infinite loop or something.

To Reproduce

from simpletransformers.classification import MultiLabelClassificationModel
import pandas as pd

train_data = [
    ['Example sentence 1 for multilabel classification.', [1, 1, 1, 1, 0, 1]],
    ['This is another example sentence.', [0, 1, 1, 0, 0, 0]],
]
train_df = pd.DataFrame(train_data, columns=['text', 'labels'])

eval_data = [
    ['Example eval sentence for multilabel classification.', [1, 1, 1, 1, 0, 1]],
    ['Example eval sentence belonging to class 2', [0, 1, 1, 0, 0, 0]],
]
eval_df = pd.DataFrame(eval_data, columns=['text', 'labels'])

model = MultiLabelClassificationModel(
    'roberta',
    'roberta-base',
    num_labels=6,
    args={'reprocess_input_data': True, 'overwrite_output_dir': True, 'num_train_epochs': 5},
)
print(train_df.head())

model.train_model(train_df)

result, model_outputs, wrong_predictions = model.eval_model(eval_df)
print(result)
print(model_outputs)

predictions, raw_outputs = model.predict(['This thing is entirely different from the other thing.'])
print(predictions)
print(raw_outputs)

Lysimachos commented 4 years ago

OK, fixed as soon as I set "use_multiprocessing": False.

What does this setting do?

ThilinaRajapakse commented 4 years ago

When enabled, feature conversion is accelerated by using multiprocessing on CPUs with multiple cores. Without it, feature conversion can take hours on large datasets.
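For reference, a minimal sketch of how this option is passed through the args dict when the model is created, reusing the roberta-base setup from the reproduction script above (the values shown are illustrative, not recommended defaults):

from simpletransformers.classification import MultiLabelClassificationModel

# Sketch: turn off multiprocessing for feature conversion entirely.
# Slower on large datasets, but avoids the hang described in this issue.
model = MultiLabelClassificationModel(
    'roberta',
    'roberta-base',
    num_labels=6,
    args={'use_multiprocessing': False},
)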

Lysimachos commented 4 years ago

Thank you, ThilinaRajapakse, for your reply.

There seems to be an issue with "use_multiprocessing": True when GPU/CUDA is enabled.

ThilinaRajapakse commented 4 years ago

What are you running the code on? It's not an issue on my machine and I don't think anyone else has run into this issue either.

Lysimachos commented 4 years ago

Yes, there does seem to be an issue with my setup.

I am using the PyCharm remote interpreter on a machine with a GeForce RTX 2080 (CUDA 10.2) and an AMD Ryzen Threadripper 2950X 16-core processor.

ThilinaRajapakse commented 4 years ago

Maybe it's related to the PyCharm remote interpreter. My setup is pretty similar to yours (RTX Titan and Ryzen 2700X), unless there is an issue with the Threadripper series that I am not aware of. You could try setting "process_count": 8 to see whether it makes a difference. From what I can remember, Threadripper has two separate processors on the same chip, right?
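A minimal sketch of that suggestion, again assuming the roberta-base model from the reproduction script (8 is simply the suggested worker count; adjust it to your physical core count):

from simpletransformers.classification import MultiLabelClassificationModel

# Sketch: keep multiprocessing enabled but pin the number of feature-conversion
# worker processes instead of letting the library choose.
model = MultiLabelClassificationModel(
    'roberta',
    'roberta-base',
    num_labels=6,
    args={'use_multiprocessing': True, 'process_count': 8},
)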

Lysimachos commented 4 years ago

It is not the remote interpreter, because I tried running directly on the machine and got the same problem. Everything worked as it should when I used "process_count": 8. I also tried "process_count": 16 and everything worked fine.

Thank you for your help and interest Thilina.

ThilinaRajapakse commented 4 years ago

I suspect it has something to do with the Threadripper architecture and how the processes are distributed on the cores.

You are welcome!

Lysimachos commented 4 years ago

I am going to do some digging into this. As soon as I have something new to add I will inform you.

Thanks again

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Mokers1234 commented 3 years ago

I have encountered the same infinite loop when calling train_model:

from simpletransformers.classification import ClassificationModel
import pandas as pd

test = ClassificationModel("distilbert", "distilbert-base-cased")
a = pd.DataFrame()
a['text'] = ['a', 'b', 'c', 'd', 'e']
a['labels'] = [1, 2, 3, 4, 5]
test.train_model(a)

2020-11-10 21:05:35.646764: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']

It hangs at this point for a moment, before restarting with apparently two threads:

2020-11-10 21:07:34.159618: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
2020-11-10 21:07:42.104897: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']

The above output repeats endlessly. I tried setting use_multiprocessing to False, to no avail. Could it be something to do with my CUDA or PyTorch versions?

Mokers1234 commented 3 years ago

The problem, according to debugging, is line 1083 in classification_model.py:

features = convert_examples_to_features(
                examples,
                args.max_seq_length,
                tokenizer,
                output_mode,
                # XLNet has a CLS token at the end
                cls_token_at_end=bool(args.model_type in ["xlnet"]),
                cls_token=tokenizer.cls_token,
                cls_token_segment_id=2 if args.model_type in ["xlnet"] else 0,
                sep_token=tokenizer.sep_token,
                # RoBERTa uses an extra separator b/w pairs of sentences,
                # cf. github.com/pytorch/fairseq/commit/1684e166e3da03f5b600dbb7855cb98ddfcd0805
                sep_token_extra=bool(args.model_type in ["roberta", "camembert", "xlmroberta", "longformer"]),
                # PAD on the left for XLNet
                pad_on_left=bool(args.model_type in ["xlnet"]),
                pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
                pad_token_segment_id=4 if args.model_type in ["xlnet"] else 0,
                process_count=process_count,
                multi_label=multi_label,
                silent=args.silent or silent,
                use_multiprocessing=args.use_multiprocessing,
                sliding_window=args.sliding_window,
                flatten=not evaluate,
                stride=args.stride,
                add_prefix_space=bool(args.model_type in ["roberta", "camembert", "xlmroberta", "longformer"]),
                # avoid padding in case of single example/online inferencing to decrease execution time
                pad_to_max_length=bool(len(examples) > 1),
                args=args,
            )
ThilinaRajapakse commented 3 years ago

This is likely a Windows issue. Multiprocessing and PyTorch don't play nice with Windows.

You could try this fix.

For example:

def run():
    # Do everything here: build the model, call train_model, etc.
    pass


if __name__ == '__main__':
    run()

I'm not sure if that'll fix it though.
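For illustration, a sketch of the earlier reproduction script wrapped in that guard (binary labels are used here so the example is self-consistent, and a CUDA-capable machine is assumed; this is the standard Windows-safe multiprocessing pattern, not a guaranteed fix):

from simpletransformers.classification import ClassificationModel
import pandas as pd


def run():
    # Build and train the model only in the main process.
    model = ClassificationModel("distilbert", "distilbert-base-cased")
    train_df = pd.DataFrame(
        {"text": ["a", "b", "c", "d", "e"], "labels": [0, 1, 0, 1, 0]}
    )
    model.train_model(train_df)


if __name__ == "__main__":
    # On Windows, worker processes re-import this module; the guard stops them
    # from re-running the training code and spawning yet more workers.
    run()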

BenF99 commented 3 years ago

I also seem to be stuck at "Converting to features started. Cache is not used", using a dataset containing 500,000 entries for binary classification.

Converting to features works using the CPU on my local machine, but it crashes afterwards, which I assume is due to the heavy demand during training.

When using Google Colab (CUDA), I get stuck at "Converting to features started. Cache is not used" with no progress after 8 hours.

marcmk6 commented 2 years ago

This is still an issue: when I tried to train a model with a total of ~100,000 entries, I'm pretty sure some processes crashed without warning, as I got some Linux core dump files.

IshchenkoRoman commented 2 years ago

Found a solution that works for me. As written above, the main problem is multiprocessing in inference mode. The solution is to switch off multiprocessing using the args of the ClassificationModel:

from simpletransformers.classification import ClassificationModel

cm_object = ClassificationModel("roberta", "roberta-base")  # example model type/name; use your own
# cm_object = torch.load("./model.pt")  # loading a saved model also works fine
cm_object.args.use_multiprocessing = False
cm_object.args.use_multiprocessing_for_evaluation = False
cm_object.args.multiprocessing_chunksize = 1
cm_object.args.dataloader_num_workers = 1

I know this addresses the symptom rather than the root cause, but maybe it will be helpful for someone.

serdarildercaglar commented 2 years ago

This is still an issue: when I tried to train a model with a total of ~100,000 entries, I'm pretty sure some processes crashed without warning, as I got some Linux core dump files.

Same issue

peilongchencc commented 2 years ago

This is still an issue: when I tried to train a model with a total of ~100,000 entries, I'm pretty sure some processes crashed without warning, as I got some Linux core dump files.

I also encountered the same issue. Two screenshots (picture 1 and picture 2, not reproduced here) show the situation when I train the classification model with 1,000 examples, and everything seems all right. But if I expand the dataset to 7 million examples with 1,700 categories, then in the first case (picture 1) the training script gets killed while converting to features, and in the second case (picture 2) it gets stuck at a certain point and I end up with some large core.xxx files (screenshot omitted). How can I deal with this situation? Looking forward to your reply. @ThilinaRajapakse

Melcfrn commented 1 year ago

Found a solution that works for me: "use_multiprocessing_for_evaluation": True, "multiprocessing_chunksize": 5 (pick a chunk size suited to your setup).
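For reference, a minimal sketch of how those two settings might be passed when the model is created (the model type/name are placeholders, and 5 is just the example chunk size from the comment above):

from simpletransformers.classification import ClassificationModel

# Sketch: keep multiprocessing for evaluation, but hand work to the
# worker processes in small chunks as suggested above.
model = ClassificationModel(
    "roberta",
    "roberta-base",
    args={
        "use_multiprocessing_for_evaluation": True,
        "multiprocessing_chunksize": 5,
    },
)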