EricFillion / happy-transformer

Happy Transformer makes it easy to fine-tune and perform inference with NLP Transformer models.
http://happytransformer.com
Apache License 2.0

GPT-Neo fine-tuning does not insert new lines in the output results #283

Closed MattJeanes closed 2 years ago

MattJeanes commented 2 years ago

Hey, I'm not sure if this is a bug or whether my input training file (train.txt) is formatted incorrectly, but I cannot get the model to output the newline character in its results. The dataset is a conversation made up of around 230,000 messages, and I have tried the following formats:

Single new line between messages:

Person 1: message
Person 2: message
Person 3: message

Explicit newline character after the end of each message (this resulted in an escaped newline in the output data):

Person 1: message\n
Person 2: message\n
Person 3: message\n

Double new line between messages:

Person 1: message

Person 2: message

Person 3: message

Using an end-of-text marker, inserted every 10,000 messages (this is the format I have used successfully for GPT-2 training before):

Person 1: message
Person 2: message
<|endoftext|>
Person 3: message

All of these result in the output text being all on one line after some time training, for example: Person 1: some messagePerson2: another messagePerson3: one more message. Another model I have works correctly, so I don't believe the problem is in how I run inference on the completed model.
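For illustration, here is a hypothetical sketch of how the end-of-text variant of train.txt could be generated (messages is an assumed list of (speaker, text) pairs; none of these names come from the original post):

# Write one message per line and insert a GPT-2-style end-of-text marker
# every 10,000 messages, matching the last format described above.
with open("train.txt", "w", encoding="utf-8") as f:
    for i, (speaker, text) in enumerate(messages, start=1):
        f.write(f"{speaker}: {text}\n")
        if i % 10_000 == 0:
            f.write("<|endoftext|>\n")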

Here is the code I use to train the model:

from happytransformer import HappyGeneration, GENTrainArgs

# Missing from the original snippet: the HappyGeneration object itself
# (the exact GPT-Neo checkpoint is assumed; any would be loaded the same way).
happy_gen = HappyGeneration("GPT-NEO", "EleutherAI/gpt-neo-125M")

args = GENTrainArgs(num_train_epochs=1,
    # Note: previously ran with save instead of load here
    load_preprocessed_data=True,
    load_preprocessed_data_path="/content/drive/MyDrive/gpt-neo/preprocessed.json"
)
while True:
    happy_gen.train("/content/drive/MyDrive/gpt-neo/train.txt", args=args)
    happy_gen.save("/content/drive/MyDrive/gpt-neo/model")
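One caveat when iterating on data formats: with load_preprocessed_data=True the trainer reuses the tokenized cache in preprocessed.json, so edits to train.txt have no effect until that cache is regenerated. A minimal sketch of the save pass (same paths and happy_gen as above; the save fields are assumed from the snippet's own comment):

save_args = GENTrainArgs(num_train_epochs=1,
    # Regenerate the tokenized cache after editing train.txt, then switch
    # back to load_preprocessed_data=True for subsequent runs.
    save_preprocessed_data=True,
    save_preprocessed_data_path="/content/drive/MyDrive/gpt-neo/preprocessed.json"
)
happy_gen.train("/content/drive/MyDrive/gpt-neo/train.txt", args=save_args)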

And here is the code I am using to test inference:

from transformers import pipeline
generator = pipeline('text-generation', model='/content/drive/MyDrive/gpt-neo/model')
generator("Person 1:", do_sample=True, min_length=50, max_length=200)

So is there something I am doing wrong, or is this an issue with Happy Transformer? Thanks

AbdelrhmanNile commented 2 years ago

I have the same issue, but with GPT-2.

EricFillion commented 2 years ago

Thank you @MattJeanes for the detailed analysis, and thank you @AbdelrhmanNile for confirming the issue. I've created a PR that I believe solves it.
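For context on why newlines can disappear (an illustration of the likely mechanism, not necessarily the exact change in the PR): if the training file is loaded with the Hugging Face datasets "text" builder, each line becomes a separate example with its trailing newline stripped, so "\n" never reaches the tokenizer unless keep_linebreaks=True is passed:

from datasets import load_dataset

# The "text" builder yields one example per line and drops the newline itself.
ds = load_dataset("text", data_files={"train": "train.txt"})
print(ds["train"][0])  # {'text': 'Person 1: message'} -- no "\n" anywhere

# With keep_linebreaks=True the newline survives into the training corpus.
ds = load_dataset("text", data_files={"train": "train.txt"}, keep_linebreaks=True)
print(ds["train"][0])  # {'text': 'Person 1: message\n'}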

MattJeanes commented 2 years ago

That's great to hear! Thank you, I'll be sure to give it a go once it's merged 😁