ThilinaRajapakse / simpletransformers

Transformers for Information Retrieval, Text Classification, NER, QA, Language Modelling, Language Generation, T5, Multi-Modal, and Conversational AI
https://simpletransformers.ai/
Apache License 2.0

Confusion regarding docs on pre-training models #372

Closed: JacksonKearl closed this issue 4 years ago

JacksonKearl commented 4 years ago

First of all, thanks for your work on this project and for taking the time to help us try it out on the VS Code issue stream!

ref https://github.com/microsoft/vscode-github-triage-actions/issues/5

Reading through https://github.com/ThilinaRajapakse/simpletransformers#minimal-example-for-language-model-fine-tuning, it's not clear to me if this is fine-tuning or pre-training. It says fine-tuning, but I was linked to it as an example of pre-training.

If it is for pre-training, it's not clear to me what the test/train split should be. It was my impression that in pre-training we simply feed a very large set of unlabeled documents and the model learns from it -- I don't know what test vs. train would be in this case.

Additionally, it's not clear to me how the data I pass in for pretraining should be formatted.

Thanks again, Jackson

JacksonKearl commented 4 years ago

I note that there is a section for data format:

Data format

The data should simply be placed in a text file. E.g.: WikiText-2

Going to that webpage and downloading the "WikiText-2 word level" data, I get test/train/valid files that look like:

 = = Description = = 

 The fruitbodies ( <unk> ) of Geopyxis <unk> are cup shaped , 1 – 2 cm wide , and have fringed whitish margins . The inner spore @-@ bearing surface of the cup , the hymenium , is brick red and smooth , while the exterior surface is a dull yellow , and may be either smooth or have <unk> @-@ like spots ( <unk> ) . The stipe is small ( 1 – 1 @.@ 5 mm long and 1 – 2 mm wide ) , whitish in color , and expands abruptly into the cup . The brownish flesh of the fungus is thin and brittle . It does not have any distinctive taste , but has an unpleasant smell when crushed in water . The edibility of the fungus is not known , but the fruitbodies are <unk> and unlikely to be harvested for eating . 

 = = = Microscopic characteristics = = = 

 In mass , the spores are whitish . The spores are elliptical , smooth , hyaline , devoid of oil droplets ( <unk> ) , and have dimensions of 13 – 18 by 7 – 9 µm . They are thin walled and germinate and grow rapidly in vitro in the absence of external <unk> . The asci are 190 – 225 by 9 – 10 µm . The paraphyses are slightly club @-@ shaped , <unk> , and have irregular orange @-@ brown granules , with tips up to 5 µm wide , and are not forked or lobed . The <unk> , the layer of cells below the hymenium , is made of densely packed , small irregular cells . 

 = = = Similar species = = = 

 The closely related <unk> elf cup ( Geopyxis <unk> ) has a pale orange to yellowish <unk> that is deeply cup shaped before flattening in maturity , and its crushed flesh often has an odor of sulfur . It may be distinguished microscopically by its paraphyses , which lack the orange @-@ brown granules characteristic of G. carbonaria . It also has larger spores , measuring 14 – 22 by 8 – 11 µm . Unlike G. carbonaria , it grows on substrates other than burned wood , including mosses , and needle duff . <unk> <unk> , which grows habitats similar to G. carbonaria , is distinguished microscopically by its spores that contain two oil droplets . Other genera with similar species with which G. carbonaria may be confused in the field include <unk> , <unk> , <unk> , and <unk> . 

Of note are the @-@ tokens and the <unk> tokens. Are there any resources on how to generate similar files based on other datasets?

ThilinaRajapakse commented 4 years ago

The terminology is admittedly a little confusing. Let me try to break it down. If you consider the process of going from a bare-bones, randomly-initialized model (i.e. no training at all) to the final multiclass classifier for Github issues, it would look like this.

  1. Pre-train the model on a gigantic corpus of text so that the model can learn the language (English). This is called pre-training because it happens before the model is trained on the actual task you want it to do. The training objective used for this differs depending on the model. At the end of this stage, you will have a model like bert-base-cased, i.e. a model that "understands" the language but is not trained to do the task you want (multiclass classification). Since this step is common to all NLP tasks with Transformers, we don't need to do it every time. Instead, we take a pre-trained model like bert-base-cased and start with step 2.
  2. This step is where the confusion arises. The pre-trained model from step 1 (bert-base-cased) might not be particularly good at "understanding" highly technical language of the sort you find in Github issues, as it is pre-trained on generic English text. So, in order to improve its understanding of technical language, you can further pre-train the pre-trained model (from step 1) on your dataset containing only the text extracted from Github. This is referred to as fine-tuning the language model because you are essentially taking the model trained on generic English and fine-tuning it to perform better on technical English. However, the training objective and the procedure are identical in both steps 1 and 2. Because of this (I think), you can also say that you are pre-training the pre-trained model on the custom dataset. At the end of step 2, you will (hopefully) have a language model that can understand both generic English and the specialized, technical language from your task.
  3. Step 3 is where you train the model to do the actual end task, in this case multiclass classification of Github issues. For this, you can take the fine-tuned language model from step 2, slap on a classification layer (done automatically when you create a ClassificationModel), and fine-tune the ClassificationModel to actually classify the issues. (A rough code sketch of steps 2 and 3 follows this list.)
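
To make steps 2 and 3 concrete, a rough sketch with simpletransformers could look like the following. The file names, args, and the toy labelled examples are just placeholders, not anything specific to your project:

```python
import pandas as pd
import torch
from simpletransformers.language_modeling import LanguageModelingModel
from simpletransformers.classification import ClassificationModel

use_cuda = torch.cuda.is_available()

# Step 2: further pre-train (fine-tune) bert-base-cased on your issue text.
# "train.txt" / "test.txt" are plain text files with one sample per line.
lm_model = LanguageModelingModel(
    "bert",
    "bert-base-cased",
    args={
        "output_dir": "lm_outputs/",
        "max_seq_length": 256,
        "num_train_epochs": 1,
        "evaluate_during_training": True,
    },
    use_cuda=use_cuda,
)
lm_model.train_model("train.txt", eval_file="test.txt")

# Step 3: load the fine-tuned weights into a ClassificationModel (this adds the
# classification layer) and train it on labelled issues.
train_df = pd.DataFrame(
    [
        ["Editor crashes when opening large files", 0],
        ["Add a dark theme for the integrated terminal", 1],
    ],
    columns=["text", "labels"],
)
clf_model = ClassificationModel("bert", "lm_outputs/", num_labels=2, use_cuda=use_cuda)
clf_model.train_model(train_df)
```

Pointing the ClassificationModel at the output directory from the language model fine-tuning is what carries the step 2 weights over into step 3.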

I'm sorry if this was too basic or unnecessarily detailed! TLDR is that, in this context, fine-tuning the language model is the same as pre-training.

Your understanding of the pre-training dataset is correct. The train/test split would simply be the large, unlabelled dataset split into two pieces.

The data you pass in for pre-training doesn't need any special formatting. You can have a text file containing issue text, with one issue per line. Keep in mind that any lines longer than max_seq_length tokens will be truncated (max_seq_length also has an upper limit of 512 for BERT). If you find that your issues are longer than this, then it may be best to split the issue texts across multiple lines. Essentially, the text on any given line in the text file will be considered a single sample.

For example, if you consider this same issue thread, you could prepare a text file with each comment (split into chunks if it is long) on its own line, along the lines of the sketch below.
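
This is only a rough sketch: the issue strings, the word budget, and the 90/10 split are placeholders, and word counts are just a crude stand-in for tokens.

```python
import random

# Placeholder issue/comment texts; in practice these would come from the
# GitHub API or an export of the issue tracker.
issue_texts = [
    "First of all, thanks for your work on this project ...",
    "Reading through the README, it's not clear to me if this is fine-tuning or pre-training ...",
]

max_words = 200  # rough margin so each line stays under max_seq_length tokens
lines = []
for text in issue_texts:
    words = text.replace("\n", " ").split()
    # One sample per line; long issues are split across several lines.
    for start in range(0, len(words), max_words):
        lines.append(" ".join(words[start:start + max_words]))

# Simple train/test split of the unlabelled text.
random.shuffle(lines)
split = int(0.9 * len(lines))
with open("train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines[:split]) + "\n")
with open("test.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines[split:]) + "\n")
```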

You might find these articles helpful as well.

  1. Language Model Fine-Tuning (Doesn't go into too much detail)
  2. Language Model Fine-Tuning for GPT-2 language generation (Might help to get a better idea of what language model fine-tuning or pre-training a pre-trained model does)
  3. Training a Language Model from scratch (A lot more in-depth article on the pre-training process)
ThilinaRajapakse commented 4 years ago

I know that was probably a big info dump so please let me know if anything needs to be clarified!

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.