Closed — JacksonKearl closed this issue 4 years ago
I note that there is a section for data format:
Data format
The data should simply be placed in a text file. E.g.: WikiText-2
Going to that webpage and downloading the "WikiText-2 word level" data, I get test/train/valid files that look like:
= = Description = =
The fruitbodies ( <unk> ) of Geopyxis <unk> are cup shaped , 1 – 2 cm wide , and have fringed whitish margins . The inner spore @-@ bearing surface of the cup , the hymenium , is brick red and smooth , while the exterior surface is a dull yellow , and may be either smooth or have <unk> @-@ like spots ( <unk> ) . The stipe is small ( 1 – 1 @.@ 5 mm long and 1 – 2 mm wide ) , whitish in color , and expands abruptly into the cup . The brownish flesh of the fungus is thin and brittle . It does not have any distinctive taste , but has an unpleasant smell when crushed in water . The edibility of the fungus is not known , but the fruitbodies are <unk> and unlikely to be harvested for eating .
= = = Microscopic characteristics = = =
In mass , the spores are whitish . The spores are elliptical , smooth , hyaline , devoid of oil droplets ( <unk> ) , and have dimensions of 13 – 18 by 7 – 9 µm . They are thin walled and germinate and grow rapidly in vitro in the absence of external <unk> . The asci are 190 – 225 by 9 – 10 µm . The paraphyses are slightly club @-@ shaped , <unk> , and have irregular orange @-@ brown granules , with tips up to 5 µm wide , and are not forked or lobed . The <unk> , the layer of cells below the hymenium , is made of densely packed , small irregular cells .
= = = Similar species = = =
The closely related <unk> elf cup ( Geopyxis <unk> ) has a pale orange to yellowish <unk> that is deeply cup shaped before flattening in maturity , and its crushed flesh often has an odor of sulfur . It may be distinguished microscopically by its paraphyses , which lack the orange @-@ brown granules characteristic of G. carbonaria . It also has larger spores , measuring 14 – 22 by 8 – 11 µm . Unlike G. carbonaria , it grows on substrates other than burned wood , including mosses , and needle duff . <unk> <unk> , which grows habitats similar to G. carbonaria , is distinguished microscopically by its spores that contain two oil droplets . Other genera with similar species with which G. carbonaria may be confused in the field include <unk> , <unk> , <unk> , and <unk> .
Of note are the @-@ tokens and the <unk> tokens. Are there any resources on how to generate similar files based on other datasets?
The terminology is admittedly a little confusing. Let me try to break it down. If you consider the process of going from a bare-bones, randomly-initialized model (i.e. no training at all) to the final multiclass classifier for Github issues, it would look like this.

1. Pre-training the language model from scratch. This is how you get a model like bert-base-cased, i.e., a model that "understands" the language, but is not trained to do the task you want (multiclass classification). Since this step is common for all NLP tasks with Transformers, we don't need to do it every time. Instead, we take a pre-trained model like bert-base-cased and we start with step 2.
2. The pre-trained model (bert-base-cased) might not be particularly good at "understanding" highly technical language of the sort you find in Github issues, as it is pre-trained on generic English text. So, in order to improve its understanding of technical language, you can further pre-train the pre-trained model (from step 1) on your dataset containing only the text extracted from Github. This is referred to as fine-tuning the language model because you are essentially taking the model trained on generic English and fine-tuning it to perform better on technical English. However, the training objective and the procedure are identical in both step 1 and step 2. Because of this (I think), you can also say that you are pre-training the pre-trained model on the custom dataset. At the end of step 2, you will (hopefully) have a language model that can understand both generic English and the specialized, technical language from your task.
3. Finally, load the fine-tuned language model into the task-specific model (ClassificationModel), and fine-tune the ClassificationModel to actually classify the issues.

I'm sorry if this was too basic or unnecessarily detailed! TLDR is that, in this context, fine-tuning the language model is the same as pre-training.
Your understanding of the pre-training dataset is correct. The train/test split would simply be the large, unlabelled dataset split into two pieces.
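If it helps, that split can be as simple as shuffling the lines of the unlabelled text file and writing most of them to a train file and the rest to an eval file. A minimal sketch (the file names and the 90/10 ratio are just placeholders):

```python
import random

# Read the unlabelled issue text, one sample per line.
with open("all_issues.txt", encoding="utf-8") as f:
    lines = [line.strip() for line in f if line.strip()]

random.seed(42)
random.shuffle(lines)

# Arbitrary 90/10 split into train and eval files.
cutoff = int(0.9 * len(lines))
with open("issues_train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines[:cutoff]) + "\n")
with open("issues_eval.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines[cutoff:]) + "\n")
```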
The data you pass in for pre-training doesn't need any special formatting. You can have a text file containing issue text, with one issue per line. Keep in mind that any lines longer than max_seq_length tokens will be truncated (max_seq_length also has an upper limit of 512 for BERT). If you find that your issues are longer than this, then it may be best to split the issue texts across multiple lines. Essentially, the text on any given line in the text file will be considered a single sample.
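If you want to check (or pre-split) long issues yourself, one possible approach, assuming the transformers tokenizer is available, is to chunk each issue by token count before writing it out. The 510 limit below just leaves room for the [CLS] and [SEP] special tokens:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

def split_issue(text, max_tokens=510):
    """Split one issue body into chunks short enough to avoid truncation."""
    tokens = tokenizer.tokenize(text)
    chunks = [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]
    return [tokenizer.convert_tokens_to_string(chunk) for chunk in chunks]

# Each returned chunk would then go on its own line in the training file.
```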
For example, if you consider this same issue thread, you can prepare a text file like this.
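The lines below are only an illustration of the format (one sample per line, loosely paraphrasing this thread), not the exact text you would generate:

```
I note that there is a section for data format. Downloading the WikiText-2 word level data, I get test/train/valid files. Are there any resources on how to generate similar files based on other datasets?
The terminology is admittedly a little confusing. In this context, fine-tuning the language model is the same as pre-training.
The data you pass in for pre-training doesn't need any special formatting. The text on any given line in the text file will be considered a single sample.
```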
You might find these articles helpful as well.
I know that was probably a big info dump so please let me know if anything needs to be clarified!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
First of all, thanks for your work on this project and taking the time to help out with trying it out on the vscode issue stream!
ref https://github.com/microsoft/vscode-github-triage-actions/issues/5
Reading through https://github.com/ThilinaRajapakse/simpletransformers#minimal-example-for-language-model-fine-tuning, it's not clear to me if this is fine-tuning or pre-training. It says fine-tuning, but I was linked to it as an example of pre-training.
If it is for pre-training, it's not clear to me what the test/train split should be. It was my impression that in pre-training we simply feed a very large set of unlabeled documents and the model learns from it -- I don't know what test vs. train would be in this case.
Additionally, it's not clear to me how the data I pass in for pre-training should be formatted.
Thanks again, Jackson