ThilinaRajapakse / pytorch-transformers-classification

Based on the Pytorch-Transformers library by HuggingFace. To be used as a starting point for employing Transformer models in text classification tasks. Contains code to easily train BERT, XLNet, RoBERTa, and XLM models for text classification.
Apache License 2.0
306 stars, 97 forks

Minor Issue :2 - Reading input files. #3

Closed: pythonometrist closed this issue 5 years ago

pythonometrist commented 5 years ago

The data processor function identifies the labels and text by column position.

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            guid = "%s-%s" % (set_type, i)
            text_a = line[3]
            label = line[1]
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
        return examples

This is a problem because pandas is used to generate the tsv files, and the order in which the columns are saved differs between pandas 0.24 and 0.25. It might be better to save the column names and directly name the label column. I ran into this issue because I had to work on two different machines, and on the second machine it would crash, saying label_id was used before being assigned.
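A minimal sketch of the idea, assuming the tsv files are written with a header row. The column names `id`, `label`, `alpha`, and `text`, the function `create_examples_by_name`, and the `InputExample` stand-in are all hypothetical, chosen so the snippet runs on its own; they are not the repository's actual code.

```python
import csv
import io
from collections import namedtuple

# Hypothetical stand-in for the repository's InputExample class,
# defined here only so that the sketch is self-contained.
InputExample = namedtuple("InputExample", ["guid", "text_a", "text_b", "label"])


def create_examples_by_name(tsv_file, set_type, text_col="text", label_col="label"):
    """Build examples by looking up columns by header name, not by position."""
    # DictReader uses the first row as the header and yields one dict per row.
    reader = csv.DictReader(tsv_file, delimiter="\t")
    examples = []
    for i, row in enumerate(reader):
        guid = "%s-%s" % (set_type, i)
        examples.append(
            InputExample(guid=guid, text_a=row[text_col], text_b=None,
                         label=row[label_col]))
    return examples


# The same record saved with two different column orders (assumed data).
tsv_a = "id\tlabel\talpha\ttext\n0\t1\ta\tgood movie\n"
tsv_b = "text\tid\talpha\tlabel\ngood movie\t0\ta\t1\n"

examples_a = create_examples_by_name(io.StringIO(tsv_a), "train")
examples_b = create_examples_by_name(io.StringIO(tsv_b), "train")
```

Both inputs produce the same example, so a change in column order between machines would no longer affect which field is read as the label.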

ThilinaRajapakse commented 5 years ago

I'll check this. But any version of pandas should be saving the df in the same order it is in. Maybe there's something going on with my saving code.

ThilinaRajapakse commented 5 years ago

You were right. 0.24 seems to have been doing weird stuff when writing out dfs to files. I changed the Colab notebook and the data_prep notebook to specify the column names when writing the tsv files.
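For anyone hitting the same problem, here is a rough sketch of pinning the column order when writing the file. The column names and sample rows are assumptions for illustration, and writing without a header row is also an assumption made so that positions 1 and 3 match the positional reader quoted above; the notebooks may differ in detail.

```python
import io

import pandas as pd

# Assumed sample data; the DataFrame's own column order is deliberately
# different from the order we want in the file.
df = pd.DataFrame({
    "text": ["good movie", "bad movie"],
    "label": [1, 0],
    "id": [0, 1],
    "alpha": ["a", "a"],
})

# Explicit column order: label at position 1, text at position 3.
COLUMNS = ["id", "label", "alpha", "text"]

# Write to an in-memory buffer here; a file path works the same way.
buffer = io.StringIO()
df.to_csv(buffer, sep="\t", index=False, header=False, columns=COLUMNS)

tsv_text = buffer.getvalue()
first_row = tsv_text.splitlines()[0].split("\t")
```

Passing `columns=` makes the output order independent of how the DataFrame happens to be laid out, so `line[1]` and `line[3]` stay the label and the text.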

pythonometrist commented 5 years ago

Thanks!!