ProsusAI / finBERT

Financial Sentiment Analysis with BERT
Apache License 2.0

Unable to train the model #16

Closed nayomirana closed 4 years ago

nayomirana commented 4 years ago

Hi,

I downloaded the Financial PhraseBank dataset (Malo et al., 2014) and created train.csv from it. However, running the following line from the "finbert_training" notebook raises the error shown below:

    train_data = finbert.get_data('train')

Is there a way to resolve this issue?

Thanks.


UnicodeDecodeError                        Traceback (most recent call last)
in
      7 #print(cl_data_path)
      8 # Get the training examples
----> 9 train_data = finbert.get_data('train')

~\Documents\FIN_BERT\finBERT-master\finbert\finbert.py in get_data(self, phase)
    192         self.num_train_optimization_steps = None
    193         examples = None
--> 194         examples = self.processor.get_examples(self.config.data_dir, phase)
    195         self.num_train_optimization_steps = int(
    196             len(

~\Documents\FIN_BERT\finBERT-master\finbert\utils.py in get_examples(self, data_dir, phase)
     89             Name of the .csv file to be loaded.
     90         """
---> 91         return self._create_examples(self._read_tsv(os.path.join(data_dir, (phase + ".csv"))), phase)
     92
     93     def get_labels(self):

~\Documents\FIN_BERT\finBERT-master\finbert\utils.py in _read_tsv(cls, input_file)
     66             reader = csv.reader(f, delimiter="\t")
     67             lines = []
---> 68             for line in reader:
     69                 if sys.version_info[0] == 2:
     70                     line = list(unicode(cell, 'utf-8') for cell in line)

D:\Python\lib\encodings\cp1252.py in decode(self, input, final)
     21 class IncrementalDecoder(codecs.IncrementalDecoder):
     22     def decode(self, input, final=False):
---> 23         return codecs.charmap_decode(input,self.errors,decoding_table)[0]
     24
     25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7919: character maps to <undefined>

akmalsabri commented 4 years ago

Hi @nayomirana, how did you solve this?

Buzzpod commented 2 years ago

The solution is to edit line 69 in finbert/utils.py as follows:

Original: with open(input_file, "r") as f:
Modified: with open(input_file, "r", encoding="utf8") as f:

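You have to add the encoding="utf8" parameter to the open function to avoid the 'charmap' error above. Without it, open() falls back to the platform's default encoding, which is cp1252 on the Windows machine in the traceback, and that codec cannot decode byte 0x81.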
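
For anyone who wants to check the behaviour outside the repo, here is a minimal standalone sketch of the same idea. It is not the actual finBERT code: the read_tsv helper, the sample row, and the temporary file are all made up for illustration, and it assumes your train.csv really is UTF-8 encoded.

```python
import csv
import os
import tempfile


def read_tsv(input_file, encoding="utf-8"):
    """Hypothetical helper: read a tab-separated file with an explicit encoding."""
    # newline="" is what the csv module documentation recommends for csv.reader
    with open(input_file, "r", encoding=encoding, newline="") as f:
        return list(csv.reader(f, delimiter="\t"))


# Made-up sample row. "\u00c1" is stored in UTF-8 as bytes 0xC3 0x81,
# so the file contains the same 0x81 byte that appears in the traceback.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", encoding="utf-8", newline="", delete=False
) as tmp:
    tmp.write("0\t\u00c1rea sales rose\tpositive\n")
    sample_path = tmp.name

# Decoding as UTF-8 works.
rows = read_tsv(sample_path)

# Decoding as cp1252 (the default picked up in the traceback) fails on 0x81.
try:
    read_tsv(sample_path, encoding="cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

os.remove(sample_path)
```

If your file was saved in some other encoding, passing "utf-8" will not help and you would need to pass the encoding the file was actually written with.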