Open-Power-System-Data / time_series

Data package: time series of load, wind and solar generation
http://data.open-power-system-data.org/time_series/
MIT License

time_series download script #23

Closed by sn4i1 5 years ago

sn4i1 commented 6 years ago

Hi,

I came across this issue when downloading TS data (reading ENTSO-E Data Portal) from the sources included in sources.yml (https://github.com/Open-Power-System-Data/time_series/blob/master/timeseries_scripts/download.py):

2018-04-30 19:16:37 INFO reading ENTSO-E Data Portal - load Progress: ██████████████████████████████████████████████████ 0/120 files

XLRDError Traceback (most recent call last)

in ()
     12 url, res_key, headers,
     13 start_from_user=start_from_user,
---> 14 end_from_user=end_from_user)
     15
     16 os.makedirs(res_key, exist_ok=True)

MYDIR/project/processing/timeseries_scripts/read.py in read(data_path, areas, source_name, variable_name, url, res_key, headers, start_from_user, end_from_user)
   1312 areas, filepath, variable_name, url, headers, res_key)
   1313 elif source_name == 'ENTSO-E Data Portal':
-> 1314 data_to_add = read_entso_e_portal(filepath, url, headers)
   1315 elif source_name == 'ENTSO-E Power Statistics':
   1316 data_to_add = read_entso_e_statistics(filepath, url, headers)

MYDIR/project/processing/timeseries_scripts/read.py in read_entso_e_portal(filepath, url, headers)
    543 '''Read a file from ENTSO-E into a DataFrame'''
    544 df = pd.read_excel(
--> 545 io=xlrd.open_workbook(filepath, logfile=open(os.devnull, 'w')),
    546 header=9, # 0 indexed, so the column names are actually in the 10th row
    547 skiprows=None,

MYDIR/project/sources/lib/python3.6/site-packages/xlrd/__init__.py in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
    160 formatting_info=formatting_info,
    161 on_demand=on_demand,
--> 162 ragged_rows=ragged_rows,
    163 )
    164 return bk

MYDIR/project/sources/lib/python3.6/site-packages/xlrd/book.py in open_workbook_xls(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
     89 t1 = time.clock()
     90 bk.load_time_stage_1 = t1 - t0
---> 91 biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
     92 if not biff_version:
     93 raise XLRDError("Can't determine file's BIFF version")

MYDIR/project/sources/lib/python3.6/site-packages/xlrd/book.py in getbof(self, rqd_stream)
   1269 bof_error('Expected BOF record; met end of file')
   1270 if opcode not in bofcodes:
-> 1271 bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
   1272 length = self.get2bytes()
   1273 if length == MY_EOF:

MYDIR/project/sources/lib/python3.6/site-packages/xlrd/book.py in bof_error(msg)
   1263 if DEBUG: print("reqd: 0x%04x" % rqd_stream, file=self.logfile)
   1264 def bof_error(msg):
-> 1265 raise XLRDError('Unsupported format, or corrupt file: ' + msg)
   1266 savpos = self._position
   1267 opcode = self.get2bytes()

XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'
jgmill commented 6 years ago

ENTSO-E has recently overhauled their website, and it seems that in the process they switched off access to the old data portal that supplied the 2006-2015 hourly load data. That data is supposedly now available as one big file under https://www.entsoe.eu/data/data-portal, but the link doesn't work yet. When it does, I will update the script accordingly.
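If you want to confirm that the downloaded files are not real Excel workbooks (which is what the "Expected BOF record" message usually points to), you can look at the first bytes of one of them before handing it to xlrd. The following is only a rough sketch: `sniff_excel_format` is a hypothetical helper, not part of the OPSD scripts, and it relies only on the well-known file signatures of OLE2 (.xls) and zip (.xlsx) containers.

```python
# Leading bytes of the two container formats Excel workbooks normally use.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .xls (OLE2 compound file)
ZIP_MAGIC = b"PK\x03\x04"                        # zip archive, e.g. .xlsx


def sniff_excel_format(filepath):
    """Hypothetical helper: guess the container type from the file's first bytes.

    Returns 'xls' for an OLE2 compound file, 'zip' for a zip archive
    (possibly .xlsx), and None for anything else (e.g. an HTML page).
    """
    with open(filepath, "rb") as f:
        head = f.read(8)
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    return None
```

If this returns None for a file that read_entso_e_portal is about to open, the server most likely delivered something other than a spreadsheet, and the download rather than the reading step is at fault.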

In the meantime, you could either

  • set archive_version = '2018-03-13' in processing.ipynb to download an archived version of the data from the OPSD server, or
  • set exclude = ['ENTSO-E Data Portal'] to skip this source when downloading/reading the data.
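As a rough illustration of the two options: the variable names archive_version and exclude are the ones mentioned above, but everything else in this sketch is assumed. `select_sources` is a hypothetical helper, and the `sources` dict is just a stand-in for what sources.yml might contain. It is not how the OPSD scripts actually apply these settings.

```python
# Option 1: point the notebook at an archived copy of the data on the OPSD server.
archive_version = '2018-03-13'

# Option 2: skip the source that currently cannot be read.
exclude = ['ENTSO-E Data Portal']


def select_sources(sources, exclude=None):
    """Hypothetical helper: drop every source whose name is listed in `exclude`."""
    excluded = set(exclude or [])
    return {name: cfg for name, cfg in sources.items() if name not in excluded}


# Stand-in for the parsed contents of sources.yml (made-up structure).
sources = {
    'ENTSO-E Data Portal': {'variables': ['load']},
    'ENTSO-E Power Statistics': {'variables': ['load']},
}

selected = select_sources(sources, exclude)
print(list(selected))
```

Only one of the two settings is needed at a time. With option 2, the 2006-2015 hourly load data from the old portal is simply left out of the result.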
jgmill commented 5 years ago

I have fixed this in the current version of the time series data package: https://data.open-power-system-data.org/time_series/2019-06-05