autonomio / talos

Hyperparameter Experiments with TensorFlow and Keras
https://autonom.io
MIT License

CSV parsererror with TerminateOnNaN callback #297

Closed vendetta1987 closed 5 years ago

vendetta1987 commented 5 years ago

I noticed my talos scans crashing recently and found the following stack trace:

File "pathtomy\script.py", line 431, in runTalosOptimization(_TRAIN_DATA_GEN, talosParams) File "pathtomy\script.py", line 350, in runTalosOptimization print_params=True, dataset_name=TALOS_LOG_NAME, experiment_no=str(expNr)) File "D:\conda\envs\pip_only\lib\site-packages\talos\scan\Scan.py", line 170, in init self._null = self.runtime() File "D:\conda\envs\pip_only\lib\site-packages\talos\scan\Scan.py", line 175, in runtime self = scan_run(self) File "D:\conda\envs\pip_only\lib\site-packages\talos\scan\scan_run.py", line 18, in scan_run self = scan_round(self) File "D:\conda\envs\pip_only\lib\site-packages\talos\scan\scan_round.py", line 65, in scan_round self = reduce_run(self) File "D:\conda\envs\pip_only\lib\site-packages\talos\reducers\reduce_run.py", line 16, in reduce_run self = reduce_prepare(self) File "D:\conda\envs\pip_only\lib\site-packages\talos\reducers\reduce_prepare.py", line 13, in reduce_prepare self.data = pd.read_csv(self.experiment_name + '.csv') File "D:\conda\envs\pip_only\lib\site-packages\pandas\io\parsers.py", line 702, in parser_f return _read(filepath_or_buffer, kwds) File "D:\conda\envs\pip_only\lib\site-packages\pandas\io\parsers.py", line 435, in _read data = parser.read(nrows) File "D:\conda\envs\pip_only\lib\site-packages\pandas\io\parsers.py", line 1139, in read ret = self._engine.read(nrows) File "D:\conda\envs\pip_only\lib\site-packages\pandas\io\parsers.py", line 1995, in read data = self._reader.read(nrows) File "pandas_libs\parsers.pyx", line 899, in pandas._libs.parsers.TextReader.read File "pandas_libs\parsers.pyx", line 914, in pandas._libs.parsers.TextReader._read_low_memory File "pandas_libs\parsers.pyx", line 968, in pandas._libs.parsers.TextReader._read_rows File "pandas_libs\parsers.pyx", line 955, in pandas._libs.parsers.TextReader._tokenize_rows File "pandas_libs\parsers.pyx", line 2172, in pandas._libs.parsers.raise_parser_error pandas.errors.ParserError: Error tokenizing data. C error: Expected 18 fields in line 5, saw 20

Indeed, the CSV report doesn't contain validation values for some runs that ended with NaN or Inf loss values. This seems to be related to, but not limited to, the use of the TerminateOnNaN() callback provided with Keras. I thought the callback was quite helpful for shortening scanning times, but if the first run gets terminated automatically, the resulting CSV report lacks the columns needed to continue:

round_epochs,loss,mean_squared_error,convBlockCnt,filterCntStart,b1_size,b1_drop,b2_size,b2_drop,b3_size,b3_drop,b4_size,b4_drop,b5_size,b5_drop,dense_size,dense_act,dense_drop
1,inf,inf,3,32,7,0.1,3,0.30000000000000004,7,0.1,3,0.5000000000000001,7,0.1,308,<function relu at 0x000001A939FF06A8>,0.30000000000000004
1,nan,nan,3,32,3,0.5000000000000001,7,0.5000000000000001,5,0.30000000000000004,5,0.5000000000000001,5,0.7000000000000001,308,<function linear at 0x000001A939FF0950>,0.5000000000000001
1,nan,nan,5,8,3,0.30000000000000004,5,0.30000000000000004,5,0.7000000000000001,7,0.7000000000000001,3,0.5000000000000001,308,<function linear at 0x000001A939FF0950>,0.7000000000000001
1,0.028076652523707033,0.028076652523707033,0.1728719637854572,0.1728719637854572,2,8,3,0.30000000000000004,3,0.7000000000000001,5,0.5000000000000001,5,0.1,5,0.1,212,<function relu at 0x000001A939FF06A8>,0.30000000000000004
1,0.047796255958774674,0.047796255958774674,0.17865630197601579,0.17865630197601579,5,16,7,0.1,3,0.7000000000000001,3,0.1,3,0.5000000000000001,7,0.5000000000000001,116,<function linear at 0x000001A939FF0950>,0.30000000000000004

Maybe Talos could check whether all data is available and fill gaps with something meaningful. At the very least it shouldn't crash.
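
For reference, here is a minimal standalone snippet (not taken from Talos itself, with the column counts cut down for brevity) that reproduces the same pandas error when later rows carry more fields than the header:

import io
import pandas as pd

# Header written after a first round that terminated on NaN: no validation
# columns (3 fields). A later successful round appends 5 fields per row.
csv_text = (
    "round_epochs,loss,mean_squared_error\n"
    "1,nan,nan\n"
    "1,0.0281,0.0281,0.1729,0.1729\n"
)

# pandas sizes the table from the header, so the wider row triggers
# "Error tokenizing data. C error: Expected 3 fields in line 3, saw 5".
try:
    pd.read_csv(io.StringIO(csv_text))
except pd.errors.ParserError as err:
    print(err)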

mikkokotila commented 5 years ago

I checked the data and the error message, but it appears that there are 18 columns (and not 20, as the error message in the provided trace says). The fact that a column contains NaN does not by itself cause the error you have in the trace. Or did I miss something?

vendetta1987 commented 5 years ago

Thank you for picking this up. The problem seems to stem from the fact that Talos expects the number of columns the first run produced. If the first run was successful, the table header will contain the validation data columns; if it wasn't, those will be missing from the header. On the next run the CSV is read back in, and if that run had a different outcome, its number of result columns will differ from the header. Internally pandas then throws the given error. This happens especially with interleaved NaN/Inf runs.
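
If it helps, the "fill gaps" idea could look roughly like this on the reading side. This is only a hypothetical sketch (the read_talos_log helper and the extra_* column names are made up), not how Talos handles it internally:

import csv
import pandas as pd

def read_talos_log(path):
    """Read a Talos experiment CSV whose header may be shorter than later rows."""
    with open(path, newline="") as fh:
        rows = list(csv.reader(fh))

    header, data = rows[0], rows[1:]
    width = max(len(r) for r in rows)

    # Pad the header with placeholder names for the missing validation columns
    # and pad short rows with empty strings, so every row has the same width.
    header = header + ["extra_%d" % i for i in range(width - len(header))]
    data = [r + [""] * (width - len(r)) for r in data]

    return pd.DataFrame(data, columns=header)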

mikkokotila commented 5 years ago

Thanks for the clarification. Can you share the output CSV you get when this error is reported?

vendetta1987 commented 5 years ago

It's the same one I posted above. Looking at the last two lines, you'll notice they actually contain two additional column entries each, both validation results.

mikkokotila commented 5 years ago

Can you try:

pip install -U git+https://github.com/autonomio/talos@params-api-test

This installs the next major release (v0.6), where logging has been rebuilt to a significant extent.

vendetta1987 commented 5 years ago

I've now updated to that branch and fixed the errors that popped up; it seems some API changes will need to be reflected in the documentation. I let it run over a weekend and it was still running on my return. Although there were no NaN results (those were fixed elsewhere), the new Scan() method doesn't seem to do random testing anymore. The following code happily checks all possible parameter combinations in order; there is no random choice of combinations to cover more ground more quickly.

ta.Scan(dummy[0], dummy[1], talosParams, modelFncTalos,
        random_method="quantum",
        clear_session=True,
        reduction_method="correlation",
        reduction_metric="val_loss",
        minimize_loss=True,
        reduction_interval=5,
        reduction_window=5,
        print_params=True,
        experiment_name=TALOS_LOG_NAME+"_"+str(expNr))

What needs to change in the call to get the random sampling back? I do like that it no longer creates all combinations up front and hogs RAM, though.

mikkokotila commented 5 years ago

What do you mean by not doing random testing anymore? In your case you have chosen the quantum random method, so that will be used to pick the n permutations. But I can see that you did not set fraction_limit, so you will get all the permutations in your parameter space.
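
Roughly like this, a sketch of your call from above with just that one argument added (the 0.1 is only an example value):

ta.Scan(dummy[0], dummy[1], talosParams, modelFncTalos,
        random_method="quantum",
        fraction_limit=0.1,  # evaluate only ~10% of the permutation space
        clear_session=True,
        reduction_method="correlation",
        reduction_metric="val_loss",
        minimize_loss=True,
        reduction_interval=5,
        reduction_window=5,
        print_params=True,
        experiment_name=TALOS_LOG_NAME + "_" + str(expNr))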

mikkokotila commented 5 years ago

Closing here. Feel free to open a new issue if anything comes up. Thanks and have a great day too! :)