gammasim / simtools

Tools and applications for the Simulation System of the CTA Observatory.
https://gammasim.github.io/simtools
BSD 3-Clause "New" or "Revised" License

derive_mirror_rnda errors #176

Closed. GernotMaier closed this issue 2 years ago.

GernotMaier commented 2 years ago

derive_mirror_rnda is not working for me, see the error below.

There is a fix for this in that line of the code, so this is probably known?

python applications/derive_mirror_rnda.py --site North --telescope MST-FlashCam-D --mean_d80 1.4 --sig_d80 0.16 --mirror_list mirror_MST_focal_lengths.dat --d80_list mirror_MST_D80.dat --rnda 0.0075
INFO::db_handler(l579)::_getSiteParametersYaml::Reading DB file /workdir/external/simulation-model-description/configReports/parValues-Sites.yml
Traceback (most recent call last):
  File "applications/derive_mirror_rnda.py", line 227, in <module>
    tel.changeParameter("mirror_list", args.mirror_list)
  File "/workdir/external/gammasim-tools/simtools/model/telescope_model.py", line 415, in changeParameter
    if not isinstance(value, type(self._parameters[parName]["Value"])):
TypeError: string indices must be integers
orelgueta commented 2 years ago

This happens without using the DB, right?

GernotMaier commented 2 years ago

I have 'useMongoDB: true', so that should not be the issue.

I really struggle to set all the values in config.yml correctly; maybe we can improve the documentation and the error messages.

e.g., instead of writing:

ERROR::config(l84)::get::Config does not contain dataLocation

Write:

ERROR::config(l84)::get::Config keyword dataLocation not found in file <full path to file>

It took me a while to find out that dataLocation is a required keyword (for some reason I had testdataLocation in my config.yml file), and then I did not know from which directory the config.yml file was read.
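For illustration, a minimal sketch of such a lookup with a more informative error message (a hypothetical helper, not the actual simtools config module; names are illustrative):

import logging

logger = logging.getLogger(__name__)

def get_config_entry(config, key, config_file):
    # Look up a config keyword and, on failure, report both the missing
    # keyword and the full path of the file it was expected in.
    try:
        return config[key]
    except KeyError:
        msg = f"Config keyword {key} not found in file {config_file}"
        logger.error(msg)
        raise KeyError(msg)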

Note that the full error message here is very confusing:

python gammasim-tools/applications/derive_mirror_rnda.py --site North --telescope MST-FlashCam-D --mean_d80 1.4 --sig_d80 0.16 --mirror_list mirror_MST_focal_lengths.dat --d80_list mirror_MST_D80.dat --rnda 0.0075
maierg@warp.zeuthen.desy.de's password:
ERROR::config(l84)::get::Config does not contain dataLocation
Traceback (most recent call last):
  File "gammasim-tools/applications/derive_mirror_rnda.py", line 305, in <module>
    meanD80, sigD80 = run(rndaStart)
  File "gammasim-tools/applications/derive_mirror_rnda.py", line 235, in run
    ray = RayTracing.fromKwargs(
  File "/workdir/external/gammasim-tools/simtools/ray_tracing.py", line 198, in fromKwargs
    return cls(**args, configData=configData)
  File "/workdir/external/gammasim-tools/simtools/ray_tracing.py", line 126, in __init__
    _parameterFile = io.getDataFile("parameters", "ray-tracing_parameters.yml")
  File "/workdir/external/gammasim-tools/simtools/io_handler.py", line 187, in getDataFile
    Path(cfg.get("dataLocation")).joinpath(parentDir).joinpath(fileName).absolute()
  File "/workdir/external/gammasim-tools/simtools/config.py", line 85, in get
    raise KeyError()
KeyError
INFO::db_handler(l187)::_closeSSHTunnel::Closing SSH tunnel(s)

It doesn't find a keyword in a config.yml, but most of the error message is about a database connection and ssh issues.

Don't worry about the error message itself. I think all the issues I have are that files are not found in the modelFilesLocations. This is probably an error in how I set it up on my side (and we need to get this stuff into the database).

GernotMaier commented 2 years ago

I think I get it now: dataLocation needs to point to 'gammasim-tools/data' and it contains configuration files required to run gammasim-tools. That was completely unexpected!

As a user, I did not expect that anything else would be needed when running, e.g.,

python gammasim-tools/applications/derive_mirror_rnda.py --site North --telescope MST-FlashCam-D --mean_d80 1.4 --sig_d80 0.16 --mirror_list mirror_MST_focal_lengths.dat --d80_list mirror_MST_D80.dat --rnda 0.0075

This is quite a rich command line (nothing wrong with that), but it does not indicate that a file named 'data/parameters/ray-tracing_parameters.yml' is required to run the application.

Why did it work for me a couple of months ago? I think the reason is that now I don't start the applications from the gammasim-tools directory but from somewhere else (which means the default paths are not set correctly).

RaulRPrado commented 2 years ago

I think I get it now: dataLocation needs to point to 'gammasim-tools/data' and it contains configuration files required to run gammasim-tools. That was completely unexpected!

The 'data' directory is needed for many things. It is where we store many parameter files, layout files, test files, etc.

I made it a config entry because we might want to move it somewhere else in the future, but for now we keep it inside the main repo.

As a user, I did not expect that anything else would be needed when running, e.g.,

python gammasim-tools/applications/derive_mirror_rnda.py --site North --telescope MST-FlashCam-D --mean_d80 1.4 --sig_d80 0.16 --mirror_list mirror_MST_focal_lengths.dat --d80_list mirror_MST_D80.dat --rnda 0.0075

This is quite a rich command line (nothing wrong with that), but it does not indicate that a file named 'data/parameters/ray-tracing_parameters.yml' is required to run the application.

It should not. This file is needed any time one runs the ray tracing module.

Why did it work for me a couple of months ago? I think the reason is that now I don't start the applications from the gammasim-tools directory but from somewhere else (which means the default paths are not set correctly).

Yes, that is probably the reason. We could use the full path as the default for the data dir in the config file. I will open an issue about it.
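As an illustration of that suggestion, one way the absolute default could be built (a sketch only, assuming the module sits one level below the repository root; not the actual simtools implementation):

from pathlib import Path

# Resolve 'data' relative to the repository, not the current working directory,
# so applications can be started from anywhere.
REPO_ROOT = Path(__file__).resolve().parents[1]
DEFAULT_DATA_LOCATION = REPO_ROOT / "data"

def get_data_location(cfg_value=None):
    # Prefer an explicit dataLocation from config.yml; otherwise fall back
    # to the absolute default.
    return Path(cfg_value).expanduser().resolve() if cfg_value else DEFAULT_DATA_LOCATION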

RaulRPrado commented 2 years ago

Note that the full error message here is very confusing:

[...]

It doesn't find a keyword in a config.yml, but most of the error message is about a database connection and ssh issues.

The error is as clear as we can make it. It is a bit confusing because of Python. The large block about the KeyError is inevitable, and the one line about the DB is there because we close the connection after the error. What we can do is make the line "ERROR::config(l84)::get::Config does not contain dataLocation" a bit clearer, as you suggested. I will create an issue for that and fix it next.

RaulRPrado commented 2 years ago

Is it working in the end, @GernotMaier ?

GernotMaier commented 2 years ago

Is it working in the end, @GernotMaier ?

Yes, but it took me two hours. I think this shows you that the setup is not ideal for users who are not very familiar with the system.

GernotMaier commented 2 years ago

Note that the full error message here is very confusing:

[...]

It doesn't find a keyword in a config.yml, but most of the error message is about a database connection and ssh issues.

The error is as clear as we can make it. It is a bit confusing because of Python. The large block about the KeyError is inevitable, and the one line about the DB is there because we close the connection after the error. What we can do is make the line "ERROR::config(l84)::get::Config does not contain dataLocation" a bit clearer, as you suggested. I will create an issue for that and fix it next.

Why can't the program exit in a clean way when a configuration file has not been found? This would be much clearer.

RaulRPrado commented 2 years ago

Note that the full error message here is very confusing:

[...]

It doesn't find a keyword in a config.yml, but most of the error message is about a database connection and ssh issues.

The error is as clear as we can make it. It is a bit confusing because of Python. The large block about the KeyError is inevitable, and the one line about the DB is there because we close the connection after the error. What we can do is make the line "ERROR::config(l84)::get::Config does not contain dataLocation" a bit clearer, as you suggested. I will create an issue for that and fix it next.

Why can't the program exit in a clean way when a configuration file has not been found? This would be much clearer.

I will try to get rid of this big block about the KeyError, but I'm not sure whether that is possible. To me this message looks like a pretty common Python error. And it is actually clearer than in many other packages, because we added a customized message about the config file, which is not hard to find. Other packages would leave the error handling to Python, which means there would only be a standard KeyError message.

orelgueta commented 2 years ago

And we can remove the SSH tunnel message if you find it distracting (make it a DEBUG level).

GernotMaier commented 2 years ago

It simply doesn't make sense to have this type of error and confuse the users. I strongly suggest having clear error messages, catching those errors early, and exiting without the Python traceback.

Error messages should be helpful. And here we don't really have an exceptional situation; it is a missing file, which we can handle and then exit cleanly.
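A sketch of the kind of early, clean exit suggested here (a hypothetical application-level wrapper; the entry point is a stand-in, not the real derive_mirror_rnda code):

import logging
import sys

logger = logging.getLogger(__name__)

def run_application():
    # Stand-in for the real application logic; here it only simulates
    # a missing config keyword.
    raise KeyError("Config keyword dataLocation not found in file config.yml")

def main():
    try:
        run_application()
    except KeyError as exc:
        # A missing config keyword or data file is a user setup problem,
        # not a programming error: report it clearly and exit without a traceback.
        logger.error("Configuration problem: %s", exc)
        sys.exit(1)

if __name__ == "__main__":
    main()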

GernotMaier commented 2 years ago

On the issue of the 'data' directory: I don't want to start a discussion about directory names, but I think it would have been clearer if it were not called 'data' but maybe 'configuration'?

GernotMaier commented 2 years ago

Seeing the full output of this application, I now get the following error at the end.

Is the connection to the DB lost when running an application that takes several minutes? This is running from home on my laptop.

INFO::ray_tracing(l227)::simulate::Simulating RayTracing for offAxis=0.0, mirror=170
INFO::simtel_runner(l173)::run::Running (1x) with command:/workdir/sim_telarray/sim_telarray/bin/sim_telarray -c /workdir/external/output/simtools-output/derive_mirror_rnda/model/CTA-North-MST-FlashCam-D-prod4_derive_mirror_rnda.cfg -I../cfg/CTA -C IMAGING_LIST=/workdir/external/output/simtools-output/derive_mirror_rnda/ray-tracing/photons-North-MST-FlashCam-D-d0.0-za20.0-off0.000_mirror170_derive_mirror_rnda.lis -C stars=/workdir/external/output/simtools-output/derive_mirror_rnda/ray-tracing/stars-North-MST-FlashCam-D-d0.0-za20.0-off0.000_mirror170_derive_mirror_rnda.lis -C altitude=2150.0 -C telescope_theta=20.0 -C star_photons=10000 -C telescope_phi=0 -C camera_transmission=1.0 -C nightsky_background=all:0. -C trigger_current_limit=1e10 -C telescope_random_angle=0 -C telescope_random_error=0 -C convergent_depth=0 -C maximum_telescopes=1 -C show=all -C camera_filter=none -C focus_offset=all:0. -C camera_config_file=single_pixel_camera.dat -C camera_pixels=1 -C trigger_pixels=1 -C camera_body_diameter=0 -C mirror_list=/workdir/external/output/simtools-output/derive_mirror_rnda/model/CTA-single-mirror-list-North-MST-FlashCam-D-prod4-mirror170_derive_mirror_rnda.dat -C focal_length=3214.0 -C dish_shape_length=1607.0 -C mirror_focal_length=1607.0 -C parabolic_dish=0 -C mirror_align_random_distance=0. -C mirror_align_random_vertical=0.,28.,0.,0. /workdir/sim_telarray/run9991.corsika.gz 2>&1 > /workdir/external/output/simtools-output/derive_mirror_rnda/ray-tracing/log-North-MST-FlashCam-D-d0.0-za20.0-off0.000_mirror170_derive_mirror_rnda.log 2>&1
INFO::ray_tracing(l227)::simulate::Simulating RayTracing for offAxis=0.0, mirror=171
client_loop: send disconnect: Broken pipe
Traceback (most recent call last):
  File "/conda/envs/gammasim-tools-dev/lib/python3.8/site-packages/pymongo/mongo_client.py", line 1514, in _retryable_read
    server = self._select_server(
  File "/conda/envs/gammasim-tools-dev/lib/python3.8/site-packages/pymongo/mongo_client.py", line 1346, in _select_server
    server = topology.select_server(server_selector)
  File "/conda/envs/gammasim-tools-dev/lib/python3.8/site-packages/pymongo/topology.py", line 244, in select_server
    return random.choice(self.select_servers(selector,
  File "/conda/envs/gammasim-tools-dev/lib/python3.8/site-packages/pymongo/topology.py", line 202, in select_servers
    server_descriptions = self._select_servers_loop(
  File "/conda/envs/gammasim-tools-dev/lib/python3.8/site-packages/pymongo/topology.py", line 218, in _select_servers_loop
    raise ServerSelectionTimeoutError(
pymongo.errors.ServerSelectionTimeoutError: localhost:27018: [Errno 111] Connection refused, Timeout: 30s, Topology Description: <TopologyDescription id: 6200d849ebe36888bd560be0, topology_type: Single, servers: [<ServerDescription ('localhost', 27018) server_type: Unknown, rtt: None, error=AutoReconnect('localhost:27018: [Errno 111] Connection refused')>]>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "gammasim-tools/applications/derive_mirror_rnda.py", line 311, in <module>
    meanD80, sigD80 = run(newRnda)
  File "gammasim-tools/applications/derive_mirror_rnda.py", line 241, in run
    ray.simulate(test=False, force=True)  # force has to be True, always
  File "/workdir/external/gammasim-tools/simtools/ray_tracing.py", line 232, in simulate
    simtel = SimtelRunnerRayTracing(
  File "/workdir/external/gammasim-tools/simtools/simtel/simtel_runner_ray_tracing.py", line 119, in __init__
    self._loadRequiredFiles()
  File "/workdir/external/gammasim-tools/simtools/simtel/simtel_runner_ray_tracing.py", line 154, in _loadRequiredFiles
    "# configFile = {}\n".format(self.telescopeModel.getConfigFile())
  File "/workdir/external/gammasim-tools/simtools/model/telescope_model.py", line 525, in getConfigFile
    self.exportConfigFile()
  File "/workdir/external/gammasim-tools/simtools/model/telescope_model.py", line 502, in exportConfigFile
    self.exportModelFiles()
  File "/workdir/external/gammasim-tools/simtools/model/telescope_model.py", line 496, in exportModelFiles
    db.exportModelFiles(parsFromDB, self._configFileDirectory)
  File "/workdir/external/gammasim-tools/simtools/db_handler.py", line 265, in exportModelFiles
    self._writeFileFromMongoToDisk(
  File "/workdir/external/gammasim-tools/simtools/db_handler.py", line 685, in _writeFileFromMongoToDisk
    fsOutput.download_to_stream_by_name(file.filename, outputFile)
  File "/conda/envs/gammasim-tools-dev/lib/python3.8/site-packages/gridfs/__init__.py", line 910, in download_to_stream_by_name
    for chunk in gout:
  File "/conda/envs/gammasim-tools-dev/lib/python3.8/site-packages/gridfs/grid_file.py", line 802, in next
    chunk = self.__chunk_iter.next()
  File "/conda/envs/gammasim-tools-dev/lib/python3.8/site-packages/gridfs/grid_file.py", line 755, in next
    chunk = self._next_with_retry()
  File "/conda/envs/gammasim-tools-dev/lib/python3.8/site-packages/gridfs/grid_file.py", line 747, in _next_with_retry
    return self._cursor.next()
  File "/conda/envs/gammasim-tools-dev/lib/python3.8/site-packages/pymongo/cursor.py", line 1238, in next
    if len(self.__data) or self._refresh():
  File "/conda/envs/gammasim-tools-dev/lib/python3.8/site-packages/pymongo/cursor.py", line 1155, in _refresh
    self.__send_message(q)
  File "/conda/envs/gammasim-tools-dev/lib/python3.8/site-packages/pymongo/cursor.py", line 1044, in __send_message
    response = client._run_operation(
  File "/conda/envs/gammasim-tools-dev/lib/python3.8/site-packages/pymongo/mongo_client.py", line 1424, in _run_operation
    return self._retryable_read(
  File "/conda/envs/gammasim-tools-dev/lib/python3.8/site-packages/pymongo/mongo_client.py", line 1531, in _retryable_read
    raise last_error
  File "/conda/envs/gammasim-tools-dev/lib/python3.8/site-packages/pymongo/mongo_client.py", line 1525, in _retryable_read
    return func(session, server, sock_info, secondary_ok)
  File "/conda/envs/gammasim-tools-dev/lib/python3.8/site-packages/pymongo/mongo_client.py", line 1420, in _cmd
    return server.run_operation(
  File "/conda/envs/gammasim-tools-dev/lib/python3.8/site-packages/pymongo/server.py", line 114, in run_operation
    reply = sock_info.receive_message(request_id)
  File "/conda/envs/gammasim-tools-dev/lib/python3.8/site-packages/pymongo/pool.py", line 753, in receive_message
    self._raise_connection_failure(error)
  File "/conda/envs/gammasim-tools-dev/lib/python3.8/site-packages/pymongo/pool.py", line 751, in receive_message
    return receive_message(self, request_id, self.max_message_size)
  File "/conda/envs/gammasim-tools-dev/lib/python3.8/site-packages/pymongo/network.py", line 215, in receive_message
    data = _receive_data_on_socket(sock_info, length - 16, deadline)
  File "/conda/envs/gammasim-tools-dev/lib/python3.8/site-packages/pymongo/network.py", line 293, in _receive_data_on_socket
    raise AutoReconnect("connection closed")
pymongo.errors.AutoReconnect: connection closed
INFO::db_handler(l187)::_closeSSHTunnel::Closing SSH tunnel(s)
orelgueta commented 2 years ago

Interesting, I have never seen this issue. I wonder if we are closing the connection after the first mirror. Raul mentioned he found a bug, though, so let's wait and see whether that is the issue or an actual DB connection problem.

RaulRPrado commented 2 years ago

It is working normally for me, after I fixed a bug.

GernotMaier commented 2 years ago

Let me check this on the WGS. At home on my laptop, it shows this error after a while. The main reason is probably a timeout of the SSH connection.

orelgueta commented 2 years ago

If that's the case, I will have to modify the tunnel or DB connection parameters so that it doesn't happen. Let me know and I will then try to recreate it.

GernotMaier commented 2 years ago

The error message is

pymongo.errors.AutoReconnect: connection closed

Looking around, it seems worth trying to keep the connection alive; see, e.g., here.
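A sketch of the kind of keep-alive and timeout settings that could be tried (assuming the tunnel uses sshtunnel and the client pymongo; the gateway host and local port are taken from the log above, everything else is illustrative):

from pymongo import MongoClient
from sshtunnel import SSHTunnelForwarder

# Send SSH keep-alive packets every 30 s so the tunnel does not go idle
# during a long simulation run.
tunnel = SSHTunnelForwarder(
    "warp.zeuthen.desy.de",                    # SSH gateway (from the log)
    remote_bind_address=("localhost", 27017),  # assumed MongoDB port on the remote side
    local_bind_address=("localhost", 27018),   # local port seen in the error message
    set_keepalive=30.0,
)
tunnel.start()

# Generous client-side timeouts so slow reads over the tunnel do not abort.
db_client = MongoClient(
    "localhost",
    27018,
    serverSelectionTimeoutMS=60000,
    socketTimeoutMS=600000,
)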

orelgueta commented 2 years ago

It could also be an issue of trying to open a second connection when running the second mirror (instead of using the original connection). I will try to reproduce and debug.

orelgueta commented 2 years ago

I couldn't recreate it from either the WGS or my laptop using the usual dev container. Running on the WGS took about 10 minutes; running on the laptop took significantly longer (1.5-2 hours). In neither case did I get this issue.

However, following your error message, I do see some strange behaviour in the code. If I understand correctly (@RaulRPrado can correct me if I am wrong), every time we run the ray tracing on one mirror, we read the model from the DB and export a config file. I am not sure why we do that for every mirror. I assume the entire config file does not change; instead we change just one or two parameters in it. Why don't we export the file once and then, if necessary, edit the exported one prior to each run? Alternatively, could we not simply reuse the model we already read once and have in memory (i.e., as a dict)?

I am pretty sure that reading from the DB for every run makes the execution much slower, especially when running from home. Should we open an issue to improve this behaviour?
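To illustrate the caching idea, a minimal sketch (a hypothetical function, not the actual TelescopeModel/db_handler API; the returned values are just taken from the log above):

import functools

@functools.lru_cache(maxsize=None)
def read_model_parameters(site, telescope, model_version):
    # Stand-in for the real DB read; with the cache, the per-mirror calls
    # reuse the in-memory result instead of going through the SSH tunnel again.
    print(f"Reading model for {site}/{telescope} ({model_version}) from the DB")
    return {"mirror_list": "mirror_MST_focal_lengths.dat", "focal_length": 3214.0}

# The first call hits the (simulated) DB; subsequent per-mirror calls are cached.
for mirror in range(3):
    pars = read_model_parameters("North", "MST-FlashCam-D", "prod4")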

GernotMaier commented 2 years ago

Note that on your laptop, the Kerberos ticketing is probably working (it stopped working on mine, probably after an update, and I didn't have time to fix it). Maybe this is the difference; I would suggest not digging further.

Anything improving efficiency is good in my opinion.

orelgueta commented 2 years ago

Actually no, the token did not work. I think that the update to Monterey modified the way macOS saves the krb token. I spent 15 minutes trying to fix it and gave up. So our setups were equivalent.

However, I can imagine that keeping a tunnel open for 2 hours could cause an issue. We can both try to extend the timeout and to make the code more efficient. The former requires help from the DB manager. The latter I will try to figure out in the next few days (unless @RaulRPrado has a reason why it cannot be done).

RaulRPrado commented 2 years ago

I couldn't recreate it from either the WGS or my laptop using the usual dev container. Running on the WGS took about 10 minutes; running on the laptop took significantly longer (1.5-2 hours). In neither case did I get this issue.

However, following your error message, I do see some strange behaviour in the code. If I understand correctly (@RaulRPrado can correct me if I am wrong), every time we run the ray tracing on one mirror, we read the model from the DB and export a config file. I am not sure why we do that for every mirror. I assume the entire config file does not change; instead we change just one or two parameters in it. Why don't we export the file once and then, if necessary, edit the exported one prior to each run? Alternatively, could we not simply reuse the model we already read once and have in memory (i.e., as a dict)?

I am pretty sure that reading from the DB for every run makes the execution much slower, especially when running from home. Should we open an issue to improve this behaviour?

That should be easy to fix. I will work on it.

RaulRPrado commented 2 years ago

I will create another issue for that, so you can close this one.

orelgueta commented 2 years ago

OK, I am closing this. The underlying issue of prolonging the DB timeout might still be there, but if we fix this issue we might never encounter it again, and increasing the timeout unnecessarily maybe isn't a good idea.