IndexError: list index out of range from get_restart_pdb

Describe the bug: I am reporting this one first. The list has 12 items, but the requested index goes past its size (e.g., 12, 13, 14, ...), which raises the error:

Traceback (most recent call last):
  File "/jet/home/sanjrani/DeepDriveMD-pipeline/deepdrivemd/sim/openmm/run_openmm.py", line 215, in <module>
    run_simulation(cfg)
  File "/jet/home/sanjrani/DeepDriveMD-pipeline/deepdrivemd/sim/openmm/run_openmm.py", line 167, in run_simulation
    ctx = SimulationContext(cfg)
  File "/jet/home/sanjrani/DeepDriveMD-pipeline/deepdrivemd/sim/openmm/run_openmm.py", line 28, in __init__
    self._init_workdir()
  File "/jet/home/sanjrani/DeepDriveMD-pipeline/deepdrivemd/sim/openmm/run_openmm.py", line 67, in _init_workdir
    self._pdb_file = self._get_pdb_file()
  File "/jet/home/sanjrani/DeepDriveMD-pipeline/deepdrivemd/sim/openmm/run_openmm.py", line 80, in _get_pdb_file
    outlier = self.api.get_restart_pdb(self.cfg.task_idx, self.cfg.stage_idx - 1)
  File "/jet/home/sanjrani/anaconda3/envs/conda-entk/lib/python3.7/site-packages/deepdrivemd/data/api.py", line 229, in get_restart_pdb
    print(data[index])
IndexError: list index out of range

To Reproduce Steps to reproduce the behavior:

HPC Platform: PSC Bridges2
Link to YAML configuration file: https://github.com/DeepDriveMD/DeepDriveMD-pipeline/blob/feature/psc_bridges2/experiment/deepdrivemd_bridges.yaml
Commands run: ...

Failed command

"/jet/home/sanjrani/DeepDriveMD-pipeline/deepdrivemd/sim/openmm/run_openmm.py" 
"-c" "/jet/home/sanjrani/test_sim_5/molecular_dynamics_runs/stage0001/task0014/stage0001_task0014.yaml"

Expected behavior: Check the data size (if index < len(data)) before indexing into the list, or emit a warning message telling the user to adjust the incorrect parameters in the YAML configuration.
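
The requested check could look roughly like the following. This is a minimal sketch, not the actual body of get_restart_pdb: the helper name select_restart_entry and the wording of the message are hypothetical, and data stands in for whatever list the real API indexes.

```python
from typing import Any, Sequence


def select_restart_entry(data: Sequence[Any], index: int) -> Any:
    """Return data[index], or fail with an actionable message.

    Hypothetical helper for illustration; not part of the DeepDriveMD API.
    """
    if not 0 <= index < len(data):
        # Point the user at the likely configuration mismatch instead of
        # surfacing a bare "list index out of range".
        raise IndexError(
            f"Restart index {index} requested, but only {len(data)} entries "
            "are available. Check the outlier-related parameters in the "
            "YAML configuration against num_tasks."
        )
    return data[index]
```

With 12 entries and a task index of 14 (as in the failed command for task0014), this would raise an IndexError that names both numbers rather than leaving the user to infer them.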

Additional context: The 1st iteration finished successfully (num_tasks: 16), and this error occurred in the middle of the 2nd iteration.

Do num_extrinsic_outliers, n_most_recent_h5_files, and k_random_old_h5_files need to match num_tasks from the simulations? I ask because num_tasks is 16, whereas the other three are set to 12.

Yes, num_extrinsic_outliers should be greater than or equal to num_tasks. num_intrinsic_outliers outliers are selected using the AI-based approach; from those, num_extrinsic_outliers are taken as the "best" outliers, which are used to restart new simulations.

For this particular bug, try setting these parameters:

num_extrinsic_outliers: 16
n_most_recent_h5_files: 16
k_random_old_h5_files: 16

Hopefully this explanation helps to clarify other aspects:

Each MD simulation runs for simulation_length_ns ns and reports a frame every report_interval_ps ps, so its output is simulation_length_ns ns * (1000 ps/ns) / (report_interval_ps ps) frames. For example, 1 ns * (1000 ps/ns) / 1 ps = 1000 output frames per MD sim. Those are the frames that get written to HDF5.

n_traj_frames in the Agent config should be set to 1000 in this case since it is the number of frames in each h5 file.
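
As a quick way to work out n_traj_frames from the two simulation settings, the arithmetic above can be sketched as follows. The function name frames_per_simulation is hypothetical and only mirrors the formula in this thread; it does not read the DeepDriveMD configuration.

```python
PS_PER_NS = 1000  # picoseconds per nanosecond


def frames_per_simulation(simulation_length_ns: float, report_interval_ps: float) -> int:
    """Number of frames one MD simulation reports (hypothetical helper)."""
    if report_interval_ps <= 0:
        raise ValueError("report_interval_ps must be positive")
    # Convert the run length to ps, then divide by the reporting interval.
    return round(simulation_length_ns * PS_PER_NS / report_interval_ps)


# The case discussed above: 1 ns reported every 1 ps.
n_traj_frames = frames_per_simulation(simulation_length_ns=1, report_interval_ps=1)
```

Here n_traj_frames evaluates to 1000, matching the value recommended for the Agent config.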

n_most_recent_h5_files should most likely be set to the number of MD sims (num_tasks), since it is responsible for gathering the most recent simulation data for inference and outlier detection.

If n_most_recent_h5_files is set to less than num_tasks, the agent will be suboptimal since it may miss outlying states. If it is set greater than the number of MD sims, it may raise an index error.

num_intrinsic_outliers should be greater than or equal to num_tasks. It sets the number of outliers selected in a purely unsupervised way, and it should be at least the number of MD sims since these outliers are used to create the restart points for the next round of simulations.

num_extrinsic_outliers comes into play when extrinsic_score is active. It further prunes the intrinsic outliers using a biophysical scoring function. If extrinsic_score is nan, it simply takes the first num_extrinsic_outliers elements of the intrinsic outlier array. It should be greater than or equal to num_tasks.

k_random_old_h5_files sets the number of old data files to look at, which is helpful for outlier detection and for maintaining the previously sampled data distribution. I don't think this parameter can cause an index error though.
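
The rules of thumb in this reply can be collected into a small pre-flight check. The sketch below is an assumption about how such a check might look, not existing DeepDriveMD code: check_outlier_settings is a hypothetical function that takes the values as plain integers instead of parsing the YAML file.

```python
from typing import List


def check_outlier_settings(
    num_tasks: int,
    num_intrinsic_outliers: int,
    num_extrinsic_outliers: int,
    n_most_recent_h5_files: int,
) -> List[str]:
    """Return human-readable problems; an empty list means none were found.

    Hypothetical helper encoding the guidance from this thread.
    """
    problems = []
    if num_intrinsic_outliers < num_tasks:
        problems.append("num_intrinsic_outliers should be >= num_tasks")
    if num_extrinsic_outliers < num_tasks:
        problems.append("num_extrinsic_outliers should be >= num_tasks")
    if n_most_recent_h5_files < num_tasks:
        problems.append(
            "n_most_recent_h5_files < num_tasks: the agent may miss outlying states"
        )
    elif n_most_recent_h5_files > num_tasks:
        problems.append(
            "n_most_recent_h5_files > num_tasks: this may cause an index error"
        )
    return problems


# Values from this issue; num_intrinsic_outliers was not reported, so 16 is
# an illustrative placeholder.
for problem in check_outlier_settings(16, 16, 12, 12):
    print(problem)
```

k_random_old_h5_files is left out on purpose, since the reply above does not expect it to cause an index error.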

We can add this explanation to the documentation since it is a really common configuration error.

IndexError: list index out of range from get_restart_pdb #34