InfluenceFunctional closed 1 year ago
absolutely no idea what's going on here, but it's consistent between several runs after a long enough time. Could be an issue with distorted or random samples having a particularly bad property - extremely large density maybe?
alternatively extremely diffuse? (dists is empty) though this shouldn't be possible given the adaptive convolution cluster builder
a little googling suggests it's a general issue for large batches
qualitatively seems like runs are crashing in the range 100-150, but not dispositive
slurm-39492915_7.out:RuntimeError: nonzero is not supported for tensors with more than INT_MAX elements, file a support request
slurm-39493369_7.out:RuntimeError: nonzero is not supported for tensors with more than INT_MAX elements, file a support request
slurm-39511538_7.out:RuntimeError: nonzero is not supported for tensors with more than INT_MAX elements, file a support request
all crashed at batch_size=179, though they were very similar runs in general - rerun with capped batch size
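for reference, a rough sanity check on why batch size would matter here. Boolean-mask indexing calls nonzero on the mask, so the crash means the mask built in get_elementwise_dists has more than INT_MAX (2**31 - 1) elements. The sketch below assumes dists is 1D, so that dists.repeat(num_graphs * num_pairs, 1) has num_graphs * num_pairs * len(dists) elements and bool_list has the same shape. The helper names and the numbers are hypothetical, not taken from the crashed runs.

```python
INT_MAX = 2**31 - 1  # 2147483647, the limit named in the error message


def mask_numel(num_graphs: int, num_pairs: int, num_dists: int) -> int:
    """Element count of dists.repeat(num_graphs * num_pairs, 1), assuming 1D dists."""
    return num_graphs * num_pairs * num_dists


def exceeds_int_max(num_graphs: int, num_pairs: int, num_dists: int) -> bool:
    """True if the repeated tensor (and a mask of the same shape) is too large for nonzero."""
    return mask_numel(num_graphs, num_pairs, num_dists) > INT_MAX


def max_graphs_under_limit(num_pairs: int, num_dists: int) -> int:
    """Largest num_graphs that stays at or below INT_MAX for fixed num_pairs and num_dists."""
    return INT_MAX // (num_pairs * num_dists)


# Hypothetical sizes: 15 element pairs and one million edge distances per batch.
print(exceeds_int_max(179, 15, 1_000_000))   # True: 2,685,000,000 elements
print(max_graphs_under_limit(15, 1_000_000))  # 143
```

in practice len(dists) probably grows with the batch too, so the element count would grow faster than linearly in batch size and the real threshold is likely lower than a fixed-size estimate suggests.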
add this to our batch sizing except statement
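a minimal sketch of what that could look like, not the actual code in crystal_modeller.py. It assumes the existing except statement shrinks the batch on CUDA out-of-memory errors and re-raises everything else; run_with_batch_backoff, is_batch_size_error, the shrink factor, and fake_step are all hypothetical names for illustration. The only change is that the INT_MAX message is treated the same way as an OOM.

```python
# Substrings of the RuntimeError messages we treat as "batch too large".
OOM_MESSAGE = "out of memory"
INT_MAX_MESSAGE = "nonzero is not supported for tensors with more than INT_MAX elements"


def is_batch_size_error(err: RuntimeError) -> bool:
    """True if the error looks like it was caused by an oversized batch."""
    msg = str(err)
    return OOM_MESSAGE in msg or INT_MAX_MESSAGE in msg


def run_with_batch_backoff(step, batch_size, shrink=0.9, min_batch_size=1):
    """Call step(batch_size), shrinking the batch after each batch-size error.

    Returns (result, batch_size_that_worked). Any other error, or a failure
    at min_batch_size, is re-raised unchanged.
    """
    while True:
        try:
            return step(batch_size), batch_size
        except RuntimeError as err:
            if not is_batch_size_error(err) or batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, int(batch_size * shrink))


# Stand-in for a training step that fails above some batch size (hypothetical limit).
def fake_step(batch_size):
    if batch_size > 150:
        raise RuntimeError(INT_MAX_MESSAGE + ", file a support request")
    return "ok"


result, final_batch_size = run_with_batch_backoff(fake_step, 179)
print(result, final_batch_size)
```

matching on the message string is brittle if the wording changes between PyTorch versions, but it keeps unrelated RuntimeErrors from being silently swallowed.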
Traceback (most recent call last):
  File "/scratch/mk8347/mcrygan/main.py", line 31, in <module>
    predictor.train_crystal_models()
  File "/scratch/mk8347/mcrygan/crystal_modeller.py", line 557, in train_crystal_models
    raise e # will simply raise error if training on CPU
  File "/scratch/mk8347/mcrygan/crystal_modeller.py", line 497, in train_crystal_models
    self.run_epoch(epoch_type='train', data_loader=train_loader,
  File "/scratch/mk8347/mcrygan/crystal_modeller.py", line 573, in run_epoch
    self.gan_epoch(data_loader, update_gradients, iteration_override)
  File "/scratch/mk8347/mcrygan/crystal_modeller.py", line 636, in gan_epoch
    self.discriminator_step(data, i, update_gradients, skip_step=skip_discriminator_step)
  File "/scratch/mk8347/mcrygan/crystal_modeller.py", line 810, in discriminator_step
    = self.get_discriminator_output(data, i)
  File "/scratch/mk8347/mcrygan/crystal_modeller.py", line 964, in get_discriminator_output
    crystal_multiplicity=fake_supercell_data.mult)
  File "/scratch/mk8347/mcrygan/models/crystal_rdf.py", line 178, in new_crystal_rdf
    dists_per_hist, sorted_dists, rdfs_dict = get_elementwise_dists(crystaldata, edges, dists, device, num_graphs, edge_in_crystal_number)
  File "/scratch/mk8347/mcrygan/models/crystal_rdf.py", line 265, in get_elementwise_dists
    sorted_dists = dists.repeat(num_graphs * num_pairs, 1)[bool_list]
RuntimeError: nonzero is not supported for tensors with more than INT_MAX elements, file a support request