InfluenceFunctional / MXtalTools

BSD 3-Clause "New" or "Revised" License

crash error in new RDF calculation #72

Closed InfluenceFunctional closed 1 year ago

InfluenceFunctional commented 1 year ago

```
Traceback (most recent call last):
  File "/scratch/mk8347/mcrygan/main.py", line 31, in
    predictor.train_crystal_models()
  File "/scratch/mk8347/mcrygan/crystal_modeller.py", line 557, in train_crystal_models
    raise e  # will simply raise error if training on CPU
  File "/scratch/mk8347/mcrygan/crystal_modeller.py", line 497, in train_crystal_models
    self.run_epoch(epoch_type='train', data_loader=train_loader,
  File "/scratch/mk8347/mcrygan/crystal_modeller.py", line 573, in run_epoch
    self.gan_epoch(data_loader, update_gradients, iteration_override)
  File "/scratch/mk8347/mcrygan/crystal_modeller.py", line 636, in gan_epoch
    self.discriminator_step(data, i, update_gradients, skip_step=skip_discriminator_step)
  File "/scratch/mk8347/mcrygan/crystal_modeller.py", line 810, in discriminator_step
    = self.get_discriminator_output(data, i)
  File "/scratch/mk8347/mcrygan/crystal_modeller.py", line 964, in get_discriminator_output
    crystal_multiplicity=fake_supercell_data.mult)
  File "/scratch/mk8347/mcrygan/models/crystal_rdf.py", line 178, in new_crystal_rdf
    dists_per_hist, sorted_dists, rdfs_dict = get_elementwise_dists(crystaldata, edges, dists, device, num_graphs, edge_in_crystal_number)
  File "/scratch/mk8347/mcrygan/models/crystal_rdf.py", line 265, in get_elementwise_dists
    sorted_dists = dists.repeat(num_graphs * num_pairs, 1)[bool_list]
RuntimeError: nonzero is not supported for tensors with more than INT_MAX elements, file a support request
```

InfluenceFunctional commented 1 year ago

Absolutely no idea what's going on here, but it's consistent across several runs after a long enough time. Could be an issue with distorted or random samples having a particularly bad property - extremely large density, maybe?

InfluenceFunctional commented 1 year ago

Alternatively, extremely diffuse (so that dists comes out empty)? Though this shouldn't be possible given the adaptive convolution cluster builder.

InfluenceFunctional commented 1 year ago

A little googling suggests this is a general PyTorch limitation that shows up with large batches.
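For reference: boolean indexing calls nonzero() on the mask under the hood, and torch.nonzero refuses to operate on tensors with more than INT_MAX (2^31 - 1) elements. A rough sketch of how the repeat in get_elementwise_dists can cross that limit - the edge and pair counts below are illustrative guesses, not measured values from the crashing runs:

```python
import torch

INT_MAX = 2**31 - 1  # limit for torch.nonzero / boolean indexing on a single tensor

# illustrative numbers only - not measured from the crashing runs
num_edges = 1_000_000   # intermolecular edges in one large batch (guess)
num_graphs = 179        # batch size the runs died at
num_pairs = 15          # number of element-pair types (guess)

# dists.repeat(num_graphs * num_pairs, 1) yields a tensor with this many elements;
# indexing it with bool_list triggers nonzero() on a mask of the same size
repeated_elements = num_edges * num_graphs * num_pairs
print(f"{repeated_elements:.3e} elements, over INT_MAX: {repeated_elements > INT_MAX}")
```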

InfluenceFunctional commented 1 year ago

Qualitatively, runs seem to be crashing in the 100-150 range, but that's not dispositive.

InfluenceFunctional commented 1 year ago

```
slurm-39492915_7.out:RuntimeError: nonzero is not supported for tensors with more than INT_MAX elements, file a support request
slurm-39493369_7.out:RuntimeError: nonzero is not supported for tensors with more than INT_MAX elements, file a support request
slurm-39511538_7.out:RuntimeError: nonzero is not supported for tensors with more than INT_MAX elements, file a support request
```

All three crashed at batch_size=179, though they were very similar runs in general. Rerunning with a capped batch size.

InfluenceFunctional commented 1 year ago

Add this error to our batch sizing except statement.
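A minimal sketch of what that could look like, assuming a try/except loop around the epoch that already backs off on CUDA OOM; run_epoch, train_loader, and rebuild_loader are placeholder names, not the actual MXtalTools API:

```python
import torch

batch_size = 200
while batch_size > 1:
    try:
        run_epoch(train_loader)  # placeholder for the real training call
        break
    except RuntimeError as e:
        oom = 'out of memory' in str(e)
        int_max_overflow = 'INT_MAX' in str(e)  # the nonzero overflow from crystal_rdf
        if oom or int_max_overflow:
            torch.cuda.empty_cache()
            batch_size = max(1, int(batch_size * 0.8))  # shrink and retry
            train_loader = rebuild_loader(train_loader, batch_size)  # placeholder helper
        else:
            raise
```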