choderalab / modelforge

Infrastructure to implement and train NNPs
https://modelforge.readthedocs.io/en/latest/
MIT License

Removing self-energies is costly #120

Closed chrisiacovella closed 1 month ago

chrisiacovella commented 1 month ago

As the title suggests, the process of removing the self-energies from a dataset can be prohibitively expensive with the current implementation as the size of the dataset grows. E.g., QM9 is small and only takes about 90 seconds, whereas ANI2X takes about two and a half hours.

Since most of the datasets are composed of many conformers that share identical self-energies, the simplest way to speed this up is to add a dictionary and avoid recalculating the offset when we have already seen a molecule with the same atomic numbers.
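A minimal sketch of that memoization, assuming each record exposes its atomic numbers and self_energies is a dict mapping atomic number to self-energy (the names here are illustrative, not the actual modelforge API):

# cache of previously computed offsets, keyed by molecular composition
seen_offsets = {}

def self_energy_offset(atomic_numbers, self_energies):
    # conformers of the same molecule share a composition, so key on the
    # sorted tuple of atomic numbers and reuse any previously computed sum
    key = tuple(sorted(int(Z) for Z in atomic_numbers))
    if key not in seen_offsets:
        seen_offsets[key] = sum(self_energies[Z] for Z in key)
    return seen_offsets[key]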

Implementing this takes ANI2X down to about 20 minutes, which I think is perfectly manageable. It would still be good to avoid redoing this calculation at all by writing the result out. (Instead of writing to local_cache_dir, I think it would be better to use a "local_working_dir", where local_cache_dir holds the raw datasets that we may share among calculations and local_working_dir holds the pieces specific to the calculation itself.) There is already a _to_cache function in DataModule.prepare_data that writes out the .pt file with the self-energies removed; however, it uses a fixed file name for all datasets (so we could run into some issues) and saves to the execution directory, not to any sort of specified directory (working or cache).

If we skip calling prepare_data and just call setup, it looks like it will use the previously written .pt file, but I think we might need to lightly revamp the flow of those functions to automatically check whether the .pt file exists (with options to overwrite, and with appropriate metadata like we do for the .npz file). Does it make sense to add another layer in the data loading step that checks for a .pt file (and its metadata) before looking for the .npz file (then the .hdf5, then the .hdf5.gz)?

I'm going to push a PR with the simple change to the self-energy removal (using a dictionary) to speed things up; we can discuss and address the other aspects separately.

wiederm commented 1 month ago

Thanks for raising this! For a dataset like SPICE, we should aim for a < 5 min preprocessing cost if possible (assuming a multicore CPU with 4 cores). I think caching is a great idea --- do you think we can use the cache decorator/method from functools?
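For reference, a hedged sketch of what the functools variant could look like; because functools.lru_cache keys on the function arguments, the atomic numbers would need to be passed as a hashable tuple (the self_energies values below are purely illustrative):

from functools import lru_cache

# illustrative self-energy table (atomic number -> energy in hartree)
self_energies = {1: -0.500273, 6: -37.846772, 7: -54.583861, 8: -75.064579}

@lru_cache(maxsize=None)
def cached_self_energy_offset(atomic_numbers_key):
    # atomic_numbers_key must be hashable, e.g. tuple(sorted(atomic_numbers))
    return sum(self_energies[Z] for Z in atomic_numbers_key)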

wiederm commented 1 month ago

To bring this up here in the issue, this is the function that we want to speed up:

for i in range(len(dataset)):
    atomic_numbers = list(dataset[i]["atomic_numbers"])
    E = dataset[i]["E"]
    # subtract the self-energy of every atom in the molecule
    for Z in atomic_numbers:
        E -= self_energies[int(Z)]
    # write the shifted energy back to the dataset
    dataset[i] = {"E": E}
wiederm commented 1 month ago

We have made this same calculation much faster in the postprocessing pipeline, where we created a PyTorch tensor with the atomic self-energies, used Z to index this array, and performed a scatter_sum operation. We can do this here with numpy. This will get rid of the for loop over the atomic numbers and should further improve speed.
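A minimal numpy sketch of that idea, assuming self_energies is a dict mapping atomic number to self-energy and the dataset behaves like the snippet above (an index array of length 101 covers atomic numbers up to 100):

import numpy as np

# lookup table indexed by atomic number
self_energy_table = np.zeros(101)
for Z, energy in self_energies.items():
    self_energy_table[int(Z)] = energy

for i in range(len(dataset)):
    atomic_numbers = np.asarray(dataset[i]["atomic_numbers"], dtype=int)
    # fancy indexing + sum replaces the per-atom Python loop
    dataset[i] = {"E": dataset[i]["E"] - self_energy_table[atomic_numbers].sum()}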

We should also think about rewriting this using the multiprocessing library. Something like:

from multiprocessing import Pool

# prep would be the per-record preprocessing function (e.g., the self-energy removal)
with Pool(4) as p:
    p.map(prep, dataset)

There is a bit of complexity here: ideally, we want a shared memory object for which we can call __setitem__ from different processes. An alternative might be to prepare a batch of energies in each of the processes and, after N calculations, return to the main process and set the new values. I will profile the timing of the individual parts of this function to understand this a bit better.

Extending on the shared memory object: we can share PyTorch tensors using the PyTorch multiprocessing drop-in replacement for multiprocessing. This puts tensors in a shared memory region and we can directly access the energy values. I am happy to take a stab at this!
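A rough sketch of that approach, assuming the energies and precomputed per-record offsets already live in flat tensors (all names, shapes, and values below are placeholders):

import torch
import torch.multiprocessing as mp

def subtract_self_energies(energies, offsets, start, stop):
    # energies is in shared memory, so this in-place write is visible to the parent
    energies[start:stop] -= offsets[start:stop]

if __name__ == "__main__":
    energies = torch.randn(1000, dtype=torch.float64)  # stand-in for dataset energies
    offsets = torch.randn(1000, dtype=torch.float64)   # stand-in for per-record self-energy sums
    energies.share_memory_()
    offsets.share_memory_()

    n_workers = 4
    chunk = (len(energies) + n_workers - 1) // n_workers
    processes = []
    for rank in range(n_workers):
        start = rank * chunk
        stop = min(start + chunk, len(energies))
        p = mp.Process(target=subtract_self_energies, args=(energies, offsets, start, stop))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()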

wiederm commented 1 month ago

I took a look at the changes you made in the code in the associated PR #121.

I'd suggest changing the dictionary lookup (the for loop is O(N) and each lookup is O(1), but the overhead might still be more than indexing into a numpy array): instead of passing the self_energies dictionary, we can pass an index array:

import numpy as np

# convert the self_energies dictionary to an index array keyed by atomic number
index_array = np.zeros(101)
for Z, energy in self_energies.items():
    index_array[int(Z)] = energy

and then we can sum over the atomic energies like this:

atomic_numbers = dataset[i]["atomic_numbers"]
index_array[atomic_numbers].sum()

This will remove multiple casting operations and the for loop.

I can't modify the PR directly since it's a PR from a fork, but I am happy to outline this more in detail if things are unclear.

wiederm commented 1 month ago

If we skip calling prepare_data and just call setup, it looks like it will use the previously written .pt file, but I think we might need to lightly revamp the flow of those functions to automatically check whether the .pt file exists (with options to overwrite, and with appropriate metadata like we do for the .npz file). Does it make sense to add another layer in the data loading step that checks for a .pt file (and its metadata) before looking for the .npz file (then the .hdf5, then the .hdf5.gz)?

Yes, I think we want to skip prepare_data if we have already prepared the .pt file. The best way to achieve this is for prepare_data to check whether the file exists and, if it does, skip the additional preprocessing steps. I would put the control flow for skipping in prepare_data, because the Trainer calls both prepare_data and setup internally.
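A hedged sketch of that control flow; the helper names and arguments are placeholders rather than the actual DataModule API:

import os

def prepare_data(processed_pt_path, run_full_pipeline, metadata_matches):
    # skip the expensive preprocessing when a matching .pt file is already cached
    if os.path.exists(processed_pt_path) and metadata_matches(processed_pt_path):
        print(f"Found cached {processed_pt_path}; skipping preprocessing.")
        return processed_pt_path
    # otherwise run the usual pipeline (.hdf5.gz -> .hdf5 -> .npz -> .pt) and cache the result
    run_full_pipeline(processed_pt_path)
    return processed_pt_path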

chrisiacovella commented 1 month ago

I took a look at the changes you made in the code in the associated PR #121.

I'd suggest changing the dictionary lookup (the for loop is O(N) and each lookup is O(1), but the overhead might still be more than indexing into a numpy array): instead of passing the self_energies dictionary, we can pass an index array:

import numpy as np

# convert the self_energies dictionary to an index array keyed by atomic number
index_array = np.zeros(101)
for Z, energy in self_energies.items():
    index_array[int(Z)] = energy

and then we can sum over the atomic energies like this:

atomic_numbers = dataset[i]["atomic_numbers"]
index_array[atomic_numbers].sum()

This will remove multiple casting operations and the for loop.

I can't modify the PR directly since it's a PR from a fork, but I am happy to outline this more in detail if things are unclear.

This is a great suggestion for further improvement. My first goal in PR #121 was to just get this bit of code fast enough that I could reasonably submit a training job, as the 3 hours it was taking was not feasible!

Extending on the shared memory object: we can share PyTorch tensors using the PyTorch [multiprocessing](https://pytorch.org/docs/stable/notes/multiprocessing.html) drop-in replacement for multiprocessing. This puts tensors in a shared memory region and we can directly access the energy values. I am happy to take a stab at this!

I would prefer not to use the multiprocessing module if we can avoid it. As I said, I think better handling of the cached file would also be very beneficial from a usability standpoint.

chrisiacovella commented 1 month ago

The PR that was just merged uses the ase_vector_for_indexing tensor that was already defined in the self-energies class (rather than locally creating an array), keeping everything in PyTorch. I tried various tweaks, e.g., padding and batching the conformers so we'd do a single computation on a 2D tensor rather than a bunch of 1D computations, but the extra time associated with padding seemed to offset any improvements (or at least any consistent improvements).

Data caching (either using a dictionary or functools) also provided no improvement, likely because the actual calculation for an individual record is so fast when done as a vector/tensor operation.

@wiederm mentioned that the __getitem__ routine may be a bottleneck. I did some quick profiling as well, and did a quick rewrite that directly accesses the atomic_numbers from the dataset rather than using indexing, which returns a ton of properties we do not need here.

This takes the time for QM9 down from about 1.5 minutes to 15 seconds. SPICE2 goes from about 6-7 minutes to <3 minutes. ANI2X goes from 20 minutes down to <13.
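The rewrite described above is roughly the following; properties_of_interest is the internal store mentioned further down in this thread, and the exact keys are assumptions:

def shifted_energy(dataset, idx, self_energy_table):
    # direct access: read only the two arrays we need instead of going through
    # __getitem__, which assembles (and copies) every property of the record
    atomic_numbers = dataset.properties_of_interest["atomic_numbers"][idx]
    energy = dataset.properties_of_interest["E"][idx]
    return energy - self_energy_table[atomic_numbers].sum()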

I'm going to make a new PR with this small change.

It might be worthwhile to add in some additional accessor methods to get specific properties with less overhead.

chrisiacovella commented 1 month ago

The time was further reduced when also accessing the energy directly (I missed that in the first round of edits above). SPICE2 is now 54 seconds and ANI2X is 4:36 in my last test, which is a big improvement and should make this much more usable. We likely need to work through the code and switch away from using __getitem__ (keeping it as a useful helper, but not in performance-critical paths). I am going to set up some tests to compare the performance of direct access vs. stripped-down getter functions (e.g., ones that only access a single property).

chrisiacovella commented 1 month ago

OK, a few notes/thoughts:

1- The main reason we get such substantial speed improvements from directly indexing into the dataset is that __getitem__ ends up accessing and returning a lot of information we ultimately do not use in this context (we only need the energy and atomic numbers, not coordinates, forces, charges, etc., if present).

2- Indexing directly into the array also avoids making copies of the data stored in the dataset. The __getitem__ function makes a copy in all cases using .clone().detach(). If I make a function called get_energy that basically mimics what is done for the energy in __getitem__ (i.e., returns self.properties_of_interest["e"][idx].clone().detach().to(torch.float64)), there is a slowdown of about 10% compared to direct access, associated with the copying.

So I think there are 2 things to consider:

wiederm commented 1 month ago

I think the main problem (and I have addressed this in part in PR #126) is that .clone().detach() is called for each element of the tensor. The overhead is greatly diminished if it is called on the whole tensor, or if it is omitted entirely (I don't see a reason why we want to clone().detach()).

We are also casting individual elements of tensors to dtypes; we should do this for the tensor itself (again, partly addressed in PR #126).
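Roughly, the difference being described (illustrative shapes and values):

import torch

energies = torch.randn(100_000)

# per-element: one clone/detach/cast per record inside a loop (slow)
slow = [energies[i].clone().detach().to(torch.float64) for i in range(len(energies))]

# per-tensor: a single cast over the whole tensor (clone/detach only if actually needed)
fast = energies.to(torch.float64)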

chrisiacovella commented 1 month ago

I think we've fully addressed this. There may be additional ways to improve this and other operations, but it is clearly no longer a bottleneck.