aim2dat / aim2dat

Repository of the aim2dat python package.
https://aim2dat.github.io/aim2dat/
GNU Lesser General Public License v2.1
3 stars 2 forks source link

Calculating `f-fingerprints` in advance seems to be slower #70

Open t-reents opened 3 weeks ago

t-reents commented 3 weeks ago

I'm working on implementing a parallelized version of the comparison methods and started with the f-fingerprint on a local branch: https://github.com/t-reents/aim2dat/commit/1760619dcecca4203dead17032cf8f5fbb295c5c

The changes generally work. However, I realized that the following part is somehow slowing down the comparison:

https://github.com/aim2dat/aim2dat/blob/1952d240c726266c916a2f2383288a007ac79ff0/aim2dat/strct/structure_operations.py#L433-L435

As it should speed up the execution (all fingerprints are calculated in advance and could be reused, especially when parallelizing), I also implemented it in the new version for the comparison. Thereby, I noticed that it seems to slow things down, also in case of find_duplicates_via_ffingerprint.

You can simply reproduce this behaviour by commenting out the lines shown above and test the find_duplicates_via_ffingerprint with and without. Important to note, this can be done on a local copy of main and is not related to my changes. It would actually be good to see whether you are able to reproduce it or whether I did something stupid. So far, I couldn't fully understand why this behavior is observed.

hdsassnick commented 3 weeks ago

Ok, that is a bit weird. I actually tested this feature when implementing it.. Do you have store_calculated_properties set to True on the Structures? It may also be related to RAM or the number of structures?

I can do some testing next week..

t-reents commented 3 weeks ago

Indeed, I was also confused. Yes, store_calculated_properties is True, as it's also the default.

It may also be related to RAM or the number of structures?

I think I had around 300-400 structures in that example (I'd need to check again)

hdsassnick commented 2 weeks ago

ok, I can actually reproduce the issue. In case of n_proc=1, there should be hardly a difference. Since the structure comparison method is taking the pre-calculated property in any case.

For n_proc>1 the problem should be that an object altered by one processor cannot be shared with the others. That is why pre-calculation should be quicker as all processes access the pre-calculated properties. Maybe, getting a deepcopy of extras is taking more time than recalculating it.. Also, due to the formula check from #61 less structures are actually compared.

In any case, if pre-calculation is not giving an advantage we can omit it in the redesign.

t-reents commented 2 weeks ago

Yes, makes sense