BorgwardtLab / proteinshake

Protein structure datasets for machine learning.
https://proteinshake.ai
BSD 3-Clause "New" or "Revised" License

tm score caching, word size adjust, rename distance to similarity #133

Closed cgoliver closed 1 year ago

cgoliver commented 1 year ago
import tempfile
from proteinshake.datasets import RCSBDataset

n_jobs = 4
# Use a throwaway root directory so nothing persists between runs.
with tempfile.TemporaryDirectory() as tmp:
    ds = RCSBDataset(
        root=tmp,
        use_precomputed=False,
        n_jobs=n_jobs,
        cluster_structure=True,
        cluster_sequence=True,
        similarity_threshold_structure=[0.9, 0.8, 0.7, 0.6],
        similarity_threshold_sequence=[0.9, 0.8, 0.7, 0.6],
    )

cgoliver commented 1 year ago

Weird, I am not getting the warning. I'm also not sure I understand the collision: it seems to point to the same file, wrappers.py, at lines -1 and 14.

timkucera commented 1 year ago

I figured out that it only happens with n_jobs > 1. According to the docs, the cache is indexed by the function's name. It might be that when the function is copied to the other threads, the cache is created multiple times (which would be pretty ironic, since it's a joblib function...). The docs also state that there is no collision when this happens in the same session, but a warning is issued. I don't quite understand what the behaviour is in this case, or whether the data from all threads is properly recovered.
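To make the "indexed by function name" concern concrete, here is a toy, standard-library-only sketch of a disk cache keyed on the function name alone. It is not joblib's implementation, and `NameKeyedCache` and `make_scorer` are hypothetical names invented for this illustration. It only shows why two distinct functions sharing a name can end up reading each other's entries when nothing else distinguishes them.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


class NameKeyedCache:
    """Toy disk cache (hypothetical): one folder per function name,
    one file per hash of the positional arguments."""

    def __init__(self, root):
        self.root = Path(root)

    def cache(self, func):
        # The folder depends on the function *name* only, which is the
        # source of the collision demonstrated below.
        func_dir = self.root / func.__name__
        func_dir.mkdir(parents=True, exist_ok=True)

        def wrapper(*args):
            key = hashlib.sha256(pickle.dumps(args)).hexdigest()
            path = func_dir / f"{key}.pkl"
            if path.exists():
                # Cache hit: return whatever was stored under this name/args.
                return pickle.loads(path.read_bytes())
            result = func(*args)
            path.write_bytes(pickle.dumps(result))
            return result

        return wrapper


def make_scorer(offset):
    # Returns distinct functions that all share the name "score".
    def score(x):
        return x + offset
    return score


with tempfile.TemporaryDirectory() as tmp:
    cache = NameKeyedCache(tmp)
    first = cache.cache(make_scorer(1))
    second = cache.cache(make_scorer(100))
    first_result = first(1)    # computed and stored under "score"
    second_result = second(1)  # same name and args: the stored value is reused
```

In this toy, `second(1)` silently returns the value computed by `first`. According to the docs mentioned above, joblib issues a warning in the same-session case rather than colliding silently, so the sketch illustrates the failure mode being worried about, not joblib's actual behaviour.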

cgoliver commented 1 year ago

I see. Well, I only define the function with that name once, so I don't see where the collision would happen, and I'm still not sure how to fix it. I will try using the decorator instead; I wasn't aware of that.

Have a blessed Christmas.

timkucera commented 1 year ago

I'm removing the cache feature for now because of undefined behaviour, to be revisited later. I will merge and start the release.

timkucera commented 1 year ago

PS: I tried the decorator and hit the same problem.