AllenInstitute / AllenSDK

code for reading and processing Allen Institute for Brain Science data
https://allensdk.readthedocs.io/en/latest/

Provide utility to connect a new CloudCache to an old cache_dir #2231

Open danielsf opened 3 years ago

danielsf commented 3 years ago

An existing cache_dir created with data v0.3.0 fails to instantiate with AllenSDK v2.12.1, which requires a minimum data version of v1.0.0. We should provide a message or parameter that lets the user bootstrap out of this situation into an up-to-date cache. The current workaround is to start a whole new cache_dir.

The failure is caused by metadata files that were added to the dataset in v1.0.0 but did not exist in v0.3.0. On instantiation, the CloudCache checks that all of the expected metadata files are present and fails if they are not. The correct thing to do is probably to just download the missing metadata.
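The check and the proposed fix can be sketched roughly as follows; `validate_metadata` is a hypothetical stand-in, not the real CloudCache API:

```python
# Minimal sketch of the metadata check that fails, plus the suggested fix:
# return the missing files so the caller can download them instead of
# raising. This function is illustrative; it is not part of AllenSDK.

def validate_metadata(expected, present, strict=True):
    """Return metadata files missing from the cache.

    strict=True mimics the current behavior (raise on any missing file);
    strict=False lets the caller download the difference instead.
    """
    missing = set(expected) - set(present)
    if missing and strict:
        raise RuntimeError(
            "expected S3CloudCache object to have metadata file names: "
            f"{set(expected)} but it has {set(present)}")
    return missing

expected = {"ophys_experiment_table", "behavior_session_table",
            "ophys_cells_table", "ophys_session_table"}
present = expected - {"ophys_cells_table"}
print(validate_metadata(expected, present, strict=False))
# → {'ophys_cells_table'}
```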

matchings commented 3 years ago

This has been a problem for many internal and external users. From my perspective, the minimum fix would be to provide some documentation and/or information in the error message that tells users why this is happening and how to resolve it, e.g. "your SDK version is incompatible with this dataset version - update AllenSDK to x.x.x or use a new cache_dir to pull the latest version of the dataset"
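A message like that could be rendered from the versions involved; a hypothetical sketch (none of these names exist in AllenSDK):

```python
# Hypothetical helper that builds the friendlier error text suggested
# above from the SDK and dataset versions involved.

def incompatibility_message(sdk_version, dataset_version, required_sdk):
    return (
        f"your SDK version ({sdk_version}) is incompatible with dataset "
        f"version {dataset_version} - update AllenSDK to {required_sdk} "
        "or use a new cache_dir to pull the latest version of the dataset"
    )

print(incompatibility_message("2.11.0", "v1.0.0", "2.13.1"))
```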

danielsf commented 2 years ago

In order to reproduce the bug, run

from allensdk.brain_observatory.behavior.behavior_project_cache import (
    VisualBehaviorOphysProjectCache)

import os

cache_dir = '/allen/aibs/informatics/danielsf/sdk_cache_210809'
assert os.path.isdir(cache_dir)

cache = VisualBehaviorOphysProjectCache.from_s3_cache(cache_dir=cache_dir)

you should get

    raise RuntimeError("expected S3CloudCache object to have "
RuntimeError: expected S3CloudCache object to have metadata file names: {'ophys_experiment_table', 'behavior_session_table', 'ophys_cells_table', 'ophys_session_table'} but it has {'ophys_experiment_table', 'behavior_session_table', 'ophys_session_table'}

Note: this reproduces the error because that specific cache_dir points to a cache that predates the ophys_cells_table.

This failure arises because you are pointing a new version of the SDK at an old project cache. There should be a way to do that safely.

danielsf commented 2 years ago

Alternatively, if you check out v2.11.0 and attempt to create a new cache with

from allensdk.brain_observatory.behavior.behavior_project_cache import (
    VisualBehaviorOphysProjectCache)

import os

cache_dir = '/allen/aibs/informatics/danielsf/sdk_cache_211021'
assert not os.path.isdir(cache_dir)

cache = VisualBehaviorOphysProjectCache.from_s3_cache(cache_dir=cache_dir)

you get

RuntimeError: expected S3CloudCache object to have metadata file names: {'behavior_session_table', 'ophys_experiment_table', 'ophys_session_table'} but it has {'behavior_session_table', 'ophys_experiment_table', 'ophys_cells_table', 'ophys_session_table'}

This is less of a problem since it is a failure that arises from using an out-of-date version of the SDK. We probably don't want to fix this behavior. We should probably add more information to the error message, though.

danielsf commented 2 years ago

The salient issue appears to be what is in _manifest_last_used.txt and whether or not that manifest includes all of the expected metadata files.

danielsf commented 2 years ago

@matchings I could use some feedback.

Ultimately, the problem is that the VisualBehaviorOphysProjectCache includes in it a hard-coded list of metadata files it expects to exist. Whenever you load a manifest (essentially, load up a version of the dataset), it checks that the manifest you are loading has all of the expected metadata files. If not, it throws an exception.

This behavior probably made sense until we added a metadata file in August. Now any manifest pre-dating that release will raise the exception because it lacks the ophys_cells_table.

It is fairly easy to add some logic that, in the event that this exception occurs, will just warn the user and then download and load the most up-to-date data release manifest.

So, if I were running SDK version 12.3.1 and I asked for dataset version 0.3.0, the SDK would say

The manifest you tried to load
visual-behavior-ophys_project_manifest_v0.3.0.json
 is out of date with this version of the SDK. We will try to load the most up-to-date manifest
successfully loaded
visual-behavior-ophys_project_manifest_v1.0.1.json

and load dataset v1.0.1 for me.
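That fallback could look something like this sketch; the cache interface used here (`load_manifest` / `latest_manifest`) is a stand-in, not the exact API:

```python
# Minimal sketch of option (a): catch the incompatibility error, warn,
# and retry with the newest manifest.
import warnings

def load_with_fallback(cache, manifest_name):
    try:
        return cache.load_manifest(manifest_name)
    except RuntimeError:
        latest = cache.latest_manifest()
        warnings.warn(
            f"The manifest you tried to load ({manifest_name}) is out of "
            "date with this version of the SDK. Loading the most "
            f"up-to-date manifest ({latest}) instead.")
        return cache.load_manifest(latest)
```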

Is that the correct behavior?

I'll admit, I'm not 100% clear what the justification for validating the existence of all expected metadata files was (I think I was sitting on a jury when this was implemented). An alternative approach to this problem would be to make the cache more forgiving in the case of missing metadata files and add some warnings in, for instance, get_ophys_cells_table that alert the user that "your currently loaded dataset does not have that table." If we go that route, I would want to ask the rest of Pika to weigh in, since there might be a good reason for validating metadata contents that I am not thinking of.

What are your thoughts?

In summary, the options are

a) just load the most up-to-date dataset version automatically whenever the user asks for something that is too old for this SDK version

b) make the SDK more tolerant of loading old datasets
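Option (b) might look like this sketch, where missing metadata only matters when the corresponding accessor is called; the class and method names are hypothetical, not AllenSDK code:

```python
# Sketch of option (b): tolerate missing metadata at load time and warn
# only when the user requests a table the loaded dataset lacks.
import warnings

class ForgivingCache:
    def __init__(self, available_tables):
        # mapping of table name -> table contents for the loaded dataset
        self._tables = dict(available_tables)

    def get_table(self, name):
        if name not in self._tables:
            warnings.warn(
                f"your currently loaded dataset does not have {name}")
            return None
        return self._tables[name]
```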

matchings commented 2 years ago

Providing my thoughts on the two problems described in this issue:

problem 1 - attempting to load an old dataset (0.3.0) with a newer SDK version

In this case, I think your suggestion to provide an error message / warning that says "the manifest you are trying to load is out of date with this version of the SDK" makes sense. But I think there should be two options for what happens after that: one is to download the most up-to-date dataset that is compatible with the newer SDK version, and the other is to tell the user what SDK version is compatible with the older dataset.

I can imagine a situation where a user has been doing analysis with 0.3.0, updates the SDK, gets this error, but still wants to continue working with 0.3.0. I suppose they could just revert to their previous SDK version, but not everyone knows how to do that properly. Is it possible to know what SDK version belongs with a given dataset version and convey that information to the user?

matchings commented 2 years ago


problem 2 - attempting to load a new dataset with an older SDK version

This is actually the more common issue that we have encountered. I generated a cache to be used for all platform paper analysis, with some specific SDK version, then other scientists tried to load data from that cache (presumably with a different version of the SDK) and got the error "expected S3CloudCache object to have metadata file names" and were very confused about what to do.

So for this case, I think the fix is just to provide a more specific warning / error message that tells the user why things aren't working and what to do about it. Something like "your SDK version (2.11.0) is not compatible with this dataset version (v1.0.0), please update the SDK to version 2.13.1".

Does that seem reasonable?

danielsf commented 2 years ago

@matchings (sorry this conversation has been so laggy; there aren't enough cooks in Pika's kitchen at the moment)

The problem 2 you identify is pretty easy to fix. Each dataset manifest includes the version of the SDK that was used to create it. This is, in effect, a minimum compatible version that we can tell people to load.

Problem 1 is trickier. The dataset only has the minimum compatible version, i.e. v1.0.1 of the dataset lists SDK version 2.12.2 as its compatible SDK version. SDK version 2.13.1 is also compatible with dataset v1.0.1, but there's no way to know that from just the dataset. So, in your example where a user is working with v0.3.0 of the data, updates their SDK, and finds themselves getting errors, the error message will say "Please install version 2.11.0 of the SDK", when versions 2.11.1, 2.11.2, and 2.11.3 are all valid versions, as well. We don't really have a great way to alert the user at run time that 2.11.3 is valid as well. The user would just have to install each version of the SDK and try to load the dataset.
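Treating the recorded SDK version as a floor is straightforward to sketch (where the manifest's recorded version comes from is described above; the comparison helper itself is illustrative):

```python
# Sketch of the minimum-compatible-version check: any SDK at or above the
# version recorded in the manifest passes; anything older does not.

def sdk_is_compatible(sdk_version, manifest_min_sdk):
    """True if the running SDK is at least the version that built the manifest."""
    def as_tuple(v):
        return tuple(int(x) for x in v.split("."))
    return as_tuple(sdk_version) >= as_tuple(manifest_min_sdk)

print(sdk_is_compatible("2.13.1", "2.12.2"))  # newer SDK is still compatible → True
print(sdk_is_compatible("2.11.0", "2.12.2"))  # too old → False
```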

Is that good enough?

On a related note: are we sure we don't want to aggressively push people towards updated versions of the dataset? So far, whenever we've released a new version of the dataset, it's because we found something wrong with how the SDK generated the data. We fix the SDK, fix the dataset, and release a new dataset. If you, as a user, really want the benefits of a newer version of the SDK, then you really want the benefits of a newer version of the dataset. I'm not sure you can have one without the other and still be assured that what you are computing is self-consistent.

danielsf commented 2 years ago

In case this conversation goes stale, here is the branch where I started trying to implement a solution. I got as far as "have the SDK automatically download the latest manifest if what it has access to is incompatible".

https://github.com/AllenInstitute/AllenSDK/tree/ticket/2231/cache/version/mismatch

(this is not actually the solution we want, based on comments above)