AllenInstitute / AllenSDK

code for reading and processing Allen Institute for Brain Science data
https://allensdk.readthedocs.io/en/latest/
Other
340 stars 149 forks source link

experiment_table in from_lims version of VB project cache appears to be restricted to published data only #2243

Open matchings opened 2 years ago

matchings commented 2 years ago

AllenSDK 2.12.4

In the past, loading the ophys_experiment_tableusing the from_limsmethod of VisualBehaviorOphysProjectCachewould return all available data in lims. Currently, it appears to be limited to the released dataset (which has 1941 experiments and 704 sessions). In contrast, the ophys_session_tableloaded using from_limsmethod includes many more datasets, including ones from other non-published project codes.

image

The desired default behavior of the from_limsmethod should be to return everything available in lims, including failed unpublished data. Filtering to the released data only should only happen if data_release_date=['03-25-2021', '08-12-2021'] is provided as input to from_lims().

matchings commented 2 years ago

Now, on AllenSDK 2.12.4, I am getting even weirder results using the.from_lims() method. It is returning fewer experiments than what is in the release. In addition, i noticed that there is an input passed_only, so I tried setting it to False (it should be False by default), and I still get fewer experiments back than what is in the release, which should not be the case.

image

Something about the functionality of VisualBehaviorOPhysProjectCachehas clearly changed significantly and this is a major problem for us - if we cannot identify the data we need, we cant do analysis.

matchings commented 2 years ago

@aamster i wonder if you might know something about this since you have been doing lots of refactoring. What is the expected behavior of this bit of code? What should it return? Everything that is in lims? Only certain project codes? does it depend on experiment_workflow_state?

cache = VisualBehaviorOphysProjectCache.from_lims() experiments_table = cache.get_ophys_experiment_table()

matchings commented 2 years ago

@danielsf tagging you also because i see your name on some old commits related to adding the passed_onlyFlag, which I suspect may be related to this issue

aamster commented 2 years ago

@matchings this does not appear to be an SDK issue. The records for the containers in the LIMS DB seems to have been deleted (by some other process that our team doesn't manage) for the 1941 - 543 = 1398 missing experiments. This issue is being investigated.

matchings commented 2 years ago

that's pretty terrifying...

matchings commented 2 years ago

@wbwakeman could you speak to this a bit? how would it be possible for container records in lims to be deleted? can this be recovered?

wbwakeman commented 2 years ago

I think I have fixed the issue that @aamster mentioned. Will still need to follow up with Sean about mouse-seeks since that system reverts the fixes that I have done periodically.

I'm not sure that addresses the problem originally raised in this issue however.

It sure would be nice if these issues include all of the code needed to reproduce the issue.

matchings commented 2 years ago

Apologies for the lack of code to reproduce the issue. Here is a code block that, on allensdk 2.13.0, gives me 1941 experiments, all of which are container_workflow_state=='published' and experiment_workflow_state=='passed'.

What i expect to get back (and used to get back before ~2 seeks ago) is many thousands of experiments, including those where container_workflow_state can be something other than 'published', and values of experiment_workflow_state includes 'failed'.

from allensdk.brain_observatory.behavior.behavior_project_cache import VisualBehaviorOphysProjectCache

cache = VisualBehaviorOphysProjectCache.from_lims()
experiments_table = cache.get_ophys_experiment_table()
len(experiments_table)
print(experiments_table.container_workflow_state.unique())
print(experiments_table.experiment_workflow_state.unique())
aamster commented 2 years ago

@matchings It looks like the above issue is that the default value of passed_only is True instead of False. We can make that change.

The current logic is for passed_only to mean passed and published as you can see here: https://github.com/AllenInstitute/AllenSDK/blob/master/allensdk/brain_observatory/behavior/behavior_project_cache/tables/experiments_table.py#L68

Is this logic correct or do you want passed_only to mean passed but not necessarily published?

matchings commented 2 years ago

@aamster ah, right, i forgot about the passed_only flag.

I am not sure I see much utility of having an option for passed_only as it is currently configured. First off, it is only one or two lines of code to filter for passed and/or published experiments, so internal users can easily do that themselves. But more importantly, as you point out, the current implementation of passed_only is to give back experiments that are both passed and published, which mean different things and is a bit misleading. A user may want published data only (which are by definition also passed), or they may want unpublished experiments where experiment_workflow_state='passed', or they may want unpublished experiments where both experiment_workflow_state='passed', and container_workflow_state='passed', or various other combinations of criteria.

What makes most sense to me is to have a flag for published_only, and to set it to False by default. That way the two options are: 1) give me the same dataset that is on AWS, or 2) give me everything in lims and i will do my own custom filtering.

aamster commented 2 years ago

Technically we already support those workflows.

cache_lims = VisualBehaviorOphysProjectCache.from_lims(data_release_date=['03-25-2021', '08-12-2021'])
experiments_table = cache_lims.get_ophys_experiment_table()

returns only experiments on S3

cache_lims = VisualBehaviorOphysProjectCache.from_lims()
experiments_table = cache_lims.get_ophys_experiment_table(
        passed_only=False)

will return all experiments, letting the user do their own filtering.

Are you still requesting changes?

matchings commented 2 years ago

I would prefer if passed_only would default to False, and only filter on experiment_workflow_state=='passed' and not also for container_workflow_state='published', because the published use case is accomplished by the first example you gave above, and i would not intuitively expect something that is called passed_only to also include a filter on published.

But I understand that you all have many other priorities, so we can work around this in the short term, now that I understand what is going on. In the long run, I think making these changes would be useful, but we are unblocked for now.

danielsf commented 2 years ago

FWIW passed_only was implemented to address this issue

https://github.com/AllenInstitute/AllenSDK/issues/2178

Whatever we do, we need to make sure we don't accidentally introduce failed containers into the data that gets released in NWB files.

matchings commented 2 years ago

ah yes, that was a fun issue.

definitely agree about being careful with release data. If the passed_only being set to True by default is necessary for that, then it does change my calculus. I guess i didn't think of that argument as being linked to filtering for data releases right away.

danielsf commented 2 years ago

I back-channeled Marina. She is satisfied that

experiments_table = cache.get_ophys_experiment_table(passed_only=False)

does the right thing. The big outstanding question appears to be whether passed_only should default to False or True. Our internal users probably would prefer default=False. Those of us whose job it is to curate data releases would probably prefer default=True. Not sure how much demand there is for being able to get at the released data from_lims (as opposed to from_nwb) without having to set passed_only by hand (i.e. is there a third community using from_lims who aren't power users but also aren't Pika).

wbwakeman commented 2 years ago

I don't think there is a third community to consider.

Setting the default to "False" poses no risk to releasing "unpassed" data, correct? The work that it creates is that Pika needs to add a passed_only = True arg somewhere?

Based on those issues, I am okay changing the default to 'False'

matchings commented 2 years ago

@danielsf For Visual Behavior project, some of our analyses require getting both passed and failed experiments using from_lims, but this is done by power users who can handle passing passed_only=False.

For phase 4 projects, accessing data using from_lims is a requirement because nothing is published yet. I have already started using cache.get_ophys_experiment_table(passed_only=False)to identify and access phase 4 data, such as the for the LearningmFISHDevelopment project code.

Ideally we would be using a non-project specific cache to get data for new projects, but the VisualBehaviorOphysProjectCache already works to identify data in lims, so I have been using it.

Newer users for phase 4 projects may be confused by passed_onlydefaulting to True, but i think that is probably beyond the scope of this specific issue, which is about Visual Behavior project users.