AllenInstitute / AllenSDK

code for reading and processing Allen Institute for Brain Science data
https://allensdk.readthedocs.io/en/latest/
Other
340 stars 150 forks source link

Loading experiments table requires downloading all pkl files #2532

Open mattjdavis opened 1 year ago

mattjdavis commented 1 year ago

mTRAIN dependencies were removed from the SDK as discussed in the mTRAIN working group. Now behavior session information is imported directly from PKL files. Consequently, when loading all experiments in the experiments table call, all the pkls are downloaded to extract the required information, which is a very slow process. This has disrupted the science teams workflow, and I'm posting this issue to open up discussion on the best way to address this. We have been working from a commit before this change was merged, but would like to get back sync the main branch (https://github.com/AllenInstitute/AllenSDK/commit/2557135586b85bd9dcf87e2fd55d41bb4f3ce7cc). Should we follow a different procedure for grabbing experiments? Or should the PKL file information be extracted and stored on the backend so we can load experiment tables quickly? Tagging @aamster since I know you implemented this change, curious to hear y'all thoughts. Thanks!

code to replicate:

cache = VisualBehaviorOphysProjectCache.from_lims()
experiments_table = cache.get_ophys_experiment_table(passed_only=False)
aamster commented 1 year ago

Please pass a higher number for n_workers to get_ophys_experiment_table. os.cpu_count() uses all available cores. Then it loads in reasonable time.

mattjdavis commented 1 year ago

thanks for the tip, not sure if I reached reasonable time yet, On my 12 core machine I'm looking at ~ 2hours to load the table

image

aamster commented 1 year ago

I see. We didn't really consider the use case of interactively pulling this table, especially not passing passed_only=False. It was more intended for generating the project metadata table of released sessions, which is generated off-line.

The only place we would store the values we read from the pkl file in an automated way is the LIMS database, which is already being done, but the process that does this isn't working for some reason (we don't own that code), which is why we needed to jump through hoops to read directly from the pkl file.

  1. Can you run this once and save the results to csv or do you expect the data you are pulling to change?
  2. We are pulling session type from the pkl file since it was incorrect/missing in the LIMS database. Do you need session type? I'm assuming yes. If not, then I can add a flag to not pull session type.
  3. If you are ok with pulling session type from LIMS (which is missing/incorrect), then can you or can I help you write a sql query to pull the data you need instead of using this code which reads from pkl file?
matchings commented 1 year ago

@aamster We use this functionality to get the metadata tables from lims very very often as we are primarily working with recently acquired, unpublished data, thus we do expect the data we are pulling to change. This is why i requested the default to be passed_only=False a while back (https://github.com/AllenInstitute/AllenSDK/issues/2243), because loading everything that is available in lims is the default use case for most internal users who use .from_lims() for recent data.

We definitely need session_type, that's one of the most critical pieces of information, so flagging to not pull session type isn't ideal.

I see that there is a separate issue (https://github.com/AllenInstitute/AllenSDK/issues/2534) regarding correcting the session_type in lims so that it can be pulled from there and not the pkl file. This seems like the preferred solution, but I understand that will take a while because it depends on the lims team.

Any other ideas for a workaround in the mean time? Maybe a flag that allows us to load the table using session_type from lims (knowing that it may be incorrect) that we can then get rid of once the info in lims is correct? I would rather have access to the session_type even if it may sometimes be incorrect than not have it at all. There are other places I can double check if it is right or not if i am suspicious of the values, but its at least a kind of "best guess" that we can work from (as opposed to no information about session_type at all).

aamster commented 1 year ago

@matchings I forgot that for behavior sessions without ophys data, the LIMS database doesn't store session type, which is why we had to get this value from mtrain, which is also sometimes missing/incorrect.

I want to avoid cluttering the codebase with this temporary logic to pull from the LIMS DB/mtrain, but I can provide code that does this that you can run until the LIMS database is correctly populated. Will this work?

matchings commented 1 year ago

Yes, we can work with separate code that gets the tables until lims content is fixed.

Will the lims work include getting the session_type from the pkl file for behavior only sessions and storing it in lims?

aamster commented 1 year ago

Here you go:

from typing import Optional, List

import pandas as pd

from allensdk.brain_observatory.behavior.behavior_project_cache.project_apis\
    .data_io import \
    BehaviorProjectLimsApi
from allensdk.brain_observatory.behavior.behavior_project_cache.tables\
    .experiments_table import \
    ExperimentsTable
from allensdk.brain_observatory.behavior.behavior_project_cache.tables\
    .sessions_table import \
    SessionsTable
from allensdk.internal.api import PostgresQueryMixin
from allensdk.internal.api.queries.utils import build_in_list_selector_query, \
    _sanitize_uuid_list

def session_stage_from_foraging_id(
        mtrain_engine: PostgresQueryMixin,
        foraging_ids: Optional[List[str]] = None) -> pd.DataFrame:
    """
    Get DataFrame mapping behavior_sessions.id to session_type
    by querying mtrain for foraging_ids
    """
    # Select fewer rows if possible via behavior_session_id
    if foraging_ids is not None:
        foraging_ids = [f"'{fid}'" for fid in foraging_ids]
    # Otherwise just get the full table from mtrain
    else:
        foraging_ids = None

    foraging_ids_query = build_in_list_selector_query(
        "bs.id", foraging_ids)

    query = f"""
        SELECT
            stages.name as session_type,
            bs.id AS foraging_id
        FROM behavior_sessions bs
        JOIN stages ON stages.id = bs.state_id
        {foraging_ids_query};
        """
    return mtrain_engine.select(query)

def get_ophys_experiment_table():
    api = BehaviorProjectLimsApi.default()
    experiments = api.get_ophys_experiment_table()\
        .reset_index()
    behavior_sessions = api._get_behavior_summary_table()
    foraging_ids = behavior_sessions['foraging_id']
    foraging_ids = foraging_ids[~foraging_ids.isna()]
    foraging_ids = _sanitize_uuid_list(foraging_ids)
    stimulus_names = session_stage_from_foraging_id(
        mtrain_engine=api.mtrain_engine,
        foraging_ids=foraging_ids)
    behavior_sessions['session_type'] = \
        behavior_sessions['foraging_id'].map(
            stimulus_names.set_index('foraging_id')['session_type'])
    behavior_sessions = SessionsTable(
        df=behavior_sessions, fetch_api=api)

    experiments = behavior_sessions.table.merge(
        experiments, on='behavior_session_id',
        suffixes=('_behavior', '_ophys'))
    experiments = experiments.set_index('ophys_experiment_id')
    experiments = ExperimentsTable(df=experiments)

    return experiments.table

if __name__ == '__main__':
    experiments = get_ophys_experiment_table()
    pass

@matchings yes, if the LIMS team agrees to take on that work.

mattjdavis commented 1 year ago

Note for users changed the line experiments = ExperimentsTable(df=experiments) to experiments = ExperimentsTable(df=experiments, passed_only=False) to get all experiments.

matchings commented 1 year ago

Thanks @mattjdavis i was just going to comment that the code was only returning passed experiments. I changed the line and now it works.

mattjdavis commented 1 year ago

Separate issue, but the table is not returning any experiments after 9/02, there should be about a dozen more since, according to mouse seeks. Somehow these are not in LIMS?

matchings commented 1 year ago

Better to use the code for most applications because things in lims are always changing, but i also saved the output of that function to a csv here: "\allen\programs\mindscope\workgroups\learning\ophys\learning_project_cache\ophys_experiments_table_220914.csv"

matchings commented 1 year ago

@mattjdavis that is weird...should not be the case. I see an experiment from 9/13 (yesterday) in mouse-seeks that has data in lims and is in LearningmFISHDevelopmentproject code and has passed QC:

\allen\programs\mindscope\production\learning\prod0\specimen_1194433365\ophys_session_1210540434\ophys_experiment_1210712502

It should definitely be included in the table.

mattjdavis commented 1 year ago

@matchings also just discovered that the dox mice 623975 623972

are not in this table, but are in mouse-seeks/LIMS. they were recorded before 9/02. So we don't have clarity on which experiments made it to this table. which is critical since we are using it for the re motion correction

Sorry for clogging the issue, seems like I should make a separate one, not sure which repo. This seems like a LIMS thing?

aamster commented 1 year ago

@mattjdavis at least for 1210712502 the issue is that the experiment is missing from the ophys_experiments_visual_behavior_experiment_containers table in LIMS

mattjdavis commented 1 year ago

thanks adam, that was clarifying. We figured out why those don't have container ids so its not a problem anymore

matchings commented 1 year ago

@aamster just for the record, i think @mattjdavis meant that the results we were getting from the code you provided were no longer problematic, not that the experiments table creation requiring downloading all pkl files (the original issue) was not a problem anymore.

mattjdavis commented 1 year ago

@aamster The code posted above is throwing this error (when on latest master of SDK, previously I tested on a much earlier commit). No rush though as I a workaround

image

aamster commented 1 year ago

@mattjdavis could you please try rc/2.16.1 branch? Sorry about that.

mattjdavis commented 1 year ago

Thanks for the suggestion, it works now. However on this branch I am only getting "passed" experiments, even when experiments = ExperimentsTable(df=experiments, passed_only=False) is set in the code you posted above

aamster commented 1 year ago

Please use api = BehaviorProjectLimsApi.default(passed_only=False) instead of api = BehaviorProjectLimsApi.default()

The argument in ExperimentsTable doesn't actually do anything in rc/2.16.1 and needs to be removed

mattjdavis commented 1 year ago

works perfect now, thanks!

matchings commented 1 year ago

@aamster any update from the platform team on whether they will be able to add the session_type from the pkl file to LIMS? is there anything I can do to help move that along?

aamster commented 1 year ago

@matchings it has been assigned to a dev on the LIMS team and is currently being worked on. I'll let you know if there are any questions.

aamster commented 1 year ago

@matchings we look up the session_type using the following lookup

data['items']['behavior']['params']['stage']

however, not all pkl files for all behavior sessions have a stage key in the params dict. An example is:

 with open('/allen/programs/braintv/production/neuralcoding/prod38/specimen_639389525/behavior_session_689312186/180424170457_352471.pkl', 'rb') as f:
    data = pickle.loads(f, encoding='latin1')
>>> 'stage' in data['items']['behavior']['params']
>>> False
  1. What is different about this pkl file that it lacks a stage key?
  2. How do we identify the different types of pkl files?
  3. How can we get session_type for pkl files without the stage key?
matchings commented 1 year ago

As the team who owns the code that produces the pkl files, I think that MPE would be best suited to answer these questions, @rhytnen in particular.

aamster commented 1 year ago

@rhytnen ^^

matchings commented 1 year ago

@aamster I've recently learned that @corbennett and @seanmcculloch have created or are aware of tools / logic to identify different types of pkl files, in case this is still relevant

mattjdavis commented 1 year ago

has this been resolved?

aamster commented 1 year ago

Hi @mattjdavis I reached out to MPE about this issue several months ago and didn't get a response since the platform dev got hung up on trying to backfill session type in sessions in which the pkl file has an unexpected schema or lacks a session type value. This work should be completed by the platform team. I messaged the platform team to figure out how to proceed.

matchings commented 1 year ago

@aamster Thanks for the update. I will follow up with the platform team also. Was there a Service Now ticket for this task? If so, could I be tagged on it so I can be in the loop?

aamster commented 1 year ago

@matchings there is only a jira issue: https://alleninstitute-brainscience.atlassian.net/browse/DT-4139

matchings commented 1 year ago

@aamster great, thanks for sharing this, and for updating the issue.