AllenInstitute / npc_sessions

Tools for accessing and packaging data from behavior and ephys sessions from the Mindscope Neuropixels team, in the cloud.

Loading newscale data fails with lazy loading #118

Closed rcpeene closed 3 months ago

rcpeene commented 3 months ago

In OpenScope's metadata/upload code, loading newscale coordinates fails with lazy loading. Perhaps due to a versioning error, I'm not quite sure. For my purposes, simply removing the lazy loading sufficed as a temporary solution.

It occurs around here (line 126 in newscale.py):

    df: pl.DataFrame
    try:
        df = get_newscale_data_lazy(newscale_log_path)  # type: ignore [assignment]
    except Exception:
        df = get_newscale_data(newscale_log_path)

    # if experiment date isn't in df, the log file didn't cover this experiment -
    # we can't continue
    if start.dt.date() not in df["last_movement_dt"].dt.date():
        raise IndexError(
            f"no movement data found for experiment date {start.dt.date()} in {newscale_log_path.as_posix()}"
        )

I'm not sure what a robust solution is or what operations are/are not allowed on a lazy dataframe under these conditions. Perhaps wrapping the subsequent access of df['last_movement_dt'] into the try...except?

The traceback:

Traceback (most recent call last):
  File "C:\Users\carter.peene\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\carter.peene\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\carter.peene\Desktop\Projects\openscope_upload\process_ephys_session.py", line 205, in <module>
    main()
  File "C:\Users\carter.peene\Desktop\Projects\openscope_upload\process_ephys_session.py", line 201, in main
    generate_jsons(**vars(parse_args()))
  File "C:\Users\carter.peene\Desktop\Projects\openscope_upload\process_ephys_session.py", line 184, in generate_jsons
    project_name = generate_session_json(session_id, session, overwrite=overwrite)
  File "C:\Users\carter.peene\Desktop\Projects\openscope_upload\process_ephys_session.py", line 164, in generate_session_json
    session_mapper.generate_session_json()
  File "C:\Users\carter.peene\Desktop\Projects\aind-metadata-mapper\src\aind_metadata_mapper\open_ephys\camstim_ephys_session.py", line 135, in generate_session_json
    data_streams=self.data_streams(),
  File "C:\Users\carter.peene\Desktop\Projects\aind-metadata-mapper\src\aind_metadata_mapper\open_ephys\camstim_ephys_session.py", line 323, in data_streams
    data_streams.append(self.ephys_stream())
  File "C:\Users\carter.peene\Desktop\Projects\aind-metadata-mapper\src\aind_metadata_mapper\open_ephys\camstim_ephys_session.py", line 277, in ephys_stream
    ephys_modules=self.ephys_modules(),
  File "C:\Users\carter.peene\Desktop\Projects\aind-metadata-mapper\src\aind_metadata_mapper\open_ephys\camstim_ephys_session.py", line 222, in ephys_modules
    newscale_coords = npc_sessions.get_newscale_coordinates(
  File "C:\Users\carter.peene\Desktop\Projects\openscope_upload\upload_env\lib\site-packages\npc_sessions\utils\newscale.py", line 134, in get_newscale_coordinates
    if start.dt.date() not in df["last_movement_dt"].dt.date():
  File "C:\Users\carter.peene\Desktop\Projects\openscope_upload\upload_env\lib\site-packages\polars\lazyframe\frame.py", line 517, in __getitem__
    raise TypeError(msg)
TypeError: 'LazyFrame' object is not subscriptable (aside from slicing)

bjhardcastle commented 3 months ago

@rcpeene sorry I missed this.

I just removed the lazyframe stuff, it's not necessary. But it looks like doing a lazy scan of a csv from S3 is working now, which is nice.

For completeness: it failed because you can't get a column from a LazyFrame without first collecting the results. df.collect()['last_movement_dt'] would have fixed it.