NeurodataWithoutBorders / pynwb

A Python API for working with Neurodata stored in the NWB Format
https://pynwb.readthedocs.io

[Bug]: Adding Large Stimulus Table with add_interval takes incredibly long #1946

Closed rcpeene closed 3 months ago

rcpeene commented 3 months ago

What happened?

I am trying to generate an NWB file with a rather large stim table row by row using TimeIntervals.add_interval(). The stim table for our experiment happens to be very large (>40,000 rows). On two different machines, this takes more than 10 hours. The add_interval operation seems to be the bottleneck, and each call takes longer as the table grows.

After digging through the code it looks like it might be __calculate_idx_count, perhaps bisect.

Is there a more direct way to generate a TimeIntervals table from an existing table (while ensuring that the types of each column are properly cast)? Or is there a fix for the slowness of the add_interval operation?

Steps to Reproduce

Run this snippet to generate a TimeIntervals object from a very large table:

        presentation_interval = create_stimulus_presentation_time_interval(
            name=f"{stim_name}_presentations",
            description=interval_description,
            columns_to_add=cleaned_table.columns,
        )

        for i, row in enumerate(cleaned_table.itertuples(index=False)):
            row = row._asdict()
            row = {key: str(value) for key, value in row.items()}
            start_column = 'Start'  # Adjust this as per the actual column name in CSV
            end_column = 'End'  # Adjust this as per the actual column name in CSV
            start_time = float(row[start_column])
            end_time = float(row[end_column])
            presentation_interval.add_interval(
                **row,
                start_time=start_time, stop_time=end_time,
                tags="stimulus_time_interval", timeseries=ts
            )

        nwbfile.add_time_intervals(presentation_interval)

Traceback

No traceback

Operating System

Windows

Python Executable

Conda

Python Version

3.10

Package Versions

pynwb==2.8.1


rcpeene commented 3 months ago

The real bottleneck appears to be in DynamicTable.add_row()
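As an illustration of why this kind of bottleneck scales so badly, here is a minimal sketch in plain Python (not pynwb's actual internals): if every append triggers a validation pass over the whole table, the way a per-row ragged-array check does, N insertions cost O(N²) in total.

```python
import timeit

def insert_rows(n, check_each_row):
    """Append n (start, stop) rows; optionally rescan the whole table
    after each append, mimicking a per-row consistency check."""
    table = []
    for i in range(n):
        table.append((float(i), float(i) + 1.0))
        if check_each_row:
            # Full pass over the table on every insert -> O(n**2) overall.
            widths = {len(row) for row in table}
            assert len(widths) == 1  # all rows have the same shape
    return table

fast = timeit.timeit(lambda: insert_rows(5000, check_each_row=False), number=1)
slow = timeit.timeit(lambda: insert_rows(5000, check_each_row=True), number=1)
print(f"no per-row check: {fast:.3f}s   per-row check: {slow:.3f}s")
```

At 5,000 rows the per-row check already does roughly 12.5 million row inspections instead of 5,000 appends, which is why skipping the check per call (as suggested below in this thread) helps so much on a 40,000-row table.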

stephprince commented 3 months ago

Hi @rcpeene,

One way to speed up the add_interval operation would be to add the argument check_ragged=False. We recently added this check to provide a better warning for ragged arrays, but this operation can cause performance issues for larger tables since it checks the data on each call to add_row / add_interval.

presentation_interval.add_interval(
    **row,
    start_time=start_time, stop_time=end_time,
    tags="stimulus_time_interval", timeseries=ts, check_ragged=False
)

Could you try setting check_ragged to False and see if that improves your performance?

rcpeene commented 3 months ago

This was remarkably faster and completed in a few minutes. Thanks!