childmindresearch / bids2table

Efficiently index large-scale BIDS neuroimaging datasets and derivatives
https://childmindresearch.github.io/bids2table/
MIT License

Add benchmarks #14

Closed clane9 closed 1 year ago

clane9 commented 1 year ago

Add benchmarks comparing PyBIDS, ancpBIDS, and bids2table for indexing and querying large-ish datasets.
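
A rough sketch of what the indexing side of the timing harness looks like (not the exact benchmark script in this PR; in particular, the bids2table call is an assumption and may differ from the version used here):

```python
# Minimal indexing-benchmark sketch. The dataset path is a placeholder.
import time

def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

root = "/path/to/bids/dataset"  # placeholder

# PyBIDS: SQLite-backed index; index_metadata=False skips the JSON sidecars.
from bids import BIDSLayout
layout, t_pybids = timed(BIDSLayout, root, index_metadata=True)

# ancpBIDS: in-memory dataset model (metadata is not part of the index).
from ancpbids import load_dataset
ds, t_ancp = timed(load_dataset, root)

# bids2table: Arrow/Parquet-backed table (call signature assumed; check your version).
from bids2table import bids2table
tab, t_b2t = timed(bids2table, root)

print({"pybids": t_pybids, "ancpbids": t_ancp, "bids2table": t_b2t})
```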

codecov[bot] commented 1 year ago

Codecov Report

Patch and project coverage have no change.

Comparison is base (886d184) 90.70% compared to head (666c854) 90.70%.

:exclamation: Current head 666c854 differs from pull request most recent head 1c0ac57. Consider uploading reports for the commit 1c0ac57 to get more accurate results

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main      #14   +/-   ##
=======================================
  Coverage   90.70%   90.70%
=======================================
  Files          10       10
  Lines         441      441
=======================================
  Hits          400      400
  Misses         41       41
```

:umbrella: View full report in Codecov by Sentry.

clane9 commented 1 year ago

Hey @adelavega, would love to get your thoughts on these benchmarks. Do you think the comparisons are fair? What do you think about the results (bids2table vs pybids: roughly 4x faster indexing, 150x faster with 64 parallel workers, a 90x smaller index on disk, and 20x faster queries)?

adelavega commented 1 year ago

Overall looks good!

bids2table certainly has some nice advantages.

I tried it out on this dataset (https://openneuro.org/datasets/ds002837/versions/2.0.0), which shows similar differences:

pybids

{'version': '0.15.6.post0.dev97',
 'elapsed': 10.908372353000232,
 'size_mb': 46.956}

pybids w/ index_metadata=False

{'version': '0.15.6.post0.dev97',
 'elapsed': 6.4482560209999065,
 'size_mb': 1.376}

ancpbids

{'version': '0.2.2', 'elapsed': 1.459769660999882, 'size_mb': nan}

bids2table 1 core

{'version': '0.1.dev34+g886d184',
 'elapsed': 2.9782187279997743,
 'size_mb': 12.576}

bids2table 8 cores

{'version': '0.1.dev34+g886d184',
 'elapsed': 0.8461213739992672,
 'size_mb': 12.712}

Interesting that ancpbids is faster with one core, but I'm guessing it's because it doesn't read the JSON sidecars. I'm also guessing that pybids is slowed down by a combination of SQLAlchemy overhead and general code inefficiencies, which I want to dig into.

Given that you've set this benchmark up, I would try it on several public datasets to get a better estimate of the performance differences.

adelavega commented 1 year ago

Other feedback on bids2table:

- pandas API is not as familiar as you might expect
- I wouldn't call the metadata "sidecar" once indexed, since it's actually a combination of various sidecar files.

adelavega commented 1 year ago

Regarding the query benchmarks:

In [54]: %timeit layout.get_subjects()
29.3 ms ± 2.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [55]: %timeit b2t_df["sub"].unique().tolist()
113 µs ± 1.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Interestingly, using a more efficient SQLAlchemy query I got this result in pybids (not currently implemented in master though):

In [68]: %timeit layout.session.query(Tag._value).filter_by(entity_name='subject').distinct().all()
2.93 ms ± 990 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

bids2table is still much faster!

Also: ancpbids does not support indexing metadata, since it only loads it when it's needed and not as part of the index.

clane9 commented 1 year ago

Thanks so much for the feedback!

Given that you've set this benchmark up, I would try it on several public datasets to get a better estimate of the performance differences.

Totally agree. Filling out the benchmark with a few more datasets makes a lot of sense. Ideally with a range of sizes and on different machines. One or both of these factors could be part of the reason why ancpbids is faster than single-thread bids2table in your example.
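
As a rough sketch of what that sweep could look like (placeholder paths; only the pybids timing shown here, with the other libraries swapped in the same way):

```python
import time
from bids import BIDSLayout

datasets = ["/data/ds000102", "/data/ds002837", "/data/ds000228"]  # placeholder paths
for root in datasets:
    start = time.perf_counter()
    BIDSLayout(root)  # swap in ancpbids / bids2table here for the other rows
    print(f"{root}: pybids indexing took {time.perf_counter() - start:.2f}s")
```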

pandas API is not as familiar as you might expect

Totally. This is also feedback I've gotten from others in my group. I think the pandas API is pretty flexible, but also pretty complicated and not all that well known. We've been discussing implementing a higher-level pybids-like API on top, perhaps following the proposed redesign. This would also open the door to a possible merger down the road, if there were interest.
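
To illustrate, a hypothetical sketch of what a thin pybids-like layer over the bids2table DataFrame could look like (the class and method names are made up, not an existing bids2table API; column names follow the entity short names used above):

```python
import pandas as pd

class TableLayout:
    """Hypothetical pybids-flavored wrapper around the bids2table DataFrame."""

    def __init__(self, df: pd.DataFrame):
        self.df = df

    def get_subjects(self) -> list:
        # Equivalent to df["sub"].unique().tolist() from the query benchmark above.
        return self.df["sub"].dropna().unique().tolist()

    def get(self, **entities) -> pd.DataFrame:
        """Filter rows by entity values, e.g. get(task="rest", suffix="bold")."""
        mask = pd.Series(True, index=self.df.index)
        for name, value in entities.items():
            mask &= self.df[name] == value
        return self.df[mask]
```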

I wouldn't call the metadata "sidecar" once indexed, since it's actually a combination of various sidecar files.

Ah, because of the inheritance? I'm considering just flattening out the fields in the sidecar column into their own columns, a la pd.json_normalize, and putting them in a general "metadata" column group.
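
Something along these lines, as a rough sketch with toy values (the "sidecar" column of dicts and the metadata keys are illustrative):

```python
import pandas as pd

# Toy table: one row per file, with the raw sidecar dict in a single column.
df = pd.DataFrame({
    "sub": ["01", "02"],
    "sidecar": [
        {"RepetitionTime": 2.0, "EchoTime": 0.030},
        {"RepetitionTime": 2.0, "EchoTime": 0.025},
    ],
})

# Flatten the dicts into their own columns and group them under a "metadata" prefix.
meta = pd.json_normalize(df["sidecar"].tolist()).add_prefix("metadata.")
flat = pd.concat([df.drop(columns="sidecar"), meta], axis=1)
print(flat.columns.tolist())
# ['sub', 'metadata.RepetitionTime', 'metadata.EchoTime']
```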

Can you try querying for all subject ids for only a specific task? i.e. a subset operation and then a unique operation? Also, I'm surprised pybids is that bad

Ya we were pretty surprised at how bad pybids was here. We chalked it up as an outlier. I'll try to dig into it more.
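
For reference, the subset-then-unique query would look something like this on the bids2table DataFrame (the task name is a placeholder; `b2t_df` and `layout` are the objects from the timings above, and the pybids call uses the standard `return_type`/`target` arguments):

```python
# bids2table / pandas: subset to one task, then take the unique subject ids.
subjects = b2t_df.loc[b2t_df["task"] == "rest", "sub"].dropna().unique().tolist()

# pybids equivalent for comparison.
subjects = layout.get(task="rest", return_type="id", target="subject")
```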

A side note on query performance: although pandas is more than good enough here, there are now even more optimized dataframe libraries (e.g. polars, duckdb), all of which interface well with Arrow/Parquet. So there should be room for even better performance.
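
For example, a sketch of the same unique-subjects query run against a Parquet index with polars and duckdb (the file path is a placeholder, and the flat "sub" column follows the DataFrame usage above; the actual Parquet schema may differ):

```python
import duckdb
import polars as pl

# polars: lazily scan the Parquet file and collect the unique subject ids.
subjects_pl = (
    pl.scan_parquet("index.parquet")
    .select(pl.col("sub").unique())
    .collect()["sub"]
    .to_list()
)

# duckdb: run SQL directly over the Parquet file.
subjects_db = duckdb.sql("SELECT DISTINCT sub FROM 'index.parquet'").df()["sub"].tolist()
```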

adelavega commented 1 year ago

Ah, because of the inheritance? I'm considering just flattening out the fields in the sidecar column into their own columns, a la pd.json_normalize, and putting them in a general "metadata" column group.

+1 on this

Re: other libraries, I would argue querying is quite fast as it is, so I would err on the side of familiarity (pandas might be complex, but at least it's familiar).

Although we could consider another db for the redesigned API project.

adelavega commented 1 year ago

To expand a bit: I think we should focus on optimizing indexing time more than querying time.

As an example, PyBIDS has some unacceptably slow queries, but even the worst one takes about 1.2 s; if we used SQLAlchemy more efficiently, it would be an order of magnitude faster, around 0.12 s.

That tells me that any of these solutions will be performant enough, as long as the translation between the high-level API and the low-level query language is done properly (which is the main problem in PyBIDS).

Obviously bids2table is orders of magnitude faster, which is cool and useful, but my point is that once performance clears a certain floor, we should use other heuristics to guide us.

Where PyBIDS really struggles is indexing time, and that's where we got the most complaints. So I see that as bids2table's biggest contribution.

Let's keep this in mind when building a high-level API, because sometimes the most difficult thing is mapping an easy-to-use query language onto the backend in a way that performs well.

clane9 commented 1 year ago

Merging even though a few improvements to the benchmarks are still needed.

These will be addressed in future PRs.