bids-standard / pybids

Python tools for querying and manipulating BIDS datasets.
https://bids-standard.github.io/pybids/
MIT License

Evaluate ancpbids as a successor to bids.layout #831

Open adelavega opened 2 years ago

adelavega commented 2 years ago

The ancpbids project has made formidable progress in implementing a BIDSLayout-like API, with marked performance improvements, using a different underlying implementation with a custom query language.

Given the limited resources available to maintain key community-led BIDS infrastructure, it is important to compare pybids to ancpbids and evaluate the possibility of combining efforts, in order to prevent fragmentation of the ecosystem.

adelavega commented 2 years ago

ancpBIDS:

  • instead of SQL, we have a custom query language without any external dependencies
  • the data structure used to query against is an in-memory graph representation of the BIDS dataset
  • the schema definitions are used to generate code that can assist in developing pipelines using professional IDEs (like PyCharm or VS Code) AND to drive the validation
  • most concepts are implemented as plug-ins
  • each BIDS schema version is represented in its own Python module, which allows the schema to be chosen dynamically according to the BIDSVersion field of the dataset (at the moment, 1.7.0 and 1.7.1 are available)
  • the most important aspects of BIDSLayout are covered; modality-specific functions are omitted to keep the API clean

Additional note: There is an example analysis pipeline which demonstrates how ancpBIDS can be used: https://github.com/ANCPLabOldenburg/ancp-bids-apps/blob/main/ancpbidsapps/analysis/nilearn_first_level.py

Originally posted by @erdalkaraca in https://github.com/bids-standard/pybids/issues/818#issuecomment-1085949208
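For context, a minimal sketch of how querying through the ancpbids facade can look, using the from ancpbids import BIDSLayout entry point mentioned later in this thread; the dataset path, and the exact set of supported keywords, are assumptions here:

    from ancpbids import BIDSLayout

    # One file-system scan builds the in-memory graph.
    layout = BIDSLayout("/path/to/ds000001")  # hypothetical dataset path

    # Queries run against the graph, mirroring the pybids-style get() API.
    bold_files = layout.get(suffix="bold", return_type="filename")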

adelavega commented 2 years ago

The main current limitations of pybids:

adelavega commented 2 years ago

An informative exercise may be to use ancpbids instead of BIDSLayout in other modules of pybids, such as variables and modeling. This would highlight differences in functionality between the two packages. We could also create a suite of benchmarks that either package should be able to run, in order to evaluate them more rigorously (including performance).

erdalkaraca commented 2 years ago

I initially analyzed the runtime performance behavior of pybids: most of the time was spent initializing the SQL database via SQLAlchemy. Note that SQLAlchemy, as an object-relational mapper, is typically used in business applications, where it provides significant convenience mechanisms that simplify communication with a database when the business domain is complex. For pybids, it would have been better not to use SQLAlchemy as a mediation layer, since there are only a handful of domain entities (BIDSFile, Entity, FileAssociation, etc.). That is, communicating directly with the underlying SQLite database without SQLAlchemy (via Python's DB-API, e.g. the built-in sqlite3 module) would already yield significant performance benefits.
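For illustration, a minimal sketch of direct DB-API access through Python's built-in sqlite3 module; the table layout below is a hypothetical simplification, not pybids' actual schema:

    import sqlite3

    # Hypothetical, simplified schema covering the handful of domain
    # entities; not the actual pybids database layout.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE files (path TEXT, subject TEXT, suffix TEXT)")
    conn.executemany(
        "INSERT INTO files VALUES (?, ?, ?)",
        [("sub-01/func/sub-01_task-rest_bold.nii.gz", "01", "bold"),
         ("sub-01/anat/sub-01_T1w.nii.gz", "01", "T1w")],
    )

    # Queries go straight to SQLite, with no ORM mediation layer in between.
    rows = conn.execute(
        "SELECT path FROM files WHERE suffix = ?", ("bold",)
    ).fetchall()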

erdalkaraca commented 2 years ago

I think that was the initial intention behind a "pybids lite", which evolved into ancpBIDS.

There is already a very simple benchmark (as unit tests) that we could further elaborate on: https://github.com/ANCPLabOldenburg/ancp-bids/blob/main/tests/manual/test_benchmark.py

> An informative exercise may be to use ancpbids instead of BIDSLayout in other modules of pybids, such as variables and modeling. This would highlight differences in functionality between the two packages. We could also create a suite of benchmarks that either package should be able to run, in order to evaluate them more rigorously (including performance).

erdalkaraca commented 2 years ago

I just ran the "benchmark" unit test. Note the performance difference (4 sec vs. 1 min):

[screenshot: benchmark timing output, roughly 4 s for ancpbids vs. 1 min for pybids]

adelavega commented 2 years ago

That's great. Question: is ancpbids indexing metadata by default? I think it would be useful to run pybids in those unit tests with index_metadata=False to evaluate the impact of metadata indexing.
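For reference, metadata indexing is controlled by a constructor flag in pybids; a minimal sketch (the dataset path is hypothetical):

    from bids.layout import BIDSLayout

    # Index only the file tree, skipping the JSON sidecars, to separate
    # the cost of metadata indexing from the file-system walk itself.
    layout = BIDSLayout("/data/ds000114", index_metadata=False)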

erdalkaraca commented 2 years ago

ancpbids scans the file system once and builds up the graph in memory. Since the amount of metadata is very small compared to the imaging data, metadata is stored in the graph as additional leaf nodes. As in-memory access is fast, there is no need to index metadata separately. For pybids, I just tried index_metadata=False and execution time decreased to 53 sec.
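Schematically, that approach could look like the sketch below; this is an illustration of the idea only, not ancpbids' actual data structures:

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        """Illustrative graph node; not an ancpbids class."""
        name: str
        metadata: dict = field(default_factory=dict)  # sidecar JSON as a leaf payload
        children: list = field(default_factory=list)

        def query(self, **criteria):
            # Walk the in-memory graph; no database or separate index involved.
            hits = []
            if criteria and all(self.metadata.get(k) == v for k, v in criteria.items()):
                hits.append(self)
            for child in self.children:
                hits.extend(child.query(**criteria))
            return hits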

If you have a more specific test case, let me know; I am happy to try it out.

adelavega commented 2 years ago

Another idea is to run the main pybids unit tests with ancpbids. Obviously many will fail simply because the APIs are not identical, but it would be useful to compare the differences.

erdalkaraca commented 2 years ago

There are several unit tests which use BIDSLayout as their querying interface, for example: https://github.com/ANCPLabOldenburg/ancp-bids/blob/main/tests/auto/test_query.py

I may refactor/extract some use cases into the benchmark to make it more exhaustive. I filed a new issue to track this idea: https://github.com/ANCPLabOldenburg/ancp-bids/issues/49

adelavega commented 2 years ago

@erdalkaraca I'm trying to reconcile your performance numbers with a previous investigation into performance:

see:

Basically, we concluded that most time was actually spent in os.walk and not in SQLAlchemy. Hence, I'm a bit confused about how you managed to optimize this so much. It's possible we didn't profile this correctly and drew the wrong conclusion.
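For anyone re-checking this, a minimal profiling sketch using the standard library's cProfile; the dataset path is hypothetical:

    import cProfile
    import pstats

    from bids.layout import BIDSLayout

    # Profile layout construction, then look for os.walk vs. SQLAlchemy
    # frames in the cumulative-time listing.
    profiler = cProfile.Profile()
    profiler.enable()
    BIDSLayout("/data/ds000114")  # hypothetical dataset path
    profiler.disable()
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)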

cc: @gkiar do you remember where your profiling results were?

Note: some of this profiling is pretty old, so it's possible that pre-SQL pybids was even slower.

I also think it's interesting that @tyarkoni noted he moved to SQL for maintainability, which goes against @tsalo's experience.

Not sure I have the right answer here, but I'm just trying to pin down the difference in approaches.

gkiar commented 2 years ago

My profiling results are in https://github.com/bids-standard/pybids/issues/285 ; good luck!

adelavega commented 2 years ago

@erdalkaraca what's the best way to contact you? we're having a bids event soon that i'd like to invite you to participate in to discuss this further. if you want, you can reach me at the email listed on my profile.

erdalkaraca commented 2 years ago

On our lab server (using network storage), the dataset used in the benchmark took 6.5 min to load/index with pybids. Of those 6.5 min, I could pinpoint 50% of the execution time to SQLAlchemy interaction, with the rest mostly being file system operations like os.walk, as you mentioned. On a lab dataset with lots of derivatives, a lab member reported that pybids took more than an hour to load/index the dataset, and about 2 min when reusing the database. With ancpbids it took only 13 sec.

> @erdalkaraca I'm trying to reconcile your performance numbers with a previous investigation into performance:
>
> see:
>
> Basically, we concluded that most time was actually spent in os.walk and not in SQLAlchemy. Hence, I'm a bit confused about how you managed to optimize this so much. It's possible we didn't profile this correctly and drew the wrong conclusion.
>
> cc: @gkiar do you remember where your profiling results were?
>
> Note: some of this profiling is pretty old, so it's possible that pre-SQL pybids was even slower.
>
> I also think it's interesting that @tyarkoni noted he moved to SQL for maintainability, which goes against @tsalo's experience.
>
> Not sure I have the right answer here, but I'm just trying to pin down the difference in approaches.
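As a side note on the "about 2 min when reusing the database" figure: pybids can persist its index to disk via the database_path argument, so later sessions skip re-indexing; a minimal sketch (paths are hypothetical):

    from bids.layout import BIDSLayout

    # First run builds the SQLite index and saves it; subsequent runs
    # reload it instead of re-walking the file system.
    layout = BIDSLayout("/data/lab_dataset",
                        database_path="/data/lab_dataset/.pybids_cache")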

erdalkaraca commented 2 years ago

@gkiar out of curiosity: could you use ancpbids to load the dataset you mentioned in #285?

you would have to from ancpbids import BIDSLayout after pip install ancpbids
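In other words, the swap is a one-line import change; a minimal sketch (the dataset path is hypothetical, and the constructor is assumed to accept the same root argument):

    # pip install ancpbids
    from ancpbids import BIDSLayout  # instead of: from bids.layout import BIDSLayout

    layout = BIDSLayout("/path/to/ds000001")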

gkiar commented 2 years ago

> @gkiar out of curiosity: could you use ancpbids to load the dataset you mentioned in #285?
>
> you would have to from ancpbids import BIDSLayout after pip install ancpbids

Happily! I'm a bit swamped and don't have the reference environment anymore, so it'll have to wait a bit, but consider it added to my plate 🙂 If you attend the BIDS meeting @adelavega mentioned, we can also sync up more about this then.

gkiar commented 2 years ago

Updated profiling...

Tests

    # Timing helpers: construct a BIDSLayout with each library and report
    # how long indexing the dataset at `pth` takes.

    def ancp_test(pth):
        import time
        from ancpbids import BIDSLayout
        start = time.time()
        bl = BIDSLayout(pth)
        dur = time.time() - start
        return bl, dur

    def pybids_test(pth):
        import time
        from bids.layout import BIDSLayout
        start = time.time()
        bl = BIDSLayout(pth)
        dur = time.time() - start
        return bl, dur

Results

    Nsubs   Library    Time
    -----   --------   --------
    492     pybids     348.6 s
    492     ancpbids   7.7 s
    1334    pybids     1215.4 s
    1334    ancpbids   17.3 s

Next Step

@erdalkaraca this is looking great! I'll keep poking at it this week. Will you be online?

erdalkaraca commented 2 years ago

Thanks a lot for posting the results, Greg @gkiar

Yes, I will be online for the zoom/discord call this week to talk about the implementation.