bids-standard / pybids

Python tools for querying and manipulating BIDS datasets.
https://bids-standard.github.io/pybids/
MIT License

Evaluate ancpbids as a successor to bids.layout #831

Open adelavega opened 2 years ago

adelavega commented 2 years ago

The ancpbids project has made formidable progress in implementing a BIDSLayout-like API, with marked performance improvements, using a different underlying implementation with a custom query language.

Given the limited resources available to maintain key community-led BIDS infrastructure, it is important to compare pybids to ancpbids and evaluate the possibility of combining efforts, in order to prevent fragmentation of the ecosystem.

adelavega commented 2 years ago

ancpBIDS:

  • instead of SQL, we have a custom query language without any external dependencies
  • the data structure used to query against is an in-memory graph representation of the BIDS dataset
  • the schema definitions are used to generate code that can assist in developing pipelines using professional IDEs (like PyCharm or VS Code) AND to drive the validation
  • most concepts are implemented as plug-ins
  • each BIDS schema version is represented in its own Python module, which allows the schema to be chosen dynamically according to the BIDSVersion field of the dataset (at the moment, 1.7.0 and 1.7.1 are available)
  • the most important aspects of BIDSLayout are covered; modality-specific functions are omitted to keep the API clean

Additional note: There is an example analysis pipeline which demonstrates how ancpBIDS can be used: https://github.com/ANCPLabOldenburg/ancp-bids-apps/blob/main/ancpbidsapps/analysis/nilearn_first_level.py

Originally posted by @erdalkaraca in https://github.com/bids-standard/pybids/issues/818#issuecomment-1085949208
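For context, a minimal sketch of how querying through the ancpbids facade can look, using the from ancpbids import BIDSLayout entry point mentioned later in this thread; the dataset path, and the exact set of supported keywords, are assumptions here:

    from ancpbids import BIDSLayout

    # One file-system scan builds the in-memory graph.
    layout = BIDSLayout("/path/to/ds000001")  # hypothetical dataset path

    # Queries run against the graph, mirroring the pybids-style get() API.
    bold_files = layout.get(suffix="bold", return_type="filename")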

adelavega commented 2 years ago

The main current limitations of pybids:

adelavega commented 2 years ago

An informative exercise may be to use ancpbids instead of BIDSLayout in other modules of pybids, such as variables and modeling. This would highlight differences in functionality between the two packages. We could also create a suite of benchmarks that either package should be able to run, in order to evaluate them more rigorously (including performance).

erdalkaraca commented 2 years ago

I initially analyzed the runtime performance behavior of pybids: most of the time was spent initializing the SQL database via SQLAlchemy. Note that SQLAlchemy, as an object-relational mapper, is typically used in business applications, where it provides significant convenience mechanisms that simplify communication with a database when the business domain is complex. For pybids, it would have been better not to use SQLAlchemy as a mediation layer, since there are only a handful of domain entities (BIDSFile, Entity, FileAssociation, etc.). That is, communicating directly with the underlying SQLite database without SQLAlchemy (via Python's DB-API, e.g. the built-in sqlite3 module) would already yield significant performance benefits.
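For illustration, a minimal sketch of direct DB-API access through Python's built-in sqlite3 module; the table layout below is a hypothetical simplification, not pybids' actual schema:

    import sqlite3

    # Hypothetical, simplified schema covering the handful of domain
    # entities; not the actual pybids database layout.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE files (path TEXT, subject TEXT, suffix TEXT)")
    conn.executemany(
        "INSERT INTO files VALUES (?, ?, ?)",
        [("sub-01/func/sub-01_task-rest_bold.nii.gz", "01", "bold"),
         ("sub-01/anat/sub-01_T1w.nii.gz", "01", "T1w")],
    )

    # Queries go straight to SQLite, with no ORM mediation layer in between.
    rows = conn.execute(
        "SELECT path FROM files WHERE suffix = ?", ("bold",)
    ).fetchall()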

erdalkaraca commented 2 years ago

I think that was the initial intention behind a "pybids lite", which evolved into ancpBIDS.

There is already a very simple benchmark (as unit tests) that we could further elaborate on: https://github.com/ANCPLabOldenburg/ancp-bids/blob/main/tests/manual/test_benchmark.py

> An informative exercise may be to use ancpbids instead of BIDSLayout in other modules of pybids, such as variables and modeling. This would highlight differences in functionality between the two packages. We could also create a suite of benchmarks that either package should be able to run, in order to evaluate them more rigorously (including performance).

erdalkaraca commented 2 years ago

I just ran the "benchmark" unit test. Note the performance difference (4 sec vs. 1 min):

[screenshot: benchmark timing output, roughly 4 s for ancpbids vs. 1 min for pybids]

adelavega commented 2 years ago

That's great. Question: is ancpbids indexing metadata by default? I think it would be useful to run pybids in those unit tests with index_metadata=False to evaluate the impact of metadata indexing.
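For reference, metadata indexing is controlled by a constructor flag in pybids; a minimal sketch (the dataset path is hypothetical):

    from bids.layout import BIDSLayout

    # Index only the file tree, skipping the JSON sidecars, to separate
    # the cost of metadata indexing from the file-system walk itself.
    layout = BIDSLayout("/data/ds000114", index_metadata=False)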

erdalkaraca commented 2 years ago

ancpbids scans the file system once and builds up the graph in memory. Since the amount of metadata is very small compared to the imaging data, metadata is stored in the graph as additional leaf nodes. As in-memory access is fast, there is no need to index metadata separately. For pybids, I just tried index_metadata=False and execution time decreased to 53 sec.
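Schematically, that approach could look like the sketch below; this is an illustration of the idea only, not ancpbids' actual data structures:

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        """Illustrative graph node; not an ancpbids class."""
        name: str
        metadata: dict = field(default_factory=dict)  # sidecar JSON as a leaf payload
        children: list = field(default_factory=list)

        def query(self, **criteria):
            # Walk the in-memory graph; no database or separate index involved.
            hits = []
            if criteria and all(self.metadata.get(k) == v for k, v in criteria.items()):
                hits.append(self)
            for child in self.children:
                hits.extend(child.query(**criteria))
            return hits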

If you have a more specific test case, let me know; I am happy to try it out.

adelavega commented 2 years ago

Another idea is to run the main pybids unit tests with ancpbids. Obviously many will fail simply because the APIs are not identical, but it would be useful to compare the differences.

erdalkaraca commented 2 years ago

There are several unit tests which use BIDSLayout as their querying interface, for example: https://github.com/ANCPLabOldenburg/ancp-bids/blob/main/tests/auto/test_query.py

I may refactor/extract some use cases into the benchmark to make it more exhaustive. I filed a new issue to track this idea: https://github.com/ANCPLabOldenburg/ancp-bids/issues/49

adelavega commented 2 years ago

@erdalkaraca I'm trying to reconcile your performance numbers with a previous investigation into performance:

see:

Basically, we concluded that most time was actually spent in os.walk and not in SQLAlchemy. Hence, I'm a bit confused about how you managed to optimize this so much. It's possible we didn't profile this correctly and drew the wrong conclusion.
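For anyone re-checking this, a minimal profiling sketch using the standard library's cProfile; the dataset path is hypothetical:

    import cProfile
    import pstats

    from bids.layout import BIDSLayout

    # Profile layout construction, then look for os.walk vs. SQLAlchemy
    # frames in the cumulative-time listing.
    profiler = cProfile.Profile()
    profiler.enable()
    BIDSLayout("/data/ds000114")  # hypothetical dataset path
    profiler.disable()
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)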

cc: @gkiar do you remember where your profiling results were?

Note: some of this profiling is pretty old, so it's possible that pre-SQL pybids was even slower.

I also think it's interesting that @tyarkoni noted he moved to SQL for maintainability, which goes against @tsalo's experience.

Not sure I have the right answer here, but I'm just trying to pin down the difference in approaches.

gkiar commented 2 years ago

My profiling results are in https://github.com/bids-standard/pybids/issues/285 ; good luck!

adelavega commented 2 years ago

@erdalkaraca what's the best way to contact you? we're having a bids event soon that i'd like to invite you to participate in to discuss this further. if you want, you can reach me at the email listed on my profile.

erdalkaraca commented 2 years ago

On our lab server (using network storage), the dataset used in the benchmark took 6.5 min to load/index with pybids. Of those 6.5 min, I could pinpoint 50% of the execution time to SQLAlchemy interaction, with the rest mostly being file system operations like os.walk, as you mentioned. On a lab dataset with lots of derivatives, a lab member reported that pybids took more than an hour to load/index the dataset, and about 2 min when reusing the database. With ancpbids it took only 13 sec.

> @erdalkaraca I'm trying to reconcile your performance numbers with a previous investigation into performance:
>
> see:
>
> Basically, we concluded that most time was actually spent in os.walk and not in SQLAlchemy. Hence, I'm a bit confused about how you managed to optimize this so much. It's possible we didn't profile this correctly and drew the wrong conclusion.
>
> cc: @gkiar do you remember where your profiling results were?
>
> Note: some of this profiling is pretty old, so it's possible that pre-SQL pybids was even slower.
>
> I also think it's interesting that @tyarkoni noted he moved to SQL for maintainability, which goes against @tsalo's experience.
>
> Not sure I have the right answer here, but I'm just trying to pin down the difference in approaches.
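As a side note on the "about 2 min when reusing the database" figure: pybids can persist its index to disk via the database_path argument, so later sessions skip re-indexing; a minimal sketch (paths are hypothetical):

    from bids.layout import BIDSLayout

    # First run builds the SQLite index and saves it; subsequent runs
    # reload it instead of re-walking the file system.
    layout = BIDSLayout("/data/lab_dataset",
                        database_path="/data/lab_dataset/.pybids_cache")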

erdalkaraca commented 2 years ago

@gkiar out of curiosity: could you use ancpbids to load the dataset you mentioned in #285?

you would have to from ancpbids import BIDSLayout after pip install ancpbids
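In other words, the swap is a one-line import change; a minimal sketch (the dataset path is hypothetical, and the constructor is assumed to accept the same root argument):

    # pip install ancpbids
    from ancpbids import BIDSLayout  # instead of: from bids.layout import BIDSLayout

    layout = BIDSLayout("/path/to/ds000001")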

gkiar commented 2 years ago

> @gkiar out of curiosity: could you use ancpbids to load the dataset you mentioned in #285?
>
> you would have to from ancpbids import BIDSLayout after pip install ancpbids

Happily! I'm a bit swamped and don't have the reference environment anymore, so it'll have to wait a bit, but consider it added to my plate 🙂 If you attend the BIDS meeting @adelavega mentioned, we can also sync up more about this then.

gkiar commented 2 years ago

Updated profiling...

Tests

    # Timing helpers: construct a BIDSLayout with each library and report
    # how long indexing the dataset at `pth` takes.

    def ancp_test(pth):
        import time
        from ancpbids import BIDSLayout
        start = time.time()
        bl = BIDSLayout(pth)
        dur = time.time() - start
        return bl, dur

    def pybids_test(pth):
        import time
        from bids.layout import BIDSLayout
        start = time.time()
        bl = BIDSLayout(pth)
        dur = time.time() - start
        return bl, dur

Results

    Nsubs   Library    Time
    -----   --------   --------
    492     pybids     348.6 s
    492     ancpbids   7.7 s
    1334    pybids     1215.4 s
    1334    ancpbids   17.3 s

Next Step

@erdalkaraca this is looking great! I'll keep poking at it this week. Will you be online?

erdalkaraca commented 2 years ago

Thanks a lot for posting the results, Greg @gkiar

Yes, I will be online for the zoom/discord call this week to talk about the implementation.