bioinformatics-ua / dicoogle

Dicoogle - Open Source PACS
http://www.dicoogle.com/
GNU General Public License v3.0
447 stars 132 forks source link

Deprecate IndexerInterface#index(Iterable<StorageInputStream>)? Optimize batch indexing? #150

Open Enet4 opened 8 years ago

Enet4 commented 8 years ago

Indexer interfaces currently need to implement 2 methods related to indexation:

public Task<Report> index(StorageInputStream file);

public Task<Report> index(Iterable<StorageInputStream> files);

The second one is only an aggregation of multiple indexation tasks, which brings no advantages when performed on the plugin itself (it's also an open door to mistakes). I propose that we deprecate it in one of our future minor or major revisions of Dicoogle, so that it can be smoothly deleted.

This should also favour a balanced usage of threads for tasks after we solve #140.

Enet4 commented 8 years ago

I have talked to @fmgvalente on the subject, and he argued that index(Iterable<_>) can improve performance because a plugin implementation can commit the documents to the database in a batch (as in, performing commit once per batch instead of once per file).

It's a tricky decision. On the one hand, a batched commit can make a significant impact on performance if the I/O is the bottleneck. On the other hand, not every case of batched indexation are effectively interpreted as a batched indexation (we got #139 for an equivalent reason), and the responsibility of parallelizing a batched indexation task is not that well outlined (should be either the plugin or the plugin controller, preferably not both). Right now, if I want 4 threads to index a directory with the CBIR indexer, I need to call the index service 4 times on 4 partitions of the directory. Also note that I/O may no longer be the bottleneck when it comes to visual feature extraction.

Making this issue a question so that we make a decision together. We have at least 3 choices:

Enet4 commented 2 months ago

This has been mitigated at the plugins by accumulating indexing requests and flushing them in batches internally. This is best accompanied by plugin shutdown enabled so that the last pieces of data can be indexed before shutting down. Not sure if we are moving forward with this right now.

bastiao commented 2 months ago

We may close this issue, as there is no real impact at current use cases?