choosehappy / HistoQC

HistoQC is an open-source quality control tool for digital pathology slides
BSD 3-Clause Clear License

Feature Request: Computation Time #308

Open koellerMC opened 2 weeks ago

koellerMC commented 2 weeks ago

Dear HistoQC Team,

I have a feature request: would it be possible to add a computation time metric for each image? That would make it possible to investigate the compute cost of the different modules, evaluate the best settings for large-scale application of HistoQC (e.g. 10k+ WSIs), and potentially design a staged approach for the different quality metrics within a QC pipeline.

All the best! MK

jacksonjacobs1 commented 2 weeks ago

Great feedback!

Out of curiosity, how quickly does HistoQC process each image in your use case? We typically see ~30 seconds per image.

On the development side, we typically measure module computation time using a Python process profiler such as py-spy.
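
For reference, py-spy can launch and sample the whole run from the command line, e.g. `py-spy record -o histoqc_profile.svg -- python -m histoqc <your usual arguments>`; the resulting flamegraph shows roughly where time is spent per module. (The HistoQC invocation after `--` is just illustrative here.)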

For normal users, it's unclear how performance information would be embedded into the existing HistoQC output. I think the best, simplest option would be to log performance info at the DEBUG level.

In this proposed implementation, DEBUG logs can be forwarded to a .txt file when the user passes the --debug flag (#301) to HistoQC.
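
A rough sketch of what that could look like (the logger name and output path are placeholders, not HistoQC's actual internals):

```python
import logging

# Sketch only: mirror DEBUG records to a text file when --debug is passed.
logger = logging.getLogger("histoqc")
logger.setLevel(logging.DEBUG)

handler = logging.FileHandler("histoqc_debug.txt")  # hypothetical output path
handler.setFormatter(logging.Formatter("%(asctime)s\t%(levelname)s\t%(message)s"))
logger.addHandler(handler)

# Per-module timing records would then land in the file alongside other DEBUG output:
logger.debug("BlurDetectionModule took 6.312 s")
```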

@choosehappy Your thoughts?

koellerMC commented 2 weeks ago

Hi @jacksonjacobs1

So far I have not tracked it. However, once we have done some proper testing I will come back with some metrics. We will use the DEBUG option as proposed. Thanks for the fast reply!

BR MK

jacksonjacobs1 commented 2 weeks ago

FYI the --debug option does not currently cause performance info to be logged.

choosehappy commented 2 weeks ago

Generally speaking, having more performance metrics is likely a good thing, so that folks can make more educated decisions about which modules to include, or where particular hiccups may be. This may even reduce our support overhead if we can enable folks to serve themselves.

Adding module-level timing should be trivial, since the modules are called dynamically in a for loop; simply wrapping that call in some timing code would easily get the job done.
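
A minimal sketch of that idea, with stand-in names (`process_queue`, `example_module`, `s`) rather than HistoQC's exact internals:

```python
import logging
import time

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("histoqc")

# Stand-ins for HistoQC's internals: in the real pipeline the module list and
# the per-image state dict `s` come from the config; these names are assumptions.
def example_module(s, params):
    time.sleep(0.1)  # pretend work

process_queue = [(example_module, {})]
s = {"filename": "slide_001.svs"}

# The idea: wrap the existing dynamic dispatch loop in timing code.
for module_func, params in process_queue:
    start = time.perf_counter()
    module_func(s, params)
    elapsed = time.perf_counter() - start
    logger.debug("%s\t%s\t%.3f s", s["filename"], module_func.__name__, elapsed)
```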

I think the open question for me is the one Jackson brought up -- where and how do we report these metrics in an immediately usable way? It would be a shame to store them in a format that doesn't directly let folks answer their questions and requires reformatting because we didn't think it through.

At bare minimum I could imagine a few end deliverables: (a) a pie chart of the compute time breakdown for a single WSI (maybe this itself is the output of a "timings" module?), and (b) a pie chart of the compute across all WSIs; a sketch of (a) follows.
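
For illustration, a sketch of (a) with made-up module names and numbers, not real HistoQC output:

```python
import matplotlib.pyplot as plt

# Illustrative per-module timings (seconds) for one WSI.
timings = {
    "BaseImage": 4.1,
    "BlurDetectionModule": 6.3,
    "ClassificationModule": 12.7,
    "SaveModule": 2.9,
}

plt.pie(list(timings.values()), labels=list(timings.keys()), autopct="%1.1f%%")
plt.title("Compute time per module, single WSI")
plt.savefig("timings_pie.png")
```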

As we think about transitioning to something like a SQLite database, it becomes much more obvious how to store this (a separate table). Perhaps in CSV land, a separate file makes sense?
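
A minimal sketch of the separate-table idea, with an illustrative database file and schema:

```python
import sqlite3

# Hypothetical results database; schema and names are illustrative only.
con = sqlite3.connect("histoqc_results.db")
con.execute(
    "CREATE TABLE IF NOT EXISTS timings ("
    "  filename TEXT,"
    "  module   TEXT,"
    "  seconds  REAL)"
)
con.execute(
    "INSERT INTO timings VALUES (?, ?, ?)",
    ("slide_001.svs", "BlurDetectionModule", 6.312),
)
con.commit()

# Queries like per-module totals across all WSIs then fall out directly:
for row in con.execute("SELECT module, SUM(seconds) FROM timings GROUP BY module"):
    print(row)
con.close()
```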