Closed markmipt closed 3 years ago
Hi Mark,
Thanks for making us aware of this.
It seems people sometimes come to the same conclusion around the same time. A few days ago we came to the same conclusion when facing a similar issue.
Unfortunately I currently do not have the time to fix this. I do have an idea of how to fix it (do library predictions outside of multiprocessing). The current plan is to fix this mid-July. I will keep this issue open in case anyone wants to add anything to the discussion before then.
Thanks again!
Robbin
Update: Temporarily fixed with your suggestion (v0.1.33). I will discuss this with a software developer in our team who will likely know a more elegant solution.
Update: After consulting with our developer, he suggested using a global variable. This has been implemented since v0.1.35.
Hi Robbin!
DeepLC predictions are slow when the library is incomplete (`--use_library` option). For example, if I make predictions for 2000 sequences and 10 of them are not present in the library, the prediction takes much longer than predicting the same 2000 sequences without `--use_library`. We have analysed your code and are fairly sure the problem is that the multiprocessing used in the `do_f_extraction_pd_parallel` method has a bottleneck: creating the child processes takes a long time because the `self.library` object is too big to copy quickly. I've tried a quick fix: in `deeplc.py`, add the line `del self.library` just before the comment `# If we need to apply deep NN`, and add `self.library = {}` followed by `if self.use_library: self.read_library()` right before the line `# If we need to calibrate`.
This fix makes things much better, but it is still not optimal, since the library will be reloaded with every call of the `make_preds_core` function. I don't have an idea for an optimal change right now, which is why I haven't prepared a pull request.
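The quick-fix pattern described above can be sketched in isolation like this (a minimal illustration, not the real DeepLC code; the `Predictor` class and placeholder bodies are hypothetical): drop the large attribute before the multiprocessing step so child processes never copy it, then rebuild it afterwards for the calibration step.

```python
class Predictor:
    """Toy stand-in for the DeepLC class, to illustrate the fix pattern."""

    def __init__(self, use_library=False):
        self.use_library = use_library
        self.library = {}

    def read_library(self):
        # Stand-in for DeepLC's actual library loader.
        self.library = {"PEPTIDE": 12.3}

    def make_preds(self, sequences):
        # Before the parallel step ("# If we need to apply deep NN"):
        # drop the big dict so spawning workers does not copy it with self.
        del self.library

        # Placeholder for the actual parallel feature extraction/prediction.
        preds = [len(s) for s in sequences]

        # Before calibration ("# If we need to calibrate"):
        # restore the attribute, reloading the library if it is needed.
        self.library = {}
        if self.use_library:
            self.read_library()
        return preds
```

The drawback is visible in the sketch: `read_library()` runs again on every call, which is exactly the repeated-reload cost mentioned above.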
Regards, Mark