Closed BrightXiaoHan closed 2 years ago
Thanks for the effort, it's a useful feature if we get it working. I tried a couple of filtering cases with it, and the code runs nicely, also with VariKN and eflomal models. However, there's something wrong, as the number of output lines I get is different with and without parallel processing. With very small inputs, it seems that parallel gives no output at all. I haven't yet tried to find out what causes these.
Other issues:
- The "score" step fails immediately due to not having an "outputs" argument. (I haven't yet tested "preprocess".)
- Pylint gives 10 complaints on the new code; those would be nice to have fixed.
- Please rebase this on the latest develop branch.
Thanks for your reply. I will work on these issues soon.
It passes all test cases on my local run, but it seems that the CI environment has no varikn installed.
Yes, I haven't bothered to add the external libraries to the CI tests. You should add

@unittest.skipIf('varikn' not in globals(), 'varikn package not installed')

before the test, and also catch the possible import error; see tests/test_lm_filter.py.
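Put together, the suggested pattern might look like the following minimal sketch (the test class and its body are hypothetical placeholders; the real usage is in tests/test_lm_filter.py):

```python
import unittest

# Optional dependency: varikn may be missing, e.g. in the CI environment.
# Catch the ImportError so the test module still loads without it.
try:
    import varikn
except ImportError:
    pass

class TestVariKNModel(unittest.TestCase):
    # Skip the test instead of failing when varikn is not importable
    @unittest.skipIf('varikn' not in globals(), 'varikn package not installed')
    def test_model_training(self):
        # Hypothetical body; a real test would train and query a VariKN model
        self.assertTrue(varikn is not None)
```

With this, environments without varikn report the test as skipped rather than as a collection error.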
These issues have been resolved.
I don't quite understand why the end result would still be different after providing priors. Even if you don't use multiple processes, the data will still be divided into chunks by the "chunksize" parameter.
The priors are, as the name says, prior parameters calculated from previous data, not the final model parameters, which are updated with the new data. I noticed the problem when I ran the otherwise same configuration with n_jobs = 1 and n_jobs = 4 and got very different filtered outputs.
You are absolutely right that the chunking in score and filterfalse has the same effect. I have to admit that I hadn't thought of that so far. The normal filtering is not affected, as it does not use chunking.
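To illustrate the issue with a toy sketch (this is not the eflomal or OpusFilter code): if a scorer updates its internal state as it consumes data, then giving each worker a fresh copy of the model for its chunk produces different scores than scoring everything with one sequentially updated model.

```python
class StatefulScorer:
    """Toy scorer whose score depends on the statistics seen so far."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def score_chunk(self, chunk):
        scores = []
        for x in chunk:
            mean = self.total / self.count if self.count else 0.0
            scores.append(x - mean)  # score depends on accumulated state
            self.total += x          # model parameters updated with new data
            self.count += 1
        return scores

def score_sequential(data):
    # One model sees all the data in order (n_jobs = 1, no chunking)
    return StatefulScorer().score_chunk(data)

def score_chunked(data, chunksize):
    # Each chunk gets a fresh model, as in an independent worker process
    scores = []
    for i in range(0, len(data), chunksize):
        scores.extend(StatefulScorer().score_chunk(data[i:i + chunksize]))
    return scores
```

For example, `score_sequential([1, 2, 3, 4])` gives `[1, 1, 1.5, 2]` while `score_chunked([1, 2, 3, 4], 2)` gives `[1, 1, 3, 1]`, so a score threshold applied afterwards can keep different lines in the two settings.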
As this problem also affects things other than parallel processing, I'll decide what to do with it separately. I'll merge this PR now. Thank you for your efforts and patience!
Related to #28. A draft implementation. If this idea is OK, I will add some test cases. I am trying to add an "n_jobs" parameter to parallelize the functions preprocess, score, and filter. An example config like this:
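(The actual example config did not survive in this scraped thread. Purely as a hypothetical sketch of what such a step might look like in an OpusFilter YAML configuration — the file names, the filter choice, and the placement of n_jobs are all assumptions, not taken from the PR:)

```yaml
steps:
  - type: score
    parameters:
      inputs: [input.en.gz, input.fi.gz]   # hypothetical input files
      outputs: [scores.jsonl.gz]           # hypothetical output file
      filters:
        - LengthRatioFilter:               # illustrative filter choice
            unit: word
      n_jobs: 4                            # proposed parameter: number of workers
```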