PoonLab / OpenRDP

An open-source re-implementation of the RDP4 recombination detection program
GNU General Public License v3.0
45 stars 9 forks source link

Parallel computing of triplets #85

Open ArtPoon opened 2 months ago

ArtPoon commented 2 months ago

Since many of the recombination detection methods operate on combinations of three sequences (triplets), it should be relatively straight-forward to process these in parallel across multiple threads. I personally prefer implementing this via message passing interface (MPI) but that would add a dependency and it would be more conventional to use Python's built-in multiprocessing module.

ArtPoon commented 2 months ago
ArtPoon commented 1 month ago

I think this is the right entry point for parallel computing: https://github.com/PoonLab/OpenRDP/blob/bb0200092d6245cfd11784a1e7fd081a59507de1/openrdp/__init__.py#L270-L278

ArtPoon commented 1 month ago

After parallel processing, use gather to collect results for one process; for example: https://github.com/PoonLab/covizu/blob/12a9d08b70d754a30b72d84636affb28d3934765/covizu/clustering.py#L325-L332

WilliamZekaiWang commented 1 month ago

I think I've done it in the new iss85 branch: https://github.com/PoonLab/OpenRDP/blob/4645e4764d2d47fde957cccc1d944ac7ca36db9c/openrdp/__init__.py#L276-L300 In regards to gather, I'm not too sure how to apply it in this scenario? The tmethod.execute(triplet) doesn't return anything as explicitly as the bootstrap example in the covizu.

I tested openrdp on long.fasta under the test directory on both my changes and master branch, and they both gave the same output.

WilliamZekaiWang commented 1 month ago

I think I have now correctly implemented the gather, although I'm not sure my implementation is the most efficient. I made it so it does merge_breakpoints() in each process, then gathers them afterwards in the root.

I have ran the code with 1-16 cores and got the same output in all scenarios (the output was sometimes in different orders)

WilliamZekaiWang commented 1 month ago

I made a test fasta file with 10 sequences, each with 1000 nucleotides and ran it on 40 cores vs 1 core. The outputs were the same, but listed in different orders. Timing below

40:
real    0m20.553s
user    14m16.015s
sys     0m43.630s
1:
real    1m33.892s
user    1m40.652s
sys     0m0.380s

40 seems to be quite a bit faster. The only thing is that it vomits a bunch of text

ArtPoon commented 1 month ago

3Seq and GENECONV are running on the entire alignment as input files, so those should not be called from MPI for now. We would need to modify the source code to pass Python objects to C for this to work for these binaries, or port these methods into Python.

ArtPoon commented 1 month ago

Replace multiprocessing in bootscan with MPI for consistency

WilliamZekaiWang commented 1 month ago

I am currently in the process but I've run into an issue:

I've removed all parts of multiprocessing and changed it to MPI4PY in the iss85 branch. The outputs are different.

master branch:
Method          Start   End     Recombinant     Parent1 Parent2 Pvalue
------------------------------------------------------------------------
Geneconv        1       204     Test2           Test3   -       2.00E-05
Geneconv        151     195     Test1           Test3   -       2.10E-03
Geneconv        203     507     Test1           Test2   -       8.29E-03
Geneconv        539     759     Test1           Test2   -       1.54E-01
Geneconv        151     193     Test4           -       -       2.20E-02
Geneconv        56      170     Test1           -       -       2.73E-02
Bootscan        100     160     Test2           Test3   Test4   2.29E-02
3Seq            202     787     Test3           Test1   Test2   5.98E-10
3Seq            181     787     Test2           Test3   Test4   5.29E-06
RDP             6       433     Test1           Test3   Test4   9.28E-17
RDP             6       412     Test2           Test3   Test4   1.04E-15
mpi4py
Method          Start   End     Recombinant     Parent1 Parent2 Pvalue
------------------------------------------------------------------------
Geneconv        1       204     Test2           Test3   -       2.00E-05
Geneconv        151     195     Test1           Test3   -       2.10E-03
Geneconv        203     507     Test1           Test2   -       8.29E-03
Geneconv        539     759     Test1           Test2   -       1.54E-01
Geneconv        151     193     Test4           -       -       2.20E-02
Geneconv        56      170     Test1           -       -       2.73E-02
Bootscan        180     860     Test1           Test2   Test3   1.04E-13
Bootscan        180     860     Test1           Test2   Test4   6.06E-13
Bootscan        100     160     Test2           Test3   Test4   1.35E-02
3Seq            202     787     Test3           Test1   Test2   5.98E-10
3Seq            181     787     Test2           Test3   Test4   5.29E-06
RDP             6       433     Test1           Test3   Test4   9.28E-17
RDP             6       412     Test2           Test3   Test4   1.04E-15

looking deeper into the issue, it seems that the dt_matrix_file in both cases are the exact same. The only difference is the BootScan.raw_results

master
[('Test1 ', ('Test2', 'Test3'), 180, 860, 0.22208900176843166), ('Test2', ('Test1 ', 'Test3'), 100, 160, 0.5090064337856215), ('Test2', ('Test1 ', 'Test3'), 180, 860, 0.8755198548946761), ('Test3', ('Test1 ', 'Test2'), 100, 160, 0.058882580307752286), ('Test1 ', ('Test2', 'Test4'), 180, 860, 0.3969463145457672), ('Test2', ('Test1 ', 'Test4'), 180, 860, 0.6727008886719116), ('Test2', ('Test3', 'Test4'), 100, 160, 0.022905243454938964), ('Test3', ('Test2', 'Test4'), 100, 160, 0.4025393897717188)]

mpi4py
[('Test1 ', ('Test2', 'Test3'), 180, 860, 1.042518309852707e-13), ('Test2', ('Test1 ', 'Test3'), 100, 160, 0.5543662225935952), ('Test2', ('Test1 ', 'Test3'), 180, 860, 0.8063583710006824), ('Test3', ('Test1 ', 'Test2'), 100, 160, 0.058882580307752286), ('Test1 ', ('Test2', 'Test4'), 180, 860, 6.061565956710262e-13), ('Test2', ('Test1 ', 'Test4'), 180, 860, 0.587671445705755), ('Test2', ('Test3', 'Test4'), 100, 160, 0.013548600500603775), ('Test3', ('Test2', 'Test4'), 100, 160, 0.2693843515674098)]
WilliamZekaiWang commented 1 month ago

I have found the issue, I believe it could've been incorrectly selecting a config file for bootscan. I have looked at the outputs and everything seems to line up without using multiprocessing

ArtPoon commented 1 month ago

For testing in an MPI environment, I think we should have the root node communicate the pooled results to all other processes, so that the unit tests pass for everyone

WilliamZekaiWang commented 1 month ago

added a new unittest to test openrdp overall for all methods in an mpi environment.

successfully passes it when compared the output in the master branch on the tests/long.fasta file for all cores

I'd also like to add here that for threeseq, on my local device, one of the outputs differs with the same output from when I run it on paphlagon by 1 value in the last decimal:

local: "('Test3', ('Test1', 'Test2'), '202', '787', '5.982095e-10')"
pap:   "('Test3', ('Test1', 'Test2'), '202', '787', '5.982096e-10')"
ArtPoon commented 1 month ago

Let's use an AlmostEqual test to that level of precision, that's acceptably close to the expected answer