hxin opened 6 years ago
I tested the performance of parallel runs with samples from the glucose project.
Running map_reads one sample at a time with 16 cores takes around 17 hours for all 48 samples mapped against mouse/rat; that is 96 STAR runs, i.e. roughly 10 minutes per run. At that rate, 4 samples should take roughly 40 minutes, and 8 samples roughly 80 minutes.
Compared to those numbers, the table below shows a large speed-up when multiple samples are run in parallel, so I think this is worth discussing further. @lweasel
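The per-run estimate above can be sanity-checked with a quick back-of-envelope calculation (assuming, as stated, 48 samples each mapped against both mouse and rat):

```python
# Back-of-envelope check of the serial timing quoted above.
total_hours = 17        # wall time for the full serial pass
samples = 48
species = 2             # each sample is mapped against mouse and rat
star_runs = samples * species
mins_per_run = total_hours * 60 / star_runs
print(star_runs, round(mins_per_run, 1))  # 96 runs at ~10.6 minutes each
```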
Samples | Core(s)/sample | Time (minutes) |
---|---|---|
4(C1 C2 C3 C4) | 1 | 54.4166 |
4(C1 C2 C3 C4) | 2 | 29.4833 |
4(C1 C2 C3 C4) | 4 | 18.5833 |
4(C1 C2 C3 C4) | 8 | 15.0000 |
4(C1 C2 C3 C4) | 16 | 15.5166 |
8(C1 C2 D3 B1 B2 B4 D4 D1 C3 C4 B3 D2 ) | 1 | 61.9333 |
8(C1 C2 D3 B1 B2 B4 D4 D1 C3 C4 B3 D2 ) | 2 | 34.9000 |
8(C1 C2 D3 B1 B2 B4 D4 D1 C3 C4 B3 D2 ) | 4 | 19.1833 |
8(C1 C2 D3 B1 B2 B4 D4 D1 C3 C4 B3 D2 ) | 8 | 18.5666 |
8(C1 C2 D3 B1 B2 B4 D4 D1 C3 C4 B3 D2 ) | 16 | 19.8500 |
Sample sizes: C1 3.4G, C2 3.4G, D3 3.9G, B1 3.3G, B2 3.7G, B4 3.4G, D4 3.9G, D1 4.0G, C3 3.4G, C4 3.4G, B3 3.4G, D2 4.3G
Also, interestingly, the time needed changes little from 8 cores to 16 cores, so in terms of STAR runs, in the current implementation, using 8 cores versus 16 cores makes little difference.
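The diminishing returns from 8 to 16 cores show up clearly if the 4-sample wall times are converted to speed-ups over the 1-core-per-sample baseline; a quick check in Python:

```python
# Wall times (minutes) for the 4-sample block (C1-C4) from the table above,
# keyed by cores per sample.
times = {1: 54.4166, 2: 29.4833, 4: 18.5833, 8: 15.0000, 16: 15.5166}
baseline = times[1]
speedup = {cores: round(baseline / t, 2) for cores, t in times.items()}
print(speedup)  # speed-up plateaus around 3.6x: 8 cores gives 3.63, 16 gives 3.51
```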
Further tests were done to measure the performance of the implementation.
The parallelized code was added to sort_reads/map_reads/filter_reads, and I ran Sargasso on the same 6 samples/2 species with two different configurations:
Samples | Core(s)/sample | Max cores | Run 1 (minutes) | Run 2 (minutes) |
---|---|---|---|---|
C1 C2 C5 C6 C9 C10 | 16 | 16 | 251 | 286 |
C1 C2 C5 C6 C9 C10 | 8 | 16 | 204 | 264 |
The improvement is not as large as in the individual tests of sort_reads/map_reads/filter_reads, where the parallel version generally takes half the time with an 8-cores-per-sample, 16-max-cores setup.
This may be because performance is limited by I/O rather than processing speed when these jobs run together.
Overall, though, parallelizing the sort_reads/map_reads/filter_reads stages does give a speed-up.
row_number | job | core | total_core | run | time | time_per_run | sample_per_run | total_sample |
---|---|---|---|---|---|---|---|---|
1 | filtered | 10 | 16 | 7.5 | 108.2833 | 13.535412 | 1.600000 | 12 |
2 | filtered | 12 | 16 | 9.0 | 57.1666 | 6.351844 | 1.333333 | 12 |
3 | filtered | 12 | 32 | 4.5 | 56.4833 | 11.296660 | 2.666667 | 12 |
4 | filtered | 16 | 16 | 12.0 | 110.0833 | 9.173608 | 1.000000 | 12 |
5 | filtered | 16 | 32 | 6.0 | 55.6333 | 9.272217 | 2.000000 | 12 |
6 | filtered | 2 | 16 | 1.5 | 75.4500 | 37.725000 | 8.000000 | 12 |
7 | filtered | 4 | 16 | 3.0 | 70.2000 | 23.400000 | 4.000000 | 12 |
8 | filtered | 4 | 32 | 1.5 | 65.5333 | 32.766650 | 8.000000 | 12 |
9 | filtered | 6 | 16 | 4.5 | 74.4166 | 14.883320 | 2.666667 | 12 |
10 | filtered | 8 | 16 | 6.0 | 70.2500 | 11.708333 | 2.000000 | 12 |
11 | filtered | 8 | 32 | 3.0 | 71.5333 | 23.844433 | 4.000000 | 12 |
12 | mapped | 12 | 16 | 9.0 | 75.2833 | 8.364811 | 1.333333 | 12 |
13 | mapped | 12 | 32 | 4.5 | 72.7000 | 14.540000 | 2.666667 | 12 |
14 | mapped | 16 | 16 | 12.0 | 148.5666 | 12.380550 | 1.000000 | 12 |
15 | mapped | 16 | 32 | 6.0 | 77.3833 | 12.897217 | 2.000000 | 12 |
16 | mapped | 2 | 16 | 1.5 | 158.8500 | 79.425000 | 8.000000 | 12 |
17 | mapped | 4 | 16 | 3.0 | 97.6166 | 32.538867 | 4.000000 | 12 |
18 | mapped | 4 | 32 | 1.5 | 110.5833 | 55.291650 | 8.000000 | 12 |
19 | mapped | 8 | 16 | 6.0 | 80.5500 | 13.425000 | 2.000000 | 12 |
20 | mapped | 8 | 32 | 3.0 | 77.0166 | 25.672200 | 4.000000 | 12 |
21 | sorted | 12 | 16 | 9.0 | 34.6000 | 3.844444 | 1.333333 | 12 |
22 | sorted | 12 | 32 | 4.5 | 33.5333 | 6.706660 | 2.666667 | 12 |
23 | sorted | 16 | 16 | 12.0 | 49.7666 | 4.147217 | 1.000000 | 12 |
24 | sorted | 16 | 32 | 6.0 | 27.4166 | 4.569433 | 2.000000 | 12 |
25 | sorted | 2 | 16 | 1.5 | 76.8000 | 38.400000 | 8.000000 | 12 |
26 | sorted | 4 | 16 | 3.0 | 45.8666 | 15.288867 | 4.000000 | 12 |
27 | sorted | 4 | 32 | 1.5 | 48.1666 | 24.083300 | 8.000000 | 12 |
28 | sorted | 8 | 16 | 6.0 | 41.9833 | 6.997217 | 2.000000 | 12 |
29 | sorted | 8 | 32 | 3.0 | 39.2000 | 13.066667 | 4.000000 | 12 |
```r
a="filtered/10/16/run/7.5
filtered/10/16/time/108.2833
filtered/12/16/run/9.0
filtered/12/16/time/57.1666
filtered/12/32/run/4.5
filtered/12/32/time/56.4833
filtered/16/16/run/12.0
filtered/16/16/time/110.0833
filtered/16/32/run/6.0
filtered/16/32/time/55.6333
filtered/2/16/run/1.5
filtered/2/16/time/75.4500
filtered/4/16/run/3.0
filtered/4/16/time/70.2000
filtered/4/32/run/1.5
filtered/4/32/time/65.5333
filtered/6/16/run/4.5
filtered/6/16/time/74.4166
filtered/8/16/run/6.0
filtered/8/16/time/70.2500
filtered/8/32/run/3.0
filtered/8/32/time/71.5333
mapped/12/16/run/9.0
mapped/12/16/time/75.2833
mapped/12/32/run/4.5
mapped/12/32/time/72.7000
mapped/16/16/run/12.0
mapped/16/16/time/148.5666
mapped/16/32/run/6.0
mapped/16/32/time/77.3833
mapped/2/16/run/1.5
mapped/2/16/time/158.8500
mapped/4/16/run/3.0
mapped/4/16/time/97.6166
mapped/4/32/run/1.5
mapped/4/32/time/110.5833
mapped/8/16/run/6.0
mapped/8/16/time/80.5500
mapped/8/32/run/3.0
mapped/8/32/time/77.0166
sorted/12/16/run/9.0
sorted/12/16/time/34.6000
sorted/12/32/run/4.5
sorted/12/32/time/33.5333
sorted/16/16/run/12.0
sorted/16/16/time/49.7666
sorted/16/32/run/6.0
sorted/16/32/time/27.4166
sorted/2/16/run/1.5
sorted/2/16/time/76.8000
sorted/4/16/run/3.0
sorted/4/16/time/45.8666
sorted/4/32/run/1.5
sorted/4/32/time/48.1666
sorted/8/16/run/6.0
sorted/8/16/time/41.9833
sorted/8/32/run/3.0
sorted/8/32/time/39.2000"
```
```r
require(tidyr)
require(dplyr)
require(reshape2)
require(ggplot2)

read.table(text = a, col.names = c('raw')) %>%
  # split "job/core/total_core/type/value" into separate columns
  tidyr::separate(raw, into = c('job', 'core', 'total_core', 'type', 'value'), sep = "/") %>%
  # spread the run/time rows into columns, one row per job/core/total_core
  reshape2::dcast(job + core + total_core ~ type, value.var = "value") %>%
  dplyr::mutate_at(vars(-job), as.numeric) %>%
  dplyr::mutate(time_per_run = time / ceiling(run),
                sample_per_run = total_core / core,
                total_sample = 12) %>%
  ggplot() +
  geom_point(aes(x = core, y = time, size = time_per_run, colour = total_core)) +
  facet_wrap(~ job)
```
Currently the function runs STAR one sample at a time using ${NUM_CORE} cores. We should test running more samples in parallel, using fewer cores for each sample.
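The proposed scheduling can be sketched as follows (a hypothetical helper, not the current Sargasso code): divide the total core budget into slots of cores_per_sample each and run one sample per slot concurrently.

```python
def schedule(samples, max_cores, cores_per_sample):
    """Group samples into batches that can run concurrently,
    giving each sample cores_per_sample threads."""
    slots = max(1, max_cores // cores_per_sample)
    return [samples[i:i + slots] for i in range(0, len(samples), slots)]

# With 16 cores total and 8 cores per sample, two samples run at once:
batches = schedule(['C1', 'C2', 'C3', 'C4'], max_cores=16, cores_per_sample=8)
print(batches)  # [['C1', 'C2'], ['C3', 'C4']]
```

Each batch's samples would then be launched together, with each STAR invocation given cores_per_sample threads (e.g. via its --runThreadN option); command construction is omitted here since paths and options are project-specific.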