bmvdgeijn / WASP

WASP: allele-specific pipeline for unbiased read mapping and molecular QTL discovery
Apache License 2.0
102 stars, 51 forks

update_total_depth takes too long #96

Closed lw157 closed 4 years ago

lw157 commented 4 years ago

Hi Bryce, I have about 50 samples, and update_total_depth.py takes too long. From my test runs, it takes about 8-10 h to process each sample. I tried modifying the script to use multiple threads, but it used too much memory and would still need > 10 days (my estimate). This is because the spline-fitting step is quite time-consuming. Could I run the spline-fitting step for each sample individually, then combine all the fitted splines and finish the "update_total" step? My concern is: do you need GC/peakness information from all samples when estimating the splines for a single sample? If not, I can just submit 50 jobs and finish the analysis within a day.

Thanks a lot.

Liuyang

bmvdgeijn commented 4 years ago

Hi Liuyang,

There are a couple of things you might try. How many loci are you fitting on? You could use the skip option to fit on a subset of loci. You could also change the tolerance of the fit function. You should also be able to run each sample separately.

Best, Bryce

Sent from my iPhone
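Running each sample separately, as suggested above, amounts to launching one update_total_depth.py job per sample. A minimal sketch of that pattern follows; the per-sample file names are illustrative, and on a cluster the echoed command would be wrapped in sbatch/qsub rather than printed:

```shell
# Hypothetical sketch: one independent spline-fitting job per sample,
# instead of one long serial run over all 50 samples.
# File names are placeholders; check WASP's CHT docs for the exact arguments.
for sample in sampleA sampleB sampleC; do
  # On a cluster, replace `echo` with your scheduler's submit command.
  echo "python update_total_depth.py ${sample}.read_counts.h5 ${sample}.updated.h5"
done
```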


lw157 commented 4 years ago


Hi Bryce, Thanks for your quick response. For each sample, I have ~2 million loci. As long as the spline fit for each sample is independent of the others, it is feasible to run them separately.

Liuyang

bmvdgeijn commented 4 years ago

I would definitely suggest downsampling during the fitting process. A few thousand loci is plenty at that point. This can be achieved with the skip option.

Bryce

Sent from my iPhone
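The downsampling idea above can be sketched numerically: with ~2 million loci per sample, keeping only every Nth locus brings the fit down to a few thousand points. The numbers below are illustrative, and the actual behavior of the skip option is defined by WASP itself:

```python
# Sketch of the downsampling behind the skip option: thin ~2 million loci
# to a few thousand by keeping every Nth locus for the spline fit.
n_loci = 2_000_000        # loci per sample (from the thread)
target = 5_000            # "a few thousand loci is plenty"
skip = n_loci // target   # fit on 1 of every `skip` loci
fit_loci = list(range(0, n_loci, skip))
print(len(fit_loci))      # number of loci actually used for the fit
```

Because the fit cost scales with the number of loci, this cuts the spline-fitting work by roughly a factor of `skip` while still sampling loci evenly across the genome.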


lw157 commented 4 years ago

Hi Bryce, This makes sense. I had not paid attention to this option. It will definitely save a lot of execution time. Thanks a lot. Liuyang
