chapmanb / bcbb

Incubator for useful bioinformatics code, primarily in Python and R
http://bcbio.wordpress.com

FastQC vs SolexaQA #25

tanglingfung closed this issue 13 years ago

tanglingfung commented 13 years ago

Brad, It's not really an issue, but I wanted to ask: from your experience, how much time do you save by switching from SolexaQA to FastQC?

Thanks, Paul

tanglingfung commented 13 years ago

Also, our system appears to become overloaded during recalibration with GATK, and everything slows down a lot afterwards. Do you have any suggestions on this?

Thanks a lot! Paul

chapmanb commented 13 years ago

Paul; FastQC is much faster than SolexaQA. For large 100bp paired-end HiSeq runs, SolexaQA was taking upwards of 8 hours; FastQC runs in a couple of hours and has more useful details like overrepresented kmers and sequences. Sorry for the change, but hopefully this makes it easier to install and faster going forward.

For your GATK issues, how much memory does your machine have? Perhaps our current memory settings (6Gb per process) are too high and are causing swapping. I could make that configurable if it would be helpful. Thanks, Brad
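For illustration only, such a configurable setting might look something like the following post_process.yaml fragment; the key name here is purely hypothetical, since the option did not exist at the time:

```yaml
# Hypothetical post_process.yaml fragment -- the key name is illustrative,
# not a real option in the pipeline as it stood.
algorithm:
  java_memory: 4g  # per-process Java heap (the hard-coded value was 6Gb)
```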

tanglingfung commented 13 years ago

Brad,

I actually like the change; I am just concerned about consistency for some of our projects. But I agree with you that FastQC has more useful details.

Our machine currently has 48G of RAM with 2x12 cores. I guess it would be more reasonable to add more RAM in our case. May I ask about the configuration of your system, and how much time it takes to go through the whole pipeline (e.g. for a SNP calling analysis)?

Thanks, Paul

By the way, we are working on a script that creates symbolic links to fastq files in different flowcell directories and puts them in a virtual one to be processed by automated_initial_analysis.py. I think that would be useful for samples run across multiple flowcells.
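Such a script might start from a minimal sketch like this (the function name and file-naming convention are illustrative, not the actual script):

```python
import os

def make_virtual_flowcell(flowcell_dirs, virtual_dir):
    """Symlink fastq files from several flowcell directories into a
    single 'virtual' flowcell directory for one pipeline run."""
    os.makedirs(virtual_dir, exist_ok=True)
    for fc_dir in flowcell_dirs:
        fc_name = os.path.basename(os.path.normpath(fc_dir))
        for fname in sorted(os.listdir(fc_dir)):
            if fname.endswith((".fastq", ".fastq.gz")):
                src = os.path.abspath(os.path.join(fc_dir, fname))
                # Prefix with the flowcell name so identical lane file
                # names from different runs do not collide.
                dest = os.path.join(virtual_dir, "%s_%s" % (fc_name, fname))
                if not os.path.lexists(dest):
                    os.symlink(src, dest)
```

The virtual directory can then be passed to automated_initial_analysis.py in place of a real flowcell directory.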

chapmanb commented 13 years ago

Paul; Glad you like the FastQC approach. It's really helpful for debugging and is actively developed for these large flowcells, which is really useful as well. You should grab the latest change I just checked in, which avoids LaTeX issues with some of the FastQC output -- percent signs and other characters need to be escaped for LaTeX.

We have 48G of RAM but only 8 cores; we need more processors to deal with the new HiSeq. If you're running up to 24 processes, you would expect some memory swapping on barcoded HiSeq lanes. I'll work on making that a configurable parameter; lowering it might slow down GATK but would at least save tons of swapping slowness.
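A quick back-of-the-envelope check of that swapping concern, using the numbers from this thread:

```python
# With a fixed setting of 6Gb per process, 24 simultaneous processes
# would request far more memory than a 48Gb machine physically has.
processes = 24
gb_per_process = 6
physical_gb = 48

demand_gb = processes * gb_per_process    # total memory requested
overcommit_gb = demand_gb - physical_gb   # shortfall that must come from swap
```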

Full barcoded SNP calling analyses can take a couple of days to process on that machine; it can be even longer if you have lots of barcodes as you need to wait for cores.

Let me know when you have your script finished. I'd be very happy to link to it from the documentation or include it as a utility for others. Thanks again, Brad

tanglingfung commented 13 years ago

Thanks.

I may have made a mistake on the number of cores; maybe ours is also 8-core. But we are mostly handling 2x100bp runs, and I cannot imagine the computational challenge when Illumina triples the throughput of the HiSeq this summer. I was wondering whether it would be better to move the pipeline to a cluster or to dedicate different tasks to different (but physically attached) servers. I noticed that the demand for CPU and RAM varies a lot throughout the pipeline, so we are thinking about how to best utilize our compute resources.

By the way, what's the best practice for using git to stay up to date with the pipeline? I'm interested in contributing utility scripts once they have settled down.

Best, Paul


brainstorm commented 13 years ago

Hello Paul,

I'm also running Brad's pipeline in production. I use a rather naïve approach to launch the automatic initial analysis, but so far it has worked acceptably well (~2 days of processing per run on average). It consists of putting a wrapper in place in post_process.yaml:

(...)
analysis:
  process_program: illumina_run_batch.sh

instead of the default "automated_initial_analysis.py"

illumina_run_batch.sh queues the job on a cluster and launches the analysis on a single machine, using all 8 cores.

All this assumes that you have a "beowulf"-type cluster in place, together with a batch queueing system (perhaps you can ask your IT staff?):

http://en.wikipedia.org/wiki/Beowulf_(computing)
http://en.wikipedia.org/wiki/Batch-queuing_system

As I said, this is just a hack; better ways to parallelize/optimize this still need to be worked out.
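For concreteness, such a wrapper might look like the sketch below (written here in Python rather than shell; the qsub flags are SGE-style and, like the slot count, are assumptions to adapt to your own scheduler):

```python
#!/usr/bin/env python
"""Sketch of an illumina_run_batch.sh-style wrapper: post_process.yaml's
analysis: process_program points at this script, which submits the usual
pipeline entry point to a batch queue instead of running it locally."""
import os
import subprocess
import sys

def build_submit_cmd(run_dir, cores=8):
    # Queue one job that runs the whole analysis on a single node,
    # using all of that node's cores.
    job_name = "illumina_%s" % os.path.basename(os.path.normpath(run_dir))
    return ["qsub", "-cwd", "-b", "y", "-pe", "threaded", str(cores),
            "-N", job_name,
            "automated_initial_analysis.py", "post_process.yaml", run_dir]

if __name__ == "__main__" and len(sys.argv) == 2:
    subprocess.check_call(build_submit_cmd(sys.argv[1]))
```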

Regarding the best practice for using git, I would recommend "forking" Brad's repository by following this guide:

http://help.github.com/fork-a-repo/

Once you're happy with your changes, you can send "pull requests" to Brad:

http://help.github.com/pull-requests/

Hope it all helps ! ;)

chapmanb commented 13 years ago

Paul; Yes, the computational demands are a challenge. Luckily it was written to be parallelizable, but to handle the new HiSeq it'll need more cores as you suggest. Clusters are one possibility; I'd be happy to hear what you come up with.

Roman is spot on with his GitHub suggestions. Once you make a fork you can keep a repository of your own scripts in utils or wherever, and we can merge them back into the main trunk. While you are developing you can keep pulling in changes from the main repository, and git will help with merging differences.

Thanks guys.

tanglingfung commented 13 years ago

Thanks Roman and Brad.

Yes, I think the script is doing very well for a single flowcell on 8 cores and 48G of RAM; it finishes in 2-3 days, no problem with that. I am also looking into a "beowulf"-type cluster. It makes configuration easier by keeping the OS in sync across the servers.

Thanks again for all the advice and help here!

Best, Paul

tanglingfung commented 13 years ago

Brad,

I have just tried the current version of the pipeline. However, the text from FastQC is still weird and the subtitle is missing.

Paul

chapmanb commented 13 years ago

Paul; I'm not sure what you mean, can you be more specific on the problems you're seeing? I didn't add in captions on the figures for FastQC, if that's what you mean, as they have more useful titles than the previous plots. What text problems are you encountering?

tanglingfung commented 13 years ago

Brad,

Sorry for being unclear. I found that it's a problem in FastQC itself: the text problem I was seeing also appears in the png exported by FastQC. Sorry about that.

Thanks, Paul


tanglingfung commented 13 years ago

We're getting stable with the pipeline now and have plans to move the analysis part to the cluster. I have also started to fork the repository, and hopefully I can start contributing back to pipeline development.

Thanks again for all the help in the past few months.

chapmanb commented 13 years ago

Paul; Thanks for the message. That's great to hear -- really happy things are working out. Let me know when you have changes to merge back in. Thanks again, Brad