chapmanb / bcbb

Incubator for useful bioinformatics code, primarily in Python and R
http://bcbio.wordpress.com

Test suite has started taking a very long time to run #54

Closed: vals closed this issue 12 years ago

vals commented 12 years ago

Hi Brad, we have started to see really long run times for the test suite. Has the run time for the tests increased for you as well, or are we doing something odd? Running all the tests on a node with 8 cores takes about 4 hours.

Something new I have noted in the stdout from nosetests -s -v test_automated_analysis.py is thousands of lines of the form

INFO  17:41:53,026 TraversalEngine -  chr5:133867088        5.39e+07    2.4 h        2.7 m     74.0%         3.3 h    51.4 m 

These lines seem to come from a step that keeps working for several hours.

chapmanb commented 12 years ago

Valentine; The demultiplexing test does take a long time to run, although not four hours here. Are you using a different reference genome than the one in tests/data/genomes? The lines you're seeing above are from GATK processing, and the run time of some GATK walkers scales with the size of the reference genome, not the number of input fastqs. The test genomes are slimmed down to a couple of chromosomes to help with this.

Practically, you shouldn't need to run the whole test suite every time unless you are specifically working on multiplexing. The variant calling workflow tests most of the major functionality:

nosetests -v -s test_automated_analysis.py:AutomatedAnalysisTest.test_1_variantcall

and is what I run when adjusting anything with the processing pipeline.

vals commented 12 years ago

Brad; Thank you for the testing guidelines, I think we will look over how and what we test.

Regarding the reference genomes, is there a setting somewhere to specify them? I might have accidentally started using other ones, but I'm not finding where.

chapmanb commented 12 years ago

Valentine; The genomes are specified by the location files in the testing directory:

https://github.com/chapmanb/bcbb/blob/master/nextgen/tests/data/automated/tool-data/sam_fa_indices.loc

I asked because you mentioned 'chr5' in your log output. The test genomes are restricted subsets with just chr22 and chrM to make things run a bit faster.
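As a quick check, something like the following could flag a run that is using the wrong genome. It is only a sketch: the regular expression assumes GATK progress lines look like the TraversalEngine line quoted earlier in this thread, and the function name is made up for illustration.

```python
import re

# Assumption: progress lines have the form
#   "INFO  <time> TraversalEngine -  <contig>:<position>  ..."
# as in the line quoted in this thread.
PROGRESS_RE = re.compile(r"TraversalEngine\s+-\s+(?P<contig>[^\s:]+):\d+")

# Contigs present in the slimmed-down test genomes.
TEST_CONTIGS = {"chr22", "chrM"}


def unexpected_contigs(log_lines, expected=TEST_CONTIGS):
    """Return the set of contigs seen in progress lines that are not expected."""
    found = set()
    for line in log_lines:
        match = PROGRESS_RE.search(line)
        if match and match.group("contig") not in expected:
            found.add(match.group("contig"))
    return found
```

Feeding it the captured stdout of a test run (for example, the lines of a saved log file) should return an empty set when only the test genomes are in use.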

vals commented 12 years ago

Brad; Aha! While I was testing the distributed functionality I had pointed the tests to our production universe_wsgi.ini, which has the information about our RabbitMQ server in it (to avoid duplicating information). Looking around, it seems that the programs look for a tool-data directory in the same directory as the universe_wsgi.ini. So I'm guessing the tests used our full production reference genomes, which is why they took so long. Thank you for your help!
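For anyone who hits the same thing, here is a rough sketch of how to see which genomes a given universe_wsgi.ini would pull in. It assumes what I observed above (tool-data sits next to the ini file) and that the .loc files are tab-separated with '#' comment lines; the function names are my own, not part of the pipeline.

```python
import os


def tool_data_dir(ini_path):
    """Return the tool-data directory assumed to sit beside the ini file."""
    return os.path.join(os.path.dirname(os.path.abspath(ini_path)), "tool-data")


def loc_entries(loc_path):
    """Return the tab-separated fields of each non-comment line in a .loc file."""
    entries = []
    with open(loc_path) as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines and comments.
            if not line or line.startswith("#"):
                continue
            entries.append(line.split("\t"))
    return entries
```

Calling loc_entries on the sam_fa_indices.loc inside tool_data_dir("/path/to/universe_wsgi.ini") lists the reference entries that would be used, so a production path showing up there is an immediate warning sign.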