ccgd-profile / BreaKmer

A method to identify structural variation from sequencing data in target regions

Issues running example #10

Closed by skillcoyne 9 years ago

skillcoyne commented 9 years ago

I'd like to run BreaKmer on my tumor data sets, but I'm currently unable to run the example data. I'm not a Python developer either, so I'm not clear what the issue is. Any help would be welcome!

After setting up the breakmer example config with the following:

analysis_name=example
targets_bed_file=//tools/BreaKmer/example_data/genes.bed
sample_bam_file=//tools/BreaKmer/example_data/B2M.bam
analysis_dir=//tools/BreaKmer/example_data/example
reference_data_dir=//tools/BreaKmer/example_data/data/ref
cutadapt=//tools/cutadapt/cutadapt-1.8.1/bin/cutadapt
cutadapt_config_file=//tools/BreaKmer/example_data/cutadapt.cfg
jellyfish=//tools/Jellyfish/jellyfish-2.2.0i/jellyfish
blat=//tools/BLAT/blat
gfclient=//tools/BLAT/gfClient
gfserver=//tools/BLAT/gfServer
fatotwobit=//tools/faToTwoBit
reference_fasta=//tools/BreaKmer/ref/all.fa
gene_annotation_file=//tools/BreaKmer/refseq.bed
repeat_mask_file=//tools/BreaKmer/repeatmask.bed
kmer_size=15

And running "python breakmer.py example_data/breakmer.cfg"

I get the following errors in stdout, but no errors in the log file.

Traceback (most recent call last):
  File "breakmer.py", line 103, in <module>
    r = runner(config_d)
  File "/mnt/gaiagpfs/users/homedirs/skillcoyne/tools/BreaKmer/sv_processor.py", line 100, in __init__
    self.params = params(config_d)
  File "/mnt/gaiagpfs/users/homedirs/skillcoyne/tools/BreaKmer/utils.py", line 706, in __init__
    self.set_params()
  File "/mnt/gaiagpfs/users/homedirs/skillcoyne/tools/BreaKmer/utils.py", line 796, in set_params
    self.gene_annotations.add_genes(self.opts['gene_annotation_file'])
  File "/mnt/gaiagpfs/users/homedirs/skillcoyne/tools/BreaKmer/utils.py", line 970, in add_genes
    end = int(linesplit[5])
ValueError: invalid literal for int() with base 10: '+'

ryanabo commented 9 years ago

Hi, and thanks for trying out BreaKmer. Sorry for the trouble running the software. From the error, it looks like the gene annotation file you supplied is not in the expected format: the strand value ('+') sits in the column where BreaKmer expects the gene's end coordinate. The annotation file should look like the following:

bin name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds score name2 cdsStartStat cdsEndStat exonFrames

0 NM_032291 chr1 + 66999824 67210768 67000041 67208778 25 66999824,67091529,67098752,67101626,67105459,67108492,67109226,67126195,67133212,67136677,67137626,67138963,67142686,67145360,67147551,67154830,67155872,67161116,67184976,67194946,67199430,67205017,67206340,67206954,67208755, 67000051,67091593,67098777,67101698,67105516,67108547,67109402,67126207,67133224,67136702,67137678,67139049,67142779,67145435,67148052,67154958,67155999,67161176,67185088,67195102,67199563,67205220,67206405,67207119,67210768, 0 SGIP1 cmpl cmpl 0,1,2,0,0,0,1,0,0,0,1,2,1,1,1,1,0,1,1,2,2,0,2,1,1,
1 NM_032785 chr1 - 48998526 50489626 48999844 50489468 14 48998526,49000561,49005313,49052675,49056504,49100164,49119008,49128823,49332862,49511255,49711441,50162984,50317067,50489434, 48999965,49000588,49005410,49052838,49056657,49100276,49119123,49128913,49332902,49511472,49711536,50163109,50317190,50489626, 0 AGBL4 cmpl cmpl 2,2,1,0,0,2,1,1,0,2,0,1,1,0,
1 NM_018090 chr1 + 16767166 16786584 16767256 16785385 8 16767166,16770126,16774364,16774554,16775587,16778332,16782312,16785336, 16767348,16770227,16774469,16774636,16775696,16778510,16782388,16786584, 0 NECAP2 cmpl cmpl 0,2,1,1,2,0,1,2,
1 NM_052998 chr1 + 33546713 33585995 33547850 33585783 12 33546713,33546988,33547201,33547778,33549554,33557650,33558882,33560148,33562307,33563667,33583502,33585644, 33546895,33547109,33547413,33547955,33549728,33557823,33559017,33560314,33562470,33563780,33583717,33585995, 0 ADC cmpl cmpl -1,-1,-1,0,0,0,2,2,0,1,0,2,
...
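
For a quick check before running BreaKmer, something along these lines could confirm that a line follows the layout above. This is only a rough sketch and not part of BreaKmer: the function name check_annotation_line and both sample lines are made up for illustration, and it assumes whitespace-separated fields. In a BED-style file the sixth column holds the strand, which would explain the '+' in the ValueError.

```python
def check_annotation_line(line):
    """Return None if the line fits the layout shown above, else a short message.

    Assumption: fields are whitespace-separated and ordered
    bin, name, chrom, strand, txStart, txEnd, ...
    """
    fields = line.split()
    if len(fields) < 6:
        return "expected at least 6 columns, found %d" % len(fields)
    # Column 4 (index 3) should carry the strand.
    if fields[3] not in ("+", "-"):
        return "column 4 should be the strand (+ or -), found %r" % fields[3]
    # Columns 5 and 6 (indices 4 and 5) should be integer coordinates.
    for index, label in ((4, "txStart"), (5, "txEnd")):
        if not fields[index].isdigit():
            return "column %d (%s) should be an integer, found %r" % (
                index + 1, label, fields[index])
    return None


# Hypothetical lines for illustration only: the first follows the layout
# above (truncated after cdsEnd), the second uses BED-style column order.
refgene_line = "0\tNM_032291\tchr1\t+\t66999824\t67210768\t67000041\t67208778"
bed_line = "chr1\t66999824\t67210768\tNM_032291\t0\t+"

print(check_annotation_line(refgene_line))
print(check_annotation_line(bed_line))
```

The first call reports no problem, while the BED-style line is flagged because its fourth column holds a name rather than a strand.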

If you paste the first couple of lines of your annotation file, I can see what needs to be modified.

skillcoyne commented 9 years ago

OK, I had used the refseq.bed file because I thought I had read that it was the required annotation. I downloaded refGene.txt from UCSC and was able to run the example, so all is well. Thanks for the quick response!