ccgd-profile / BreaKmer

A method to identify structural variation from sequencing data in target regions
31 stars 11 forks source link

HG19 and GRCh format #4

Closed huangchv closed 9 years ago

huangchv commented 9 years ago

I'm testing BreaKmer out on bam files that were previously mapped against hg19 and going into them to remove 'chr' will break the files for other people and at the same time, double the amount of space we need. Are there ways around this?

ryanabo commented 9 years ago

Do you have the same reference these were aligned against? If you have this reference fasta with the appropriate 'non-chr' labelled sequences as well as a matching annotation file and bed file, then you should not have to modify the bam files. These minor differences are somewhat safeguarded against within BreaKmer. You can always try it out without changing anything. I can always try to modify something to make this an easier process. Let me know how it runs without the modifications.

huangchv commented 9 years ago

I was previously running BreaKmer with an unmodified hg19 fasta file and got this error: File "/oicr/x86_64/boutroslab/scripts/breakmer-0.0.7/utils.py", line 532, in extract_refseq_fa seq = ref_d[chrom].seq[(s-200):(e+200)] KeyError: '15'

Figured it was because the chr was in there. I stripped them out of the fasta file and then I was getting an error with pysam:

File "pysam/calignmentfile.pyx", line 725, in pysam.calignmentfile.AlignmentFile.fetch (pysam/calignmentfile.c:9853) File "pysam/calignmentfile.pyx", line 639, in pysam.calignmentfile.AlignmentFile._parseRegion (pysam/calignmentfile.c:8849) ValueError: invalid reference 15

ryanabo commented 9 years ago

Ok, this is an odd error. Can you send your bed file and reference fasta (or a link to dropbox)?

Ryan

On Tue, Mar 3, 2015 at 9:45 AM, huangchv notifications@github.com wrote:

I was previously running BreaKmer with an unmodified hg19 fasta file and got this error: File "/oicr/x86_64/boutroslab/scripts/breakmer-0.0.7/utils.py", line 532, in extract_refseq_fa seq = ref_d[chrom].seq[(s-200):(e+200)] KeyError: '15'

Figured it was because the chr was in there. I stripped them out of the fasta file and then I was getting an error with pysam:

File "pysam/calignmentfile.pyx", line 725, in pysam.calignmentfile.AlignmentFile.fetch (pysam/calignmentfile.c:9853) File "pysam/calignmentfile.pyx", line 639, in pysam.calignmentfile.AlignmentFile._parseRegion (pysam/calignmentfile.c:8849) ValueError: invalid reference 15

— Reply to this email directly or view it on GitHub https://github.com/a-bioinformatician/BreaKmer/issues/4#issuecomment-76958055 .

huangchv commented 9 years ago

The bed file is the one that was included in the example_data directory. As for a copy of the hg19 file I am using, do you have a preferred email address I can send the link to?

ryanabo commented 9 years ago

Ok, you can send the link to ryan.abo@gmail.com

Thanks.

On Thu, Mar 5, 2015 at 10:12 AM, huangchv notifications@github.com wrote:

The bed file is the one that was included in the example_data directory. As for a copy of the hg19 file I am using, do you have a preferred email address I can send the link to?

— Reply to this email directly or view it on GitHub https://github.com/a-bioinformatician/BreaKmer/issues/4#issuecomment-77380672 .

ryanabo commented 9 years ago

I have obtained and tested the reference file that you sent. Unfortunately, I cannot recreate the error that you have posted. To better track this down, can you send me your configuration file as well as the bed file? Can you also send the version of pysam and python that you are using?

huangchv commented 9 years ago

Hmmm, odd. I've sent the files to your email. I am using: python 2.7.2 pysam 0.8.2.1

ryanabo commented 9 years ago

Ok after reviewing this issue for awhile, I think it was as simple as making sure that the chromosome names or ids are consistent across your files. From your original bug report you said you had 'chr' in your reference fasta file. When you removed the 'chr' from the fasta file, you received another error, which indicated that your bam file was aligned to the reference fasta with 'chr' because your fasta file chromosome ids were different from your bam file. In the end, all you need to do was to use 'chr' in your input bed file like this: chr15 45003745 45003811 B2M exon chr15 45007621 45007922 B2M exon chr15 45008527 45008540 B2M exon

These files just need to have consistent chromosome names:

  1. Reference fasta
  2. Bed file
  3. Bam file
ryanabo commented 9 years ago

I am closing this issue until further notice.