Closed mgandal closed 7 years ago
Here is an example using wiggletools write_bg - trim regions.bed indexed_alignments.bam
with a BAM file with numeric chromosomes. The BAM file is indexed and coordinate sorted:
@HD VN:1.4 SO:coordinate
@SQ SN:1 LN:249250621 UR:file:/RefGenome/genome.fa M5:1b22b98cdeb4a9304cb5d48026a85128
@SQ SN:2 LN:243199373 UR:file:/RefGenome/genome.fa M5:a0d9851da00400dec1098a9255ac712e
@SQ SN:3 LN:198022430 UR:file:/RefGenome/genome.fa M5:fdfd811849cc2fadebc929bb925902e5
@SQ SN:4 LN:191154276 UR:file:/RefGenome/genome.fa M5:23dccd106897542ad87d2765d28a19a1
@SQ SN:5 LN:180915260 UR:file:/RefGenome/genome.fa M5:0740173db9ffd264d728f32784845cd7
@SQ SN:6 LN:171115067 UR:file:/RefGenome/genome.fa M5:1d3a93a248d92a729ee764823acbbc6b
@SQ SN:7 LN:159138663 UR:file:/RefGenome/genome.fa M5:618366e953d6aaad97dbe4777c29375e
@SQ SN:8 LN:146364022 UR:file:/RefGenome/genome.fa M5:96f514a9929e410c6651697bded59aec
@SQ SN:9 LN:141213431 UR:file:/RefGenome/genome.fa M5:3e273117f15e0a400f01055d9f393768
@SQ SN:10 LN:135534747 UR:file:/RefGenome/genome.fa M5:988c28e000e84c26d552359af1ea2e1d
@SQ SN:11 LN:135006516 UR:file:/RefGenome/genome.fa M5:98c59049a2df285c76ffb1c6db8f8b96
@SQ SN:12 LN:133851895 UR:file:/RefGenome/genome.fa M5:51851ac0e1a115847ad36449b0015864
@SQ SN:13 LN:115169878 UR:file:/RefGenome/genome.fa M5:283f8d7892baa81b510a015719ca7b0b
@SQ SN:14 LN:107349540 UR:file:/RefGenome/genome.fa M5:98f3cae32b2a2e9524bc19813927542e
@SQ SN:15 LN:102531392 UR:file:/RefGenome/genome.fa M5:e5645a794a8238215b2cd77acb95a078
@SQ SN:16 LN:90354753 UR:file:/RefGenome/genome.fa M5:fc9b1a7b42b97a864f56b348b06095e6
@SQ SN:17 LN:81195210 UR:file:/RefGenome/genome.fa M5:351f64d4f4f9ddd45b35336ad97aa6de
@SQ SN:18 LN:78077248 UR:file:/RefGenome/genome.fa M5:b15d4b2d29dde9d3e4f93d1d0f2cbc9c
@SQ SN:19 LN:59128983 UR:file:/RefGenome/genome.fa M5:1aacd71f30db8e561810913e0b72636d
@SQ SN:20 LN:63025520 UR:file:/RefGenome/genome.fa M5:0dec9660ec1efaaf33281c0d5ea2560f
@SQ SN:21 LN:48129895 UR:file:/RefGenome/genome.fa M5:2979a6085bfe28e3ad6f552f361ed74d
@SQ SN:22 LN:51304566 UR:file:/RefGenome/genome.fa M5:a718acaa6135fdca8357d5bfe94211dd
@SQ SN:X LN:155270560 UR:file:/RefGenome/genome.fa M5:7e0e2e580297b7764e31dbc80c2540dd
@SQ SN:Y LN:59373566 UR:file:/RefGenome/genome.fa M5:1fa3474750af0948bdf97d5a0ee52e51
For simplicity, I made a BED file with only one region per chromosome. If I sort the BED file by chromosome numerically sort -k1,1n -k2,2n -k3,3n regions_noCh_v1.bed
, I get the following error:
Position 10:714133 is before 9:32552799
even though the BED file is corrected sorted by chromosome in numeric order:
1 6257711 6257893 . 0 .
2 1872090 1872390 . 0 .
3 197682531 197683078 . 0 .
4 602915 603654 . 0 .
5 176279272 176279378 . 0 .
6 16430725 16431049 . 0 .
6 166265710 166267024 . 0 .
7 713588 713686 . 0 .
8 121457307 121457378 . 0 .
9 32552798 32552977 . 0 .
10 714132 714671 . 0 .
11 14520383 14520556 . 0 .
12 6601303 6601633 . 0 .
13 114561069 114562803 . 0 .
14 23495058 23495584 . 0 .
15 101812398 101813061 . 0 .
16 648960 649176 . 0 .
17 79857795 79858128 . 0 .
18 3253886 3254048 . 0 .
19 58906048 58906170 . 0 .
20 60963364 60963575 . 0 .
21 35286753 35286804 . 0 .
If instead I sort the BED file by chromosome lexographically (as if there were a "chr" prefix), wiggletools runs but only returns outputs for chromosomes 1-9 and misses everything from chr 10+. Here is the sorted BED file
1 6257711 6257893 . 0 .
10 714132 714671 . 0 .
11 14520383 14520556 . 0 .
12 6601303 6601633 . 0 .
13 114561069 114562803 . 0 .
14 23495058 23495584 . 0 .
15 101812398 101813061 . 0 .
16 648960 649176 . 0 .
17 79857795 79858128 . 0 .
18 3253886 3254048 . 0 .
19 58906048 58906170 . 0 .
2 1872090 1872390 . 0 .
20 60963364 60963575 . 0 .
21 35286753 35286804 . 0 .
3 197682531 197683078 . 0 .
4 602915 603654 . 0 .
5 176279272 176279378 . 0 .
6 16430725 16431049 . 0 .
6 166265710 166267024 . 0 .
7 713588 713686 . 0 .
8 121457307 121457378 . 0 .
9 32552798 32552977 . 0 .
Hello @mgandal ,
Thanks for your bug report. This appears to be a bug rather with the BAM parser, which does not read the chromosomes in the expected alphabetical order, rather the order it was created in.
I'll have to modify it the.
Cheers,
Daniel
The wiggletools trim function appears to give incomplete output when comparing a BED and BAM file that have different chromosomal orderings. The BED file has to be ordered by the first column, but if the genomic nomenclature uses a "chr" prefix the sorting has to be in lexigraphic order (e.g., chr1 then chr10). However, coordinate sorted BAM files can have chromosomal ordering as dictated by the header, not always in this same lexographical or numeric ordering.
If the BAM file ordering is chr1, chr2, ... chr10, ... chr22 abd the BED file is chr1, chr10, ... then the wiggle_tools trim will only output intersection of intervals on chromosomes 1-9 and misses the overlaps on the chromosomes 10-22.
If I try and sort the BED file to match the numeric ordering of BAM file, I get the following error: "Bed file ./tmp.bed is not sorted! Position chr10:714133 is before chr9:140899137"
even though in the actual BED file chr9 is before chr10
Is there a way to fix this?