Ensembl / WiggleTools

Basic operations on the space of numerical functions defined on the genome using lazy evaluators for flexibility and efficiency
Apache License 2.0
144 stars 25 forks source link

Trim #19

Closed mgandal closed 7 years ago

mgandal commented 7 years ago

The wiggletools trim function appears to give incomplete output when comparing a BED and BAM file that have different chromosomal orderings. The BED file has to be ordered by the first column, but if the genomic nomenclature uses a "chr" prefix the sorting has to be in lexigraphic order (e.g., chr1 then chr10). However, coordinate sorted BAM files can have chromosomal ordering as dictated by the header, not always in this same lexographical or numeric ordering.

If the BAM file ordering is chr1, chr2, ... chr10, ... chr22 abd the BED file is chr1, chr10, ... then the wiggle_tools trim will only output intersection of intervals on chromosomes 1-9 and misses the overlaps on the chromosomes 10-22.

If I try and sort the BED file to match the numeric ordering of BAM file, I get the following error: "Bed file ./tmp.bed is not sorted! Position chr10:714133 is before chr9:140899137"

even though in the actual BED file chr9 is before chr10

Is there a way to fix this?

mgandal commented 7 years ago

Here is an example using wiggletools write_bg - trim regions.bed indexed_alignments.bam with a BAM file with numeric chromosomes. The BAM file is indexed and coordinate sorted:

@HD VN:1.4  SO:coordinate
@SQ SN:1    LN:249250621    UR:file:/RefGenome/genome.fa    M5:1b22b98cdeb4a9304cb5d48026a85128
@SQ SN:2    LN:243199373    UR:file:/RefGenome/genome.fa    M5:a0d9851da00400dec1098a9255ac712e
@SQ SN:3    LN:198022430    UR:file:/RefGenome/genome.fa    M5:fdfd811849cc2fadebc929bb925902e5
@SQ SN:4    LN:191154276    UR:file:/RefGenome/genome.fa    M5:23dccd106897542ad87d2765d28a19a1
@SQ SN:5    LN:180915260    UR:file:/RefGenome/genome.fa    M5:0740173db9ffd264d728f32784845cd7
@SQ SN:6    LN:171115067    UR:file:/RefGenome/genome.fa    M5:1d3a93a248d92a729ee764823acbbc6b
@SQ SN:7    LN:159138663    UR:file:/RefGenome/genome.fa    M5:618366e953d6aaad97dbe4777c29375e
@SQ SN:8    LN:146364022    UR:file:/RefGenome/genome.fa    M5:96f514a9929e410c6651697bded59aec
@SQ SN:9    LN:141213431    UR:file:/RefGenome/genome.fa    M5:3e273117f15e0a400f01055d9f393768
@SQ SN:10   LN:135534747    UR:file:/RefGenome/genome.fa    M5:988c28e000e84c26d552359af1ea2e1d
@SQ SN:11   LN:135006516    UR:file:/RefGenome/genome.fa    M5:98c59049a2df285c76ffb1c6db8f8b96
@SQ SN:12   LN:133851895    UR:file:/RefGenome/genome.fa    M5:51851ac0e1a115847ad36449b0015864
@SQ SN:13   LN:115169878    UR:file:/RefGenome/genome.fa    M5:283f8d7892baa81b510a015719ca7b0b
@SQ SN:14   LN:107349540    UR:file:/RefGenome/genome.fa    M5:98f3cae32b2a2e9524bc19813927542e
@SQ SN:15   LN:102531392    UR:file:/RefGenome/genome.fa    M5:e5645a794a8238215b2cd77acb95a078
@SQ SN:16   LN:90354753 UR:file:/RefGenome/genome.fa    M5:fc9b1a7b42b97a864f56b348b06095e6
@SQ SN:17   LN:81195210 UR:file:/RefGenome/genome.fa    M5:351f64d4f4f9ddd45b35336ad97aa6de
@SQ SN:18   LN:78077248 UR:file:/RefGenome/genome.fa    M5:b15d4b2d29dde9d3e4f93d1d0f2cbc9c
@SQ SN:19   LN:59128983 UR:file:/RefGenome/genome.fa    M5:1aacd71f30db8e561810913e0b72636d
@SQ SN:20   LN:63025520 UR:file:/RefGenome/genome.fa    M5:0dec9660ec1efaaf33281c0d5ea2560f
@SQ SN:21   LN:48129895 UR:file:/RefGenome/genome.fa    M5:2979a6085bfe28e3ad6f552f361ed74d
@SQ SN:22   LN:51304566 UR:file:/RefGenome/genome.fa    M5:a718acaa6135fdca8357d5bfe94211dd
@SQ SN:X    LN:155270560    UR:file:/RefGenome/genome.fa    M5:7e0e2e580297b7764e31dbc80c2540dd
@SQ SN:Y    LN:59373566 UR:file:/RefGenome/genome.fa    M5:1fa3474750af0948bdf97d5a0ee52e51

For simplicity, I made a BED file with only one region per chromosome. If I sort the BED file by chromosome numerically sort -k1,1n -k2,2n -k3,3n regions_noCh_v1.bed, I get the following error:

Position 10:714133 is before 9:32552799

even though the BED file is corrected sorted by chromosome in numeric order:

1   6257711 6257893 .   0   .
2   1872090 1872390 .   0   .
3   197682531   197683078   .   0   .
4   602915  603654  .   0   .
5   176279272   176279378   .   0   .
6   16430725    16431049    .   0   .
6   166265710   166267024   .   0   .
7   713588  713686  .   0   .
8   121457307   121457378   .   0   .
9   32552798    32552977    .   0   .
10  714132  714671  .   0   .
11  14520383    14520556    .   0   .
12  6601303 6601633 .   0   .
13  114561069   114562803   .   0   .
14  23495058    23495584    .   0   .
15  101812398   101813061   .   0   .
16  648960  649176  .   0   .
17  79857795    79858128    .   0   .
18  3253886 3254048 .   0   .
19  58906048    58906170    .   0   .
20  60963364    60963575    .   0   .
21  35286753    35286804    .   0   .

If instead I sort the BED file by chromosome lexographically (as if there were a "chr" prefix), wiggletools runs but only returns outputs for chromosomes 1-9 and misses everything from chr 10+. Here is the sorted BED file

1   6257711 6257893 .   0   .
10  714132  714671  .   0   .
11  14520383    14520556    .   0   .
12  6601303 6601633 .   0   .
13  114561069   114562803   .   0   .
14  23495058    23495584    .   0   .
15  101812398   101813061   .   0   .
16  648960  649176  .   0   .
17  79857795    79858128    .   0   .
18  3253886 3254048 .   0   .
19  58906048    58906170    .   0   .
2   1872090 1872390 .   0   .
20  60963364    60963575    .   0   .
21  35286753    35286804    .   0   .
3   197682531   197683078   .   0   .
4   602915  603654  .   0   .
5   176279272   176279378   .   0   .
6   16430725    16431049    .   0   .
6   166265710   166267024   .   0   .
7   713588  713686  .   0   .
8   121457307   121457378   .   0   .
9   32552798    32552977    .   0   .
dzerbino commented 7 years ago

Hello @mgandal ,

Thanks for your bug report. This appears to be a bug rather with the BAM parser, which does not read the chromosomes in the expected alphabetical order, rather the order it was created in.

I'll have to modify it the.

Cheers,

Daniel