daler / gffutils

GFF and GTF file manipulation and interconversion
http://daler.github.io/gffutils
MIT License
287 stars 78 forks source link

Inconsistent Ordering of exon_number attribute in introns #174

Closed mbiokyle29 closed 2 years ago

mbiokyle29 commented 2 years ago

Not sure this is technically a bug, but something that could cause issues. I noticed an inconsistency with the ordering of values in the exon_number attribute in introns (generated from create_introns). For most introns, the exon_number field will contain a list like ["1", "2"]. The idea being that this intron occurs after exon 1 and before exon 2, and is therefore "intron 1". This holds for all introns until the intron spanning exon 9 -> exon 10. The sorting is done using strings (at least based on how its formatted in an NCBI gtf). "10" will sort before "9", and thus the result will be ["10", "9"].

Example:

# all BCRA2 exons from GCF_000001405.39_GRCh38.p13_genomic.gtf 
NC_000013.11    BestRefSeq  exon    32315508    32315667    .   +   .   gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "1"; 
NC_000013.11    BestRefSeq  exon    32316422    32316527    .   +   .   gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "2"; 
NC_000013.11    BestRefSeq  exon    32319077    32319325    .   +   .   gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "3"; 
NC_000013.11    BestRefSeq  exon    32325076    32325184    .   +   .   gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "4"; 
NC_000013.11    BestRefSeq  exon    32326101    32326150    .   +   .   gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "5"; 
NC_000013.11    BestRefSeq  exon    32326242    32326282    .   +   .   gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "6"; 
NC_000013.11    BestRefSeq  exon    32326499    32326613    .   +   .   gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "7"; 
NC_000013.11    BestRefSeq  exon    32329443    32329492    .   +   .   gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "8"; 
NC_000013.11    BestRefSeq  exon    32330919    32331030    .   +   .   gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "9"; 
NC_000013.11    BestRefSeq  exon    32332272    32333387    .   +   .   gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "10"; 
NC_000013.11    BestRefSeq  exon    32336265    32341196    .   +   .   gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "11"; 
NC_000013.11    BestRefSeq  exon    32344558    32344653    .   +   .   gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "12"; 
NC_000013.11    BestRefSeq  exon    32346827    32346896    .   +   .   gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "13"; 
NC_000013.11    BestRefSeq  exon    32354861    32355288    .   +   .   gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "14"; 
NC_000013.11    BestRefSeq  exon    32356428    32356609    .   +   .   gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "15"; 
NC_000013.11    BestRefSeq  exon    32357742    32357929    .   +   .   gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "16"; 
NC_000013.11    BestRefSeq  exon    32362523    32362693    .   +   .   gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "17"; 
NC_000013.11    BestRefSeq  exon    32363179    32363533    .   +   .   gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "18"; 
NC_000013.11    BestRefSeq  exon    32370402    32370557    .   +   .   gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "19"; 
NC_000013.11    BestRefSeq  exon    32370956    32371100    .   +   .   gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "20"; 
NC_000013.11    BestRefSeq  exon    32376670    32376791    .   +   .   gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "21"; 
NC_000013.11    BestRefSeq  exon    32379317    32379515    .   +   .   gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "22"; 
NC_000013.11    BestRefSeq  exon    32379750    32379913    .   +   .   gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "23"; 
NC_000013.11    BestRefSeq  exon    32380007    32380145    .   +   .   gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "24"; 
NC_000013.11    BestRefSeq  exon    32394689    32394933    .   +   .   gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "25"; 
NC_000013.11    BestRefSeq  exon    32396898    32397044    .   +   .   gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "26"; 
NC_000013.11    BestRefSeq  exon    32398162    32400268    .   +   .   gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "27"; 

Code:

import gffutils

db_filename = 'brca2_bug.db'
feature_db = gffutils.create_db(
    'brca2.gtf',
    db_filename,
    merge_strategy='warning',
)

introns = [f for f in feature_db.create_introns()]
[i.attributes['exon_number'] for i in introns]
Out[12]:
[['1', '2'],
 ['2', '3'],
 ['3', '4'],
 ['4', '5'],
 ['5', '6'],
 ['6', '7'],
 ['7', '8'],
 ['8', '9'],
 ['10', '9'],  # <---- this is the issue
 ['10', '11'], 
 ['11', '12'],
 ['12', '13'],
 ['13', '14'],
 ['14', '15'],
 ['15', '16'],
 ['16', '17'],
 ['17', '18'],
 ['18', '19'],
 ['19', '20'],
 ['20', '21'],
 ['21', '22'],
 ['22', '23'],
 ['23', '24'],
 ['24', '25'],
 ['25', '26'],
 ['26', '27']]

I think the issue is here: https://github.com/daler/gffutils/blob/4b5b28e610a435af359ab1c31271deea1bae4c47/gffutils/helpers.py#L333

Might be tricky to get this to work in all cases, but seems like you could check str.isdigit on each item in v. If all are true, cast to int for the sorting, then cast back to string. Perhaps enable this via a flag.

daler commented 2 years ago

Thanks for the very clear writeup and example -- this is fixed in https://github.com/daler/gffutils/pull/192.

daler commented 2 years ago

Now addressed in v0.11.