daler / gffutils

GFF and GTF file manipulation and interconversion
http://daler.github.io/gffutils
MIT License
287 stars 78 forks source link

Can't get children from NCBI GFF (may be user error) #158

Closed deannachurch closed 2 years ago

deannachurch commented 4 years ago

Hi- I've been trying to use gffutils to parse an NCBI GFF and ran into an issue getting the children. I'm testing with a small gff of three genes pulled from the original gff. Things started out OK:

if not test_db:
    db=gffutils.create_db(test, test_db, id_spec={'gene': 'db_xref'})
db=gffutils.FeatureDB(test_db)
for i in db.featuretypes():
    print("Feature: %s: %d" % (i, db.count_features_of_type(i)))

Feature: CDS: 38
Feature: exon: 41
Feature: gene: 3
Feature: mRNA: 6

And, i can get genes

for g in db.features_of_type('gene'):
    print(g)

chr1    BestRefSeq%2CGnomon gene    11686635    11725857    .   +   .   ID=gene-DRAXIN;Dbxref=GeneID:374946,HGNC:HGNC:25054,MIM:612682;Name=DRAXIN;description=dorsal inhibitory axon guidance protein;gbkey=Gene;gene=DRAXIN;gene_biotype=protein_coding;gene_synonym=AGPA3119,C1orf187,neucrin,UNQ3119
chr1    BestRefSeq  gene    15617458    15669044    .   +   .   ID=gene-DDI2;Dbxref=GeneID:84301,HGNC:HGNC:24578;Name=DDI2;description=DNA damage inducible 1 homolog 2;gbkey=Gene;gene=DDI2;gene_biotype=protein_coding
chr1    BestRefSeq  gene    19920009    19923617    .   -   .   ID=gene-PLA2G2E;Dbxref=GeneID:30814,HGNC:HGNC:13414,MIM:618320;Name=PLA2G2E;description=phospholipase A2 group IIE;gbkey=Gene;gene=PLA2G2E;gene_biotype=protein_coding;gene_synonym=GIIE sPLA2,sPLA2-IIE

and a specific gene

gene=db['gene_1']
print(gene)

chr1    BestRefSeq%2CGnomon gene    11686635    11725857    .   +   .   ID=gene-DRAXIN;Dbxref=GeneID:374946,HGNC:HGNC:25054,MIM:612682;Name=DRAXIN;description=dorsal inhibitory axon guidance protein;gbkey=Gene;gene=DRAXIN;gene_biotype=protein_coding;gene_synonym=AGPA3119,C1orf187,neucrin,UNQ3119

but no children:

for i in db.children(gene):
    print(i)

Is this just user error or something with the NCBI GFF?

Brunox13 commented 4 years ago

I also encountered this - I cannot get children of the features in the database that I built from a GTF file. Specifically, I took the following steps:

I downloaded the reference from 10X Genomics:

wget http://cf.10xgenomics.com/supp/cell-exp/refdata-cellranger-GRCh38-3.0.0.tar.gz
gunzip refdata-cellranger-GRCh38-3.0.0.tar.gz

Here, the GTF file is located in ./refdata-cellranger-GRCh38-3.0.0/genes/genes.gtf.

I then try to follow the example given in the API introduction:

import gffutils

gffutils.create_db(
    'refdata-cellranger-GRCh38-3.0.0/genes/genes.gtf',
    dbfn = 'gffdb.db',
    force = True,
    merge_strategy = 'create_unique',
    id_spec = ['ID', 'Name'],
    verbose = True
    )
db = gffutils.FeatureDB('gffdb.db', keep_order = True)
gene = db['gene_1']
gene

Output: <Feature gene (1:29554-31109[+]) at 0x2af664f3ed00>

for i in db_conn.children(gene):
    print(i)

Output: [empty]

While clearly, this gene has a bunch of sub-features in the GTF file. Any ideas? Thanks!

lldelisle commented 4 years ago

Hi, I got a similar error when there was no quote around the ids. For example, a file which does not work:

chr2    ensembl_havana  gene    74668310        74671599        .       +       .       gene_id ENSMUSG00000001819; gene_version 4; gene_name Hoxd13; gene_source ensembl_havana; gene_biotype protein_coding;
chr2    ensembl_havana  transcript      74668310        74671599        .       +       .       gene_id ENSMUSG00000001819; gene_version 4; transcript_id ENSMUST00000001872; transcript_version 4; gene_name Hoxd13; gene_source ensembl_havana; gene_biotype protein_coding; transcript_name Hoxd13-201; transcript_source ensembl_havana; transcript_biotype protein_coding; tag CCDS; ccds_id CCDS16139; tag basic; transcript_support_level 1;
chr2    ensembl_havana  exon    74668310        74669078        .       +       .       gene_id ENSMUSG00000001819; gene_version 4; transcript_id ENSMUST00000001872; transcript_version 4; exon_number 1; gene_name Hoxd13; gene_source ensembl_havana; gene_biotype protein_coding; transcript_name Hoxd13-201; transcript_source ensembl_havana; transcript_biotype protein_coding; tag CCDS; ccds_id CCDS16139; exon_id ENSMUSE00000336629; exon_version 2; tag basic; transcript_support_level 1;
chr2    ensembl_havana  exon    74669886        74671599        .       +       .       gene_id ENSMUSG00000001819; gene_version 4; transcript_id ENSMUST00000001872; transcript_version 4; exon_number 2; gene_name Hoxd13; gene_source ensembl_havana; gene_biotype protein_coding; transcript_name Hoxd13-201; transcript_source ensembl_havana; transcript_biotype protein_coding; tag CCDS; ccds_id CCDS16139; exon_id ENSMUSE00000601614; exon_version 3; tag basic; transcript_support_level 1;

With this file:

>>> import gffutils
>>> fn="not_working.gtf"
>>> db = gffutils.create_db(fn, ':memory:')
>>> tr=next(db.features_of_type("transcript"))
>>> next(db.children(tr))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

But when I add quotes:

chr2    ensembl_havana  gene    74668310        74671599        .       +       .       gene_id "ENSMUSG00000001819"; gene_version 4; gene_name Hoxd13; gene_source ensembl_havana; gene_biotype protein_coding;
chr2    ensembl_havana  transcript      74668310        74671599        .       +       .       gene_id "ENSMUSG00000001819"; gene_version 4; transcript_id "ENSMUST00000001872"; transcript_version 4; gene_name Hoxd13; gene_source ensembl_havana; gene_biotype protein_coding; transcript_name Hoxd13-201; transcript_source ensembl_havana; transcript_biotype protein_coding; tag CCDS; ccds_id CCDS16139; tag basic; transcript_support_level 1;
chr2    ensembl_havana  exon    74668310        74669078        .       +       .       gene_id "ENSMUSG00000001819"; gene_version 4; transcript_id "ENSMUST00000001872"; transcript_version 4; exon_number 1; gene_name Hoxd13; gene_source ensembl_havana; gene_biotype protein_coding; transcript_name Hoxd13-201; transcript_source ensembl_havana; transcript_biotype protein_coding; tag CCDS; ccds_id CCDS16139; exon_id ENSMUSE00000336629; exon_version 2; tag basic; transcript_support_level 1;
chr2    ensembl_havana  exon    74669886        74671599        .       +       .       gene_id "ENSMUSG00000001819"; gene_version 4; transcript_id "ENSMUST00000001872"; transcript_version 4; exon_number 2; gene_name Hoxd13; gene_source ensembl_havana; gene_biotype protein_coding; transcript_name Hoxd13-201; transcript_source ensembl_havana; transcript_biotype protein_coding; tag CCDS; ccds_id CCDS16139; exon_id ENSMUSE00000601614; exon_version 3; tag basic; transcript_support_level 1;

I get:

>>> import gffutils
>>> fn="working.gtf"
>>> db = gffutils.create_db(fn, ':memory:')
/home/ldelisle/.conda/envs/pgt_3.3/lib/python3.6/site-packages/gffutils/create.py:734: UserWarning: It appears you have a gene feature in your GTF file. You may want to use the `disable_infer_genes=True` option to speed up database creation
  "It appears you have a gene feature in your GTF "
/home/ldelisle/.conda/envs/pgt_3.3/lib/python3.6/site-packages/gffutils/create.py:725: UserWarning: It appears you have a transcript feature in your GTF file. You may want to use the `disable_infer_transcripts=True` option to speed up database creation
  "It appears you have a transcript feature in your GTF "
>>> tr=next(db.features_of_type("transcript"))
>>> next(db.children(tr))
<Feature transcript (chr2:74668310-74671599[+]) at 0x7f00c6bad780>

I hope this helps.

Brunox13 commented 4 years ago

I'd like to note that this was not the problem in my case, described in my comment above - my GTF file does have quotes around the ID values and still, no children are found.

Another difference is that when you called the gffutils.FeatureDB.children() function, you got an error, while I did not - in my case, the iterator was simply empty.

lldelisle commented 4 years ago

Can you print a head of your gtf?

It raises an error because I did next. If I don't do that it gives me an empty list like you.

Brunox13 commented 4 years ago

I used the following code & here's the output (but let me know if you'd like me to do it differently):

with open(gtf_file) as gtf:
    for _ in range(10):
        print(next(gtf))

[out:]

#!genome-build GRCh38.p12

#!genome-version GRCh38

#!genome-date 2013-12

#!genome-build-accession NCBI:GCA_000001405.27

#!genebuild-last-updated 2018-01

1   havana  gene    29554   31109   .   +   .   gene_id "ENSG00000243485"; gene_version "5"; gene_name "MIR1302-2HG"; gene_source "havana"; gene_biotype "lincRNA"

1   havana  transcript  29554   31097   .   +   .   gene_id "ENSG00000243485"; gene_version "5"; transcript_id "ENST00000473358"; transcript_version "1"; gene_name "MIR1302-2HG"; gene_source "havana"; gene_biotype "lincRNA"; transcript_name "MIR1302-2HG-202"; transcript_source "havana"; transcript_biotype "lincRNA"; tag "basic"; transcript_support_level "5"

1   havana  exon    29554   30039   .   +   .   gene_id "ENSG00000243485"; gene_version "5"; transcript_id "ENST00000473358"; transcript_version "1"; exon_number "1"; gene_name "MIR1302-2HG"; gene_source "havana"; gene_biotype "lincRNA"; transcript_name "MIR1302-2HG-202"; transcript_source "havana"; transcript_biotype "lincRNA"; exon_id "ENSE00001947070"; exon_version "1"; tag "basic"; transcript_support_level "5"

1   havana  exon    30564   30667   .   +   .   gene_id "ENSG00000243485"; gene_version "5"; transcript_id "ENST00000473358"; transcript_version "1"; exon_number "2"; gene_name "MIR1302-2HG"; gene_source "havana"; gene_biotype "lincRNA"; transcript_name "MIR1302-2HG-202"; transcript_source "havana"; transcript_biotype "lincRNA"; exon_id "ENSE00001922571"; exon_version "1"; tag "basic"; transcript_support_level "5"

1   havana  exon    30976   31097   .   +   .   gene_id "ENSG00000243485"; gene_version "5"; transcript_id "ENST00000473358"; transcript_version "1"; exon_number "3"; gene_name "MIR1302-2HG"; gene_source "havana"; gene_biotype "lincRNA"; transcript_name "MIR1302-2HG-202"; transcript_source "havana"; transcript_biotype "lincRNA"; exon_id "ENSE00001827679"; exon_version "1"; tag "basic"; transcript_support_level "5"

Minor edit: I originally showed the head of the wrong file (mouse genome instead of human), but the format is the same - I only fixed it for consistency. The main point is the same.

lldelisle commented 4 years ago

@Brunox13 You should not specify id_spec: Here is my test.gtf:

#!genome-build GRCm38.p6
#!genome-version GRCm38
#!genome-date 2012-01
#!genome-build-accession NCBI:GCA_000001635.8
#!genebuild-last-updated 2018-03
1   ensembl_havana  gene    3205901 3671498 .   -   .   gene_id "ENSMUSG00000051951"; gene_version "5"; gene_name "Xkr4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"
1   havana  transcript  3205901 3216344 .   -   .   gene_id "ENSMUSG00000051951"; gene_version "5"; transcript_id "ENSMUST00000162897"; transcript_version "1"; gene_name "Xkr4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "Xkr4-203"; transcript_source "havana"; transcript_biotype "processed_transcript"; transcript_support_level "1"
1   havana  exon    3213609 3216344 .   -   .   gene_id "ENSMUSG00000051951"; gene_version "5"; transcript_id "ENSMUST00000162897"; transcript_version "1"; exon_number "1"; gene_name "Xkr4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "Xkr4-203"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSMUSE00000858910"; exon_version "1"; transcript_support_level "1"
1   havana  exon    3205901 3207317 .   -   .   gene_id "ENSMUSG00000051951"; gene_version "5"; transcript_id "ENSMUST00000162897"; transcript_version "1"; exon_number "2"; gene_name "Xkr4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "Xkr4-203"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSMUSE00000866652"; exon_version "1"; transcript_support_level "1"

In python:

import gffutils
# When I specify id_spec:
gffutils.create_db(
    'test.gtf',
    dbfn = 'gffdb.db',
    force = True,
    merge_strategy = 'create_unique',
    id_spec = ['ID', 'Name'],
    verbose = True
    )
db = gffutils.FeatureDB('gffdb.db', keep_order = True)
gene = next(db.features_of_type("gene"))
gene

Output:

<Feature gene (1:3205901-3671498[-]) at 0x7efe5a3da828>

In python:

for i in db.children(gene):
  print(i)

No output But:

# Without specifying id_spec:
gffutils.create_db(
    'test.gtf',
    dbfn = 'gffdb2.db',
    force = True,
    merge_strategy = 'create_unique',
    # id_spec = ['ID', 'Name'],
    verbose = True
    )
db2 = gffutils.FeatureDB('gffdb2.db', keep_order = True)
gene = next(db2.features_of_type("gene"))
gene

Output:

<Feature gene (1:3205901-3671498[-]) at 0x7efe5a3da828>

Back to python

for i in db2.children(gene):
  print(i)

This time, output:

1       ensembl_havana  gene    3205901 3671498 .       -       .       gene_id "ENSMUSG00000051951"; gene_version "5"; gene_name "Xkr4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"
1       havana  transcript      3205901 3216344 .       -       .       gene_id "ENSMUSG00000051951"; gene_version "5"; gene_name "Xkr4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_id "ENSMUST00000162897"; transcript_version "1"; transcript_name "Xkr4-203"; transcript_source "havana"; transcript_biotype "processed_transcript"; transcript_support_level "1"
1       havana  exon    3213609 3216344 .       -       .       gene_id "ENSMUSG00000051951"; gene_version "5"; gene_name "Xkr4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_id "ENSMUST00000162897"; transcript_version "1"; transcript_name "Xkr4-203"; transcript_source "havana"; transcript_biotype "processed_transcript"; transcript_support_level "1"; exon_number "1"; exon_id "ENSMUSE00000858910"; exon_version "1"
1       havana  exon    3205901 3207317 .       -       .       gene_id "ENSMUSG00000051951"; gene_version "5"; gene_name "Xkr4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_id "ENSMUST00000162897"; transcript_version "1"; transcript_name "Xkr4-203"; transcript_source "havana"; transcript_biotype "processed_transcript"; transcript_support_level "1"; exon_number "2"; exon_id "ENSMUSE00000866652"; exon_version "1"
Brunox13 commented 4 years ago

Thank you, @lldelisle, that was causing the issue. The main reason I had specified id_spec was because initially, database creation was taking way too long and when I first tried creating a database with minimum non-default parameters, it took over a day and still did not finish. After doing some online research (I don't remember where I found this advice anymore), I changed a few parameters (I specified id_spec = ['ID', 'Name'], changed the merge_strategy = 'create_unique', and added force = True) and it worked in under ~20 mins, so I was happy and had no idea it would result in gffutils.FeatureDB.children() not working down the line.

In light of this, I now tested how long gffutils.create_db() takes to process my entire GTF file with and without including the id_spec. While the sizes of the resulting .db files are almost identical at around 2.3 GB, the time it took is quite different:

As you can see, the difference is substantial, though I really only need to do this once per GTF file, so it's definitely worth it in exchange for the ability to find the children features. Thanks for your help!

lldelisle commented 4 years ago

To decrease the time, you can remove all genes/transcripts from your gtf with:

cat initial.gtf | awk '$3!="gene" && $3!="transcript"{print}' > filtered.gtf

Because when you have both exons and transcript/gene defined it takes longer because it needs to deal with potential conflicts between what is in the gene/transcript definition and what is in the exon definition. The database with initial.gtf and filtered.gtf is exactly the same if the gtf is well done as it looks to be from the first lines.

Brunox13 commented 4 years ago

Thank you - this sounds like a good advice for someone using the gffutils.FeatureDB() only to look at exonic features, like in RNA-seq! But I am guessing this filtering step would remove the information about the extent [start, end] of the gene and transcript features, is that right? Unfortunately, for my purposes, I need all the features (including their start-end info etc.), which is also why the proper functioning of gffutils.FeatureDB.children() is so helpful!

lldelisle commented 4 years ago

No all these features can be deduced from exons. In each exon feature you have the transcript_id ... if you ask gffutils to build the db it will reconstruct a transcript where the start is the start of the first exon and the end the end of the last exon. You are not loosing any information... You can try and confirm.

Brunox13 commented 4 years ago

That's great to know, I wasn't aware of that!! However, from my understanding, many (if not most) genes have multiple possible transcripts, which are caused not only by alternative splicing, but also alternative transcript starting & ending, all of which is also information that I am taking advantage of. Just as a quick example, the following code is giving me multiple different transcripts in terms of [start, end] per gene:

for gene in db.features_of_type('gene'):
    for feat in db.children(gene):
        if feat.featuretype == 'transcript':
            if feat.start != gene.start or feat.end != gene.end:
                print('Gene: {}-{}; Transcript: {}-{}'.format(gene.start, gene.end, feat.start, feat.end))

It seems inevitable that this information would be lost if only deduced from the 1st & last exons.

EDIT: I thought about this a little more and I realized that if each exon carries the information of which specific transcript & gene it actually belongs to, I see how the parent features could be reconstructed - transcripts from exons and, in turn, genes from transcripts. Do you know if this is the case with all types of GTF/GFF annotation files?

lldelisle commented 4 years ago

I never met a gtf where it did not work but I don't know... I am using gtf files from ensembl.

daler commented 2 years ago

Thank you all for the discussion and helping out answer this. I just want to point out that in cases where the genes and transcripts are already present, disable_infer_genes=True, disable_infer_transcripts=True will disable that inference behavior without requiring grepping them out in a previous step and dramatically (i.e. order of magnitude) speed up database creation.

@Brunox13 regarding the question in the edited version of your last comment, the canonical GTF format uses transcript as the parent of exons, and genes as the parent of transcripts...and transcripts and genes themselves never actually appear in the file. However this is certainly not globally the case, and the examples in the docs show some real-world cases that totally ignore the spec.