Closed deannachurch closed 2 years ago
I also encountered this - I cannot get children of the features in the database that I built from a GTF file. Specifically, I took the following steps:
I downloaded the reference from 10X Genomics:
wget http://cf.10xgenomics.com/supp/cell-exp/refdata-cellranger-GRCh38-3.0.0.tar.gz
gunzip refdata-cellranger-GRCh38-3.0.0.tar.gz
Here, the GTF file is located in ./refdata-cellranger-GRCh38-3.0.0/genes/genes.gtf
.
I then try to follow the example given in the API introduction:
import gffutils
gffutils.create_db(
'refdata-cellranger-GRCh38-3.0.0/genes/genes.gtf',
dbfn = 'gffdb.db',
force = True,
merge_strategy = 'create_unique',
id_spec = ['ID', 'Name'],
verbose = True
)
db = gffutils.FeatureDB('gffdb.db', keep_order = True)
gene = db['gene_1']
gene
Output: <Feature gene (1:29554-31109[+]) at 0x2af664f3ed00>
for i in db_conn.children(gene):
print(i)
Output: [empty]
While clearly, this gene has a bunch of sub-features in the GTF file. Any ideas? Thanks!
Hi, I got a similar error when there was no quote around the ids. For example, a file which does not work:
chr2 ensembl_havana gene 74668310 74671599 . + . gene_id ENSMUSG00000001819; gene_version 4; gene_name Hoxd13; gene_source ensembl_havana; gene_biotype protein_coding;
chr2 ensembl_havana transcript 74668310 74671599 . + . gene_id ENSMUSG00000001819; gene_version 4; transcript_id ENSMUST00000001872; transcript_version 4; gene_name Hoxd13; gene_source ensembl_havana; gene_biotype protein_coding; transcript_name Hoxd13-201; transcript_source ensembl_havana; transcript_biotype protein_coding; tag CCDS; ccds_id CCDS16139; tag basic; transcript_support_level 1;
chr2 ensembl_havana exon 74668310 74669078 . + . gene_id ENSMUSG00000001819; gene_version 4; transcript_id ENSMUST00000001872; transcript_version 4; exon_number 1; gene_name Hoxd13; gene_source ensembl_havana; gene_biotype protein_coding; transcript_name Hoxd13-201; transcript_source ensembl_havana; transcript_biotype protein_coding; tag CCDS; ccds_id CCDS16139; exon_id ENSMUSE00000336629; exon_version 2; tag basic; transcript_support_level 1;
chr2 ensembl_havana exon 74669886 74671599 . + . gene_id ENSMUSG00000001819; gene_version 4; transcript_id ENSMUST00000001872; transcript_version 4; exon_number 2; gene_name Hoxd13; gene_source ensembl_havana; gene_biotype protein_coding; transcript_name Hoxd13-201; transcript_source ensembl_havana; transcript_biotype protein_coding; tag CCDS; ccds_id CCDS16139; exon_id ENSMUSE00000601614; exon_version 3; tag basic; transcript_support_level 1;
With this file:
>>> import gffutils
>>> fn="not_working.gtf"
>>> db = gffutils.create_db(fn, ':memory:')
>>> tr=next(db.features_of_type("transcript"))
>>> next(db.children(tr))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration
But when I add quotes:
chr2 ensembl_havana gene 74668310 74671599 . + . gene_id "ENSMUSG00000001819"; gene_version 4; gene_name Hoxd13; gene_source ensembl_havana; gene_biotype protein_coding;
chr2 ensembl_havana transcript 74668310 74671599 . + . gene_id "ENSMUSG00000001819"; gene_version 4; transcript_id "ENSMUST00000001872"; transcript_version 4; gene_name Hoxd13; gene_source ensembl_havana; gene_biotype protein_coding; transcript_name Hoxd13-201; transcript_source ensembl_havana; transcript_biotype protein_coding; tag CCDS; ccds_id CCDS16139; tag basic; transcript_support_level 1;
chr2 ensembl_havana exon 74668310 74669078 . + . gene_id "ENSMUSG00000001819"; gene_version 4; transcript_id "ENSMUST00000001872"; transcript_version 4; exon_number 1; gene_name Hoxd13; gene_source ensembl_havana; gene_biotype protein_coding; transcript_name Hoxd13-201; transcript_source ensembl_havana; transcript_biotype protein_coding; tag CCDS; ccds_id CCDS16139; exon_id ENSMUSE00000336629; exon_version 2; tag basic; transcript_support_level 1;
chr2 ensembl_havana exon 74669886 74671599 . + . gene_id "ENSMUSG00000001819"; gene_version 4; transcript_id "ENSMUST00000001872"; transcript_version 4; exon_number 2; gene_name Hoxd13; gene_source ensembl_havana; gene_biotype protein_coding; transcript_name Hoxd13-201; transcript_source ensembl_havana; transcript_biotype protein_coding; tag CCDS; ccds_id CCDS16139; exon_id ENSMUSE00000601614; exon_version 3; tag basic; transcript_support_level 1;
I get:
>>> import gffutils
>>> fn="working.gtf"
>>> db = gffutils.create_db(fn, ':memory:')
/home/ldelisle/.conda/envs/pgt_3.3/lib/python3.6/site-packages/gffutils/create.py:734: UserWarning: It appears you have a gene feature in your GTF file. You may want to use the `disable_infer_genes=True` option to speed up database creation
"It appears you have a gene feature in your GTF "
/home/ldelisle/.conda/envs/pgt_3.3/lib/python3.6/site-packages/gffutils/create.py:725: UserWarning: It appears you have a transcript feature in your GTF file. You may want to use the `disable_infer_transcripts=True` option to speed up database creation
"It appears you have a transcript feature in your GTF "
>>> tr=next(db.features_of_type("transcript"))
>>> next(db.children(tr))
<Feature transcript (chr2:74668310-74671599[+]) at 0x7f00c6bad780>
I hope this helps.
I'd like to note that this was not the problem in my case, described in my comment above - my GTF file does have quotes around the ID values and still, no children are found.
Another difference is that when you called the gffutils.FeatureDB.children()
function, you got an error, while I did not - in my case, the iterator was simply empty.
Can you print a head of your gtf?
It raises an error because I did next
. If I don't do that it gives me an empty list like you.
I used the following code & here's the output (but let me know if you'd like me to do it differently):
with open(gtf_file) as gtf:
for _ in range(10):
print(next(gtf))
[out:]
#!genome-build GRCh38.p12
#!genome-version GRCh38
#!genome-date 2013-12
#!genome-build-accession NCBI:GCA_000001405.27
#!genebuild-last-updated 2018-01
1 havana gene 29554 31109 . + . gene_id "ENSG00000243485"; gene_version "5"; gene_name "MIR1302-2HG"; gene_source "havana"; gene_biotype "lincRNA"
1 havana transcript 29554 31097 . + . gene_id "ENSG00000243485"; gene_version "5"; transcript_id "ENST00000473358"; transcript_version "1"; gene_name "MIR1302-2HG"; gene_source "havana"; gene_biotype "lincRNA"; transcript_name "MIR1302-2HG-202"; transcript_source "havana"; transcript_biotype "lincRNA"; tag "basic"; transcript_support_level "5"
1 havana exon 29554 30039 . + . gene_id "ENSG00000243485"; gene_version "5"; transcript_id "ENST00000473358"; transcript_version "1"; exon_number "1"; gene_name "MIR1302-2HG"; gene_source "havana"; gene_biotype "lincRNA"; transcript_name "MIR1302-2HG-202"; transcript_source "havana"; transcript_biotype "lincRNA"; exon_id "ENSE00001947070"; exon_version "1"; tag "basic"; transcript_support_level "5"
1 havana exon 30564 30667 . + . gene_id "ENSG00000243485"; gene_version "5"; transcript_id "ENST00000473358"; transcript_version "1"; exon_number "2"; gene_name "MIR1302-2HG"; gene_source "havana"; gene_biotype "lincRNA"; transcript_name "MIR1302-2HG-202"; transcript_source "havana"; transcript_biotype "lincRNA"; exon_id "ENSE00001922571"; exon_version "1"; tag "basic"; transcript_support_level "5"
1 havana exon 30976 31097 . + . gene_id "ENSG00000243485"; gene_version "5"; transcript_id "ENST00000473358"; transcript_version "1"; exon_number "3"; gene_name "MIR1302-2HG"; gene_source "havana"; gene_biotype "lincRNA"; transcript_name "MIR1302-2HG-202"; transcript_source "havana"; transcript_biotype "lincRNA"; exon_id "ENSE00001827679"; exon_version "1"; tag "basic"; transcript_support_level "5"
Minor edit: I originally showed the head of the wrong file (mouse genome instead of human), but the format is the same - I only fixed it for consistency. The main point is the same.
@Brunox13 You should not specify id_spec
:
Here is my test.gtf:
#!genome-build GRCm38.p6
#!genome-version GRCm38
#!genome-date 2012-01
#!genome-build-accession NCBI:GCA_000001635.8
#!genebuild-last-updated 2018-03
1 ensembl_havana gene 3205901 3671498 . - . gene_id "ENSMUSG00000051951"; gene_version "5"; gene_name "Xkr4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"
1 havana transcript 3205901 3216344 . - . gene_id "ENSMUSG00000051951"; gene_version "5"; transcript_id "ENSMUST00000162897"; transcript_version "1"; gene_name "Xkr4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "Xkr4-203"; transcript_source "havana"; transcript_biotype "processed_transcript"; transcript_support_level "1"
1 havana exon 3213609 3216344 . - . gene_id "ENSMUSG00000051951"; gene_version "5"; transcript_id "ENSMUST00000162897"; transcript_version "1"; exon_number "1"; gene_name "Xkr4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "Xkr4-203"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSMUSE00000858910"; exon_version "1"; transcript_support_level "1"
1 havana exon 3205901 3207317 . - . gene_id "ENSMUSG00000051951"; gene_version "5"; transcript_id "ENSMUST00000162897"; transcript_version "1"; exon_number "2"; gene_name "Xkr4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "Xkr4-203"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSMUSE00000866652"; exon_version "1"; transcript_support_level "1"
In python:
import gffutils
# When I specify id_spec:
gffutils.create_db(
'test.gtf',
dbfn = 'gffdb.db',
force = True,
merge_strategy = 'create_unique',
id_spec = ['ID', 'Name'],
verbose = True
)
db = gffutils.FeatureDB('gffdb.db', keep_order = True)
gene = next(db.features_of_type("gene"))
gene
Output:
<Feature gene (1:3205901-3671498[-]) at 0x7efe5a3da828>
In python:
for i in db.children(gene):
print(i)
No output But:
# Without specifying id_spec:
gffutils.create_db(
'test.gtf',
dbfn = 'gffdb2.db',
force = True,
merge_strategy = 'create_unique',
# id_spec = ['ID', 'Name'],
verbose = True
)
db2 = gffutils.FeatureDB('gffdb2.db', keep_order = True)
gene = next(db2.features_of_type("gene"))
gene
Output:
<Feature gene (1:3205901-3671498[-]) at 0x7efe5a3da828>
Back to python
for i in db2.children(gene):
print(i)
This time, output:
1 ensembl_havana gene 3205901 3671498 . - . gene_id "ENSMUSG00000051951"; gene_version "5"; gene_name "Xkr4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"
1 havana transcript 3205901 3216344 . - . gene_id "ENSMUSG00000051951"; gene_version "5"; gene_name "Xkr4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_id "ENSMUST00000162897"; transcript_version "1"; transcript_name "Xkr4-203"; transcript_source "havana"; transcript_biotype "processed_transcript"; transcript_support_level "1"
1 havana exon 3213609 3216344 . - . gene_id "ENSMUSG00000051951"; gene_version "5"; gene_name "Xkr4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_id "ENSMUST00000162897"; transcript_version "1"; transcript_name "Xkr4-203"; transcript_source "havana"; transcript_biotype "processed_transcript"; transcript_support_level "1"; exon_number "1"; exon_id "ENSMUSE00000858910"; exon_version "1"
1 havana exon 3205901 3207317 . - . gene_id "ENSMUSG00000051951"; gene_version "5"; gene_name "Xkr4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_id "ENSMUST00000162897"; transcript_version "1"; transcript_name "Xkr4-203"; transcript_source "havana"; transcript_biotype "processed_transcript"; transcript_support_level "1"; exon_number "2"; exon_id "ENSMUSE00000866652"; exon_version "1"
Thank you, @lldelisle, that was causing the issue. The main reason I had specified id_spec
was because initially, database creation was taking way too long and when I first tried creating a database with minimum non-default parameters, it took over a day and still did not finish. After doing some online research (I don't remember where I found this advice anymore), I changed a few parameters (I specified id_spec = ['ID', 'Name']
, changed the merge_strategy = 'create_unique'
, and added force = True
) and it worked in under ~20 mins, so I was happy and had no idea it would result in gffutils.FeatureDB.children()
not working down the line.
In light of this, I now tested how long gffutils.create_db()
takes to process my entire GTF file with and without including the id_spec
. While the sizes of the resulting .db files are almost identical at around 2.3 GB, the time it took is quite different:
id_spec = ['ID', 'Name']
: 0h:15m:19sid_spec = None
): 5h:28m:7sAs you can see, the difference is substantial, though I really only need to do this once per GTF file, so it's definitely worth it in exchange for the ability to find the children features. Thanks for your help!
To decrease the time, you can remove all genes/transcripts from your gtf with:
cat initial.gtf | awk '$3!="gene" && $3!="transcript"{print}' > filtered.gtf
Because when you have both exons and transcript/gene defined it takes longer because it needs to deal with potential conflicts between what is in the gene/transcript definition and what is in the exon definition. The database with initial.gtf and filtered.gtf is exactly the same if the gtf is well done as it looks to be from the first lines.
Thank you - this sounds like a good advice for someone using the gffutils.FeatureDB()
only to look at exonic features, like in RNA-seq! But I am guessing this filtering step would remove the information about the extent [start, end] of the gene and transcript features, is that right? Unfortunately, for my purposes, I need all the features (including their start-end info etc.), which is also why the proper functioning of gffutils.FeatureDB.children()
is so helpful!
No all these features can be deduced from exons. In each exon feature you have the transcript_id ... if you ask gffutils to build the db it will reconstruct a transcript where the start is the start of the first exon and the end the end of the last exon. You are not loosing any information... You can try and confirm.
That's great to know, I wasn't aware of that!! However, from my understanding, many (if not most) genes have multiple possible transcripts, which are caused not only by alternative splicing, but also alternative transcript starting & ending, all of which is also information that I am taking advantage of. Just as a quick example, the following code is giving me multiple different transcripts in terms of [start, end] per gene:
for gene in db.features_of_type('gene'):
for feat in db.children(gene):
if feat.featuretype == 'transcript':
if feat.start != gene.start or feat.end != gene.end:
print('Gene: {}-{}; Transcript: {}-{}'.format(gene.start, gene.end, feat.start, feat.end))
It seems inevitable that this information would be lost if only deduced from the 1st & last exons.
EDIT: I thought about this a little more and I realized that if each exon carries the information of which specific transcript & gene it actually belongs to, I see how the parent features could be reconstructed - transcripts from exons and, in turn, genes from transcripts. Do you know if this is the case with all types of GTF/GFF annotation files?
I never met a gtf where it did not work but I don't know... I am using gtf files from ensembl.
Thank you all for the discussion and helping out answer this. I just want to point out that in cases where the genes and transcripts are already present, disable_infer_genes=True, disable_infer_transcripts=True
will disable that inference behavior without requiring grepping them out in a previous step and dramatically (i.e. order of magnitude) speed up database creation.
@Brunox13 regarding the question in the edited version of your last comment, the canonical GTF format uses transcript as the parent of exons, and genes as the parent of transcripts...and transcripts and genes themselves never actually appear in the file. However this is certainly not globally the case, and the examples in the docs show some real-world cases that totally ignore the spec.
Hi- I've been trying to use gffutils to parse an NCBI GFF and ran into an issue getting the children. I'm testing with a small gff of three genes pulled from the original gff. Things started out OK:
And, i can get genes
and a specific gene
but no children:
Is this just user error or something with the NCBI GFF?