Closed novigit closed 2 years ago
Can you post the file (or part of the file) that reproduces the error?
Here is the first couple of lines of the file
tig00000012 EVM gene 248937 250896 . + . ID=ctg012.gene0001;Name=gene0001
tig00000012 EVM mRNA 248937 250896 . + . ID=ctg012.mRNA0001;Parent=ctg012.gene0001;Name=mRNA0001
tig00000012 EVM exon 248937 250153 . + . ID=ctg012.mRNA0001.exon01;Parent=ctg012.mRNA0001
tig00000012 EVM CDS 248937 250153 . + 0 ID=ctg012.mRNA0001.CDS01;Parent=ctg012.mRNA0001
tig00000012 EVM exon 250212 250310 . + . ID=ctg012.mRNA0001.exon02;Parent=ctg012.mRNA0001
tig00000012 EVM CDS 250212 250310 . + 1 ID=ctg012.mRNA0001.CDS02;Parent=ctg012.mRNA0001
tig00000012 EVM exon 250452 250896 . + . ID=ctg012.mRNA0001.exon03;Parent=ctg012.mRNA0001
tig00000012 EVM CDS 250452 250896 . + 1 ID=ctg012.mRNA0001.CDS03;Parent=ctg012.mRNA0001
tig00000012 EVM gene 250928 253264 . + . ID=ctg012.gene0002;Name=gene0002
tig00000012 EVM mRNA 250928 253264 . + . ID=ctg012.mRNA0002;Parent=ctg012.gene0002;Name=mRNA0002
tig00000012 EVM exon 250928 253264 . + . ID=ctg012.mRNA0002.exon01;Parent=ctg012.mRNA0002
tig00000012 EVM CDS 250928 253264 . + 0 ID=ctg012.mRNA0002.CDS01;Parent=ctg012.mRNA0002
tig00000012 EVM gene 253391 254953 . + . ID=ctg012.gene0003;Name=gene0003
tig00000012 EVM mRNA 253391 254953 . + . ID=ctg012.mRNA0003;Parent=ctg012.gene0003;Name=mRNA0003
tig00000012 EVM exon 253391 254953 . + . ID=ctg012.mRNA0003.exon01;Parent=ctg012.mRNA0003
tig00000012 EVM CDS 253391 254953 . + 0 ID=ctg012.mRNA0003.CDS01;Parent=ctg012.mRNA0003
tig00000012 EVM gene 255080 255926 . + . ID=ctg012.gene0004;Name=gene0004
tig00000012 EVM mRNA 255080 255926 . + . ID=ctg012.mRNA0004;Parent=ctg012.gene0004;Name=mRNA0004
tig00000012 EVM exon 255080 255843 . + . ID=ctg012.mRNA0004.exon01;Parent=ctg012.mRNA0004
tig00000012 EVM CDS 255080 255843 . + 0 ID=ctg012.mRNA0004.CDS01;Parent=ctg012.mRNA0004
tig00000012 EVM exon 255887 255926 . + . ID=ctg012.mRNA0004.exon02;Parent=ctg012.mRNA0004
tig00000012 EVM CDS 255887 255926 . + 1 ID=ctg012.mRNA0004.CDS02;Parent=ctg012.mRNA0004
tig00000012 EVM gene 256727 257758 . - . ID=ctg012.gene0005;Name=gene0005
tig00000012 EVM mRNA 256727 257758 . - . ID=ctg012.mRNA0005;Parent=ctg012.gene0005;Name=mRNA0005
tig00000012 EVM exon 256727 257758 . - . ID=ctg012.mRNA0005.exon01;Parent=ctg012.mRNA0005
tig00000012 EVM CDS 256727 257758 . - 0 ID=ctg012.mRNA0005.CDS01;Parent=ctg012.mRNA0005
tig00000012 EVM gene 257894 259663 . + . ID=ctg012.gene0006;Name=gene0006
tig00000012 EVM mRNA 257894 259663 . + . ID=ctg012.mRNA0006;Parent=ctg012.gene0006;Name=mRNA0006
tig00000012 EVM exon 257894 259141 . + . ID=ctg012.mRNA0006.exon01;Parent=ctg012.mRNA0006
tig00000012 EVM CDS 257894 259141 . + 0 ID=ctg012.mRNA0006.CDS01;Parent=ctg012.mRNA0006
tig00000012 EVM exon 259259 259663 . + . ID=ctg012.mRNA0006.exon02;Parent=ctg012.mRNA0006
tig00000012 EVM CDS 259259 259663 . + 0 ID=ctg012.mRNA0006.CDS02;Parent=ctg012.mRNA0006
When I save those lines to a file test.simple.gff3
and run your example code, it works fine. What version of gffutils are you using?
Yes I just noticed this as well! I'm currently doing some trial and error with the file to get some insights. I'm using the conda version 0.11.0
OK so I tried some things. Subsetting the GFF3 file to only records of the first contig (tig00000012) yielded no error. Subsetting to only those of the second contig (tig00000492) also yielded no error. However, when combined, the same error as above was returned.
This is the smallest subset of the file that returns the error for me. It contains records associated with the last gene of the first contig and the first two genes of the second contig
tig00000012 EVM gene 2181975 2182655 . + . ID=ctg012.gene0754;Name=gene0754
tig00000012 EVM mRNA 2181975 2182655 . + . ID=ctg012.mRNA0754;Parent=ctg012.gene0754;Name=mRNA0754
tig00000012 EVM exon 2181975 2182655 . + . ID=ctg012.mRNA0754.exon01;Parent=ctg012.mRNA0754
tig00000012 EVM CDS 2181975 2182655 . + 0 ID=ctg012.mRNA0754.CDS01;Parent=ctg012.mRNA0754
tig00000492 EVM gene 46225 47235 . - . ID=ctg492.gene0001;Name=gene0001
tig00000492 EVM mRNA 46225 47235 . - . ID=ctg492.mRNA0001;Parent=ctg492.gene0001;Name=mRNA0001
tig00000492 EVM exon 46225 47235 . - . ID=ctg492.mRNA0001.exon01;Parent=ctg492.mRNA0001
tig00000492 EVM CDS 46225 47235 . - 0 ID=ctg492.mRNA0001.CDS01;Parent=ctg492.mRNA0001
tig00000492 EVM gene 47351 48256 . - . ID=ctg492.gene0002;Name=gene0002
tig00000492 EVM mRNA 47351 48256 . - . ID=ctg492.mRNA0002;Parent=ctg492.gene0002;Name=mRNA0002
tig00000492 EVM exon 47351 48256 . - . ID=ctg492.mRNA0002.exon01;Parent=ctg492.mRNA0002
tig00000492 EVM CDS 47351 48256 . - 0 ID=ctg492.mRNA0002.CDS01;Parent=ctg492.mRNA0002
From the above file subset, db.interfeatures appears to create a single intergenic_space feature:
tig00000492 gffutils_derived intergenic_space 2182656 47350 . - . ID=ctg492.gene0001,ctg492.gene0002;Name=gene0001,gene0002
What is off is that it takes the end coordinate of the last gene in the first contig (gene0754), adds one, and uses that as the start value for the intergenic space between gene0001 and gene0002 of contig00000492
@daler Have you had any luck figuring out what is going on?
Should be fixed in #199 -- needed an overhaul of the entire method and I fixed some other things while I was in there. Your example now works, can you try with full file and reopen if it doesn't work?
The fixes are now in v0.11.1 (will take a couple hours to make it through the bioconda release, but it's available on PyPI).
It works now, also with the full file! Thanks so much!
I'm trying to infer intergenic space features with the interfeatures functionality, and then update the db with these features.
By default db.interfeatures() seems to set the ID of the intergenic feature using the ID of the neighboring genes. So for example, ID=gene0001,gene0002, The db.update() doesnt accept this. It only wants a single value for ID. So I tried to merge both IDs into a single values, i.e. 'gene0001-gene0002', but this leaves me with another cryptic error. See below.
My code:
Running this however leaves me with