Closed dstern closed 4 years ago
Hi David, I was able to reproduce this behavior and I believe I understand the bug. It has to do with the formatting differences between gtf and gff3 files and the subsequent parsing in the reverse lift over. Liftoff outputs the attribute column as a semicolon-separated list of tag-value pairs (gff3 format), and it takes these pairs from the original input annotation. The problem is that when the input is a gtf file, the tags don't always correspond to the predefined tags in gff3 format (e.g. 'gene_id' in gtf format versus 'ID' in gff3). When you then feed this output file back into Liftoff for the reverse mapping, the gtf/gff parser has trouble determining which format the file is in because it sees gtf-style tags in gff3 syntax. As a result, it doesn't properly associate exons/CDS with their parent gene, so they are not lifted over. I will work on a fix for this, but in the meantime, if you use the gff file as input instead of the gtf, you will be able to map all features in both directions as you described.
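To make the format mismatch concrete, here is a minimal Python sketch of the two attribute styles. It is not Liftoff's actual parser; `attribute_style` and `parse_attributes` are hypothetical helpers written for illustration, and the gtf-style line is an illustrative reconstruction rather than a line copied from dmel-all-r6.35.gtf. The gff3-style line is taken from the forward-mapping output in this thread.

```python
import re

# gtf attributes look like: tag "value"; tag "value";
GTF_PAIR = re.compile(r'\s*(\S+)\s+"([^"]*)"\s*')


def attribute_style(column):
    """Guess 'gff3' or 'gtf' from the syntax of the first attribute field."""
    first = column.strip().split(";", 1)[0]
    return "gff3" if "=" in first else "gtf"


def parse_attributes(column):
    """Return a dict of tag -> value for either attribute style.

    Simplification: values containing ';' are not handled.
    """
    style = attribute_style(column)
    attrs = {}
    for field in column.strip().split(";"):
        field = field.strip()
        if not field:
            continue
        if style == "gff3":
            key, _, value = field.partition("=")
        else:
            match = GTF_PAIR.fullmatch(field)
            if match is None:
                continue
            key, value = match.groups()
        attrs[key] = value
    return attrs


# Illustrative gtf-style attribute column (assumed layout, not copied from the file).
gtf_col = 'gene_id "FBgn0002121"; gene_symbol "l(2)gl"; transcript_id "FBtr0078167";'

# Attribute column of an exon line from the forward-mapping output in this thread.
liftoff_col = (
    "gene_id=FBgn0002121;gene_symbol=l(2)gl;transcript_id=FBtr0078167;"
    "transcript_symbol=l(2)gl-RD;Parent=FBgn0002121;extra_copy_number=0"
)

gtf_attrs = parse_attributes(gtf_col)
liftoff_attrs = parse_attributes(liftoff_col)

# The output uses gff3 syntax (key=value) but keeps gtf tag names and has no ID tag.
print(attribute_style(liftoff_col), "ID" in liftoff_attrs, "Parent" in liftoff_attrs)
```

The last line shows the ambiguity: the syntax says gff3, but the exon carries a `Parent` while no feature carries the `ID` tag a gff3 reader would normally use to resolve that parent.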
I am seeing some odd behavior. I'm not sure it's a bug, but it's causing me some confusion. I tried mapping the D. melanogaster gtf (dmel-all-r6.35.gtf) to a closely related Drosophila species. I would expect most exons to map, and to test this I am taking the gtf for the other species and trying to map it back to D. melanogaster. However, in the reverse mapping, only the gene annotations map. The exon, mRNA, etc. annotations, which are present in the "forward" mapping, do not map back.
Here are some examples. From forward mapping:
Scf_2L Liftoff gene 10263 12636 . - . gene_id=FBgn0002121;gene_symbol=l(2)gl;coverage=0.114;sequence_ID=0.097;extra_copy_number=0;copy_num_ID=FBgn0002121_0;partial_mapping=True;low_identity=True
Scf_2L Liftoff exon 11172 11331 . - . gene_id=FBgn0002121;gene_symbol=l(2)gl;transcript_id=FBtr0078167;transcript_symbol=l(2)gl-RD;Parent=FBgn0002121;extra_copy_number=0
Scf_2L Liftoff exon 11172 11326 . - . gene_id=FBgn0002121;gene_symbol=l(2)gl;transcript_id=FBtr0078168;transcript_symbol=l(2)gl-RE;Parent=FBgn0002121;extra_copy_number=0
Scf_2L Liftoff exon 11172 11331 . - . gene_id=FBgn0002121;gene_symbol=l(2)gl;transcript_id=FBtr0078169;transcript_symbol=l(2)gl-RF;Parent=FBgn0002121;extra_copy_number=0
Scf_2L Liftoff exon 11172 11326 . - . gene_id=FBgn0002121;gene_symbol=l(2)gl;transcript_id=FBtr0306589;transcript_symbol=l(2)gl-RG;Parent=FBgn0002121;extra_copy_number=0
Scf_2L Liftoff exon 11172 11331 . - . gene_id=FBgn0002121;gene_symbol=l(2)gl;transcript_id=FBtr0306591;transcript_symbol=l(2)gl-RI;Parent=FBgn0002121;extra_copy_number=0
etc. for many lines, including CDS, 5UTR, mRNA, etc.
From the reverse mapping, I get only the following line for the same gene:
2L Liftoff gene 19041 21376 . - . gene_id=FBgn0002121;gene_symbol=l(2)gl;coverage=0.932;sequence_ID=0.799;extra_copy_number=0;copy_num_ID=gene_1_0
Any thoughts?
Many thanks!
David
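For anyone who wants to check their own output for the same symptom, the following Python sketch tallies feature types per gene in the forward and reverse outputs and reports which types disappeared. It assumes tab-separated, nine-column lines with `key=value` attributes, as in the Liftoff output quoted above; `feature_counts_by_gene` is a hypothetical helper, and the sample lines are abbreviated from the ones in this thread.

```python
from collections import Counter, defaultdict


def feature_counts_by_gene(lines, gene_tag="gene_id"):
    """Count feature types (column 3) per gene, keyed by the gene_tag attribute."""
    counts = defaultdict(Counter)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # skip malformed lines
        feature_type, attributes = cols[2], cols[8]
        attrs = dict(f.split("=", 1) for f in attributes.split(";") if "=" in f)
        gene = attrs.get(gene_tag)
        if gene:
            counts[gene][feature_type] += 1
    return counts


def make_line(seqid, ftype, start, end, attributes):
    """Build one tab-separated annotation line for the sample data."""
    return "\t".join([seqid, "Liftoff", ftype, str(start), str(end), ".", "-", ".", attributes])


# Sample data abbreviated from the lines quoted in this thread.
forward = [
    make_line("Scf_2L", "gene", 10263, 12636, "gene_id=FBgn0002121;gene_symbol=l(2)gl"),
    make_line("Scf_2L", "exon", 11172, 11331,
              "gene_id=FBgn0002121;transcript_id=FBtr0078167;Parent=FBgn0002121"),
    make_line("Scf_2L", "exon", 11172, 11326,
              "gene_id=FBgn0002121;transcript_id=FBtr0078168;Parent=FBgn0002121"),
]
reverse = [
    make_line("2L", "gene", 19041, 21376, "gene_id=FBgn0002121;gene_symbol=l(2)gl"),
]

forward_counts = feature_counts_by_gene(forward)
reverse_counts = feature_counts_by_gene(reverse)

# Feature types present in the forward mapping but absent from the reverse one.
gene = "FBgn0002121"
missing = set(forward_counts[gene]) - set(reverse_counts[gene])
print(gene, "missing in reverse mapping:", sorted(missing))
```

To run it on real files, pass an open file handle instead of the sample lists, for example `feature_counts_by_gene(open(path))`, where `path` points at your own Liftoff output.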