Open jamestwebber opened 7 months ago
Hi @jamestwebber
Yes, this is a known flaw in the current version. It is now fixed and will be out in 3.4 (hopefully soon).
Best Andrey
Should be fixed now in IsoQuant 3.4
I thought this was fixed, but I'm seeing some instances where the exon information for a gene was not copied over. I wonder if this is related to whether or not reads were assigned to the gene.
I noticed this initially in an unprocessed pseudogene (WASH7P) just because it happens to be very close to the beginning of chr1. So if there's any filtering based on biotype, that could also be involved.
@jamestwebber
There should not be any additional filtering, so that sounds odd. What kind of information is missing? Is it the exon records? Would it be possible to take a look at this example?
Thanks Andrey
Ah! This is probably a false alarm: it looks like the transcript name was not copied over, but the exons themselves are present. I was looking for the gene name and didn't see the exons. For example, the first exon in both files:
$ grep 'ENST00000488147.1' ~/reference/GRCh38.gencode.v39.annotation.basic.gtf | head -n 2
chr1 HAVANA transcript 14404 29570 . - . gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-201"; level 2; transcript_support_level "NA"; hgnc_id "HGNC:38034"; ont "PGO:0000005"; tag "basic"; tag "Ensembl_canonical"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1 HAVANA exon 29534 29570 . - . gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-201"; exon_number 1; exon_id "ENSE00001890219.1"; level 2; transcript_support_level "NA"; hgnc_id "HGNC:38034"; ont "PGO:0000005"; tag "basic"; tag "Ensembl_canonical"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
$ grep 'ENST00000488147.1' OUT.extended_annotation.gtf | head -n 2
chr1 HAVANA transcript 14404 29570 . - . gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; exons "11"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-201"; transcript_support_level "NA"; hgnc_id "HGNC:38034"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1 HAVANA exon 29534 29570 . - . gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; exon "1"; exon_id "chr1.40908";
Yes, additional information such as gene names is only copied for gene and transcript records. I can do the same for exons if needed.
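For anyone who wants to check which attributes are missing from an exon record, here is a minimal Python sketch that diffs the attribute keys of two GTF lines. The function names `parse_attributes` and `missing_keys` are made up for illustration and are not part of IsoQuant. The two records are abbreviated from the `grep` output above, with the tab separators restored.

```python
import re

# One `key "value";` or `key value;` pair from GTF column 9.
ATTR_RE = re.compile(r'(\S+)\s+("([^"]*)"|[^;\s]+)\s*;')


def parse_attributes(attr_field):
    """Return {key: [values]}; keys such as `tag` may repeat in GTF."""
    attrs = {}
    for m in ATTR_RE.finditer(attr_field):
        key = m.group(1)
        # group(3) is the unquoted content of a quoted value, if any.
        value = m.group(3) if m.group(3) is not None else m.group(2)
        attrs.setdefault(key, []).append(value)
    return attrs


def missing_keys(ref_line, out_line):
    """Attribute keys present in ref_line but absent from out_line."""
    ref = parse_attributes(ref_line.rstrip("\n").split("\t")[8])
    out = parse_attributes(out_line.rstrip("\n").split("\t")[8])
    return sorted(set(ref) - set(out))


# Abbreviated exon records from the thread (some attributes omitted).
REF_EXON = "\t".join([
    "chr1", "HAVANA", "exon", "29534", "29570", ".", "-", ".",
    'gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; '
    'gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; '
    'transcript_type "unprocessed_pseudogene"; '
    'transcript_name "WASH7P-201"; exon_number 1; '
    'exon_id "ENSE00001890219.1"; level 2; '
    'tag "basic"; tag "Ensembl_canonical";',
])
OUT_EXON = "\t".join([
    "chr1", "HAVANA", "exon", "29534", "29570", ".", "-", ".",
    'gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; '
    'exon "1"; exon_id "chr1.40908";',
])

if __name__ == "__main__":
    print(missing_keys(REF_EXON, OUT_EXON))
```

Values are collected in lists because GTF allows a key to appear more than once on a line, as `tag` does in the GENCODE record.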
The reason I noticed this is because I was looking at IGV, and it wasn't displaying the exons for WASH7P, only the gene body. I think this is really a bug in how IGV is parsing the GTF (it should be matching on transcript_id), but you will probably update sooner. 😂
Yeah, I thought transcript_id would be enough. Maybe converting to GFF3 and having ID and Parent attributes instead will make it work.
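To illustrate the GFF3 idea: in GFF3 the hierarchy is explicit, with each feature carrying an `ID` and pointing at its parent through `Parent`, instead of relying on a shared `transcript_id`. A rough sketch of how column 9 could be built is below. The helper name `gff3_attributes` is hypothetical, and whether this actually changes what IGV displays has not been checked here.

```python
def gff3_attributes(feature, gene_id, transcript_id=None):
    """Build a GFF3 column-9 string that links features via ID/Parent.

    Assumes the usual gene -> transcript -> exon hierarchy.
    """
    if feature == "gene":
        return f"ID={gene_id}"
    if feature == "transcript":
        return f"ID={transcript_id};Parent={gene_id}"
    # Exons (and other sub-features) point at their transcript.
    return f"Parent={transcript_id}"


if __name__ == "__main__":
    for feature in ("gene", "transcript", "exon"):
        print(feature, gff3_attributes(
            feature, "ENSG00000227232.5", "ENST00000488147.1"))
```

For a real conversion, a dedicated GTF-to-GFF3 converter is probably the safer route, since it will also handle escaping and the remaining attributes.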
Anyway, will fix exon information.
Exon attributes should now be copied from the reference in IsoQuant 3.6.1.
This is what I assumed should happen, but it doesn't appear to be the case: my reference GTF has ~61k genes (GRCh38, gencode v39) but the output extended_annotation.gtf does not include all the known genes and transcripts (by a large margin: 23k genes). Is there some filtering going on here?
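One way to get exact numbers for this comparison is to collect the distinct IDs from each file and diff the sets. The sketch below assumes both files contain `gene` (or `transcript`) feature lines with a quoted `gene_id` (or `transcript_id`). The function name `ids_for_feature` and the IDs in the demo are made up for illustration.

```python
import re


def ids_for_feature(lines, feature="gene", key="gene_id"):
    """Collect the distinct `key` values from GTF lines of one feature type."""
    pattern = re.compile(rf'\b{re.escape(key)} "([^"]*)"')
    ids = set()
    for line in lines:
        if line.startswith("#"):  # skip header/comment lines
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature:
            continue
        m = pattern.search(cols[8])
        if m:
            ids.add(m.group(1))
    return ids


def gtf_line(feature, attrs):
    """Build a toy GTF line; the coordinates are placeholders."""
    return "\t".join(["chr1", "TOY", feature, "1", "100", ".", "+", ".", attrs])


# Toy data with made-up IDs standing in for the reference and the output.
reference = [
    "##description: toy reference",
    gtf_line("gene", 'gene_id "GENE_A"; gene_name "A";'),
    gtf_line("gene", 'gene_id "GENE_B"; gene_name "B";'),
    gtf_line("exon", 'gene_id "GENE_B"; transcript_id "TX_B1";'),
]
output = [
    gtf_line("gene", 'gene_id "GENE_A"; gene_name "A";'),
]

if __name__ == "__main__":
    ref_ids = ids_for_feature(reference)
    out_ids = ids_for_feature(output)
    print(len(ref_ids), len(out_ids), sorted(ref_ids - out_ids))
```

On real data, an open file handle can be passed in place of the lists, for example `ids_for_feature(open(path))`, and calling it with `feature="transcript", key="transcript_id"` gives the same comparison at the transcript level.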