ablab / IsoQuant

Transcript discovery and quantification with long RNA reads (Nanopores and PacBio)
https://ablab.github.io/IsoQuant/

should extended_annotation.gtf be a superset of the input gtf? #175

Open jamestwebber opened 7 months ago

jamestwebber commented 7 months ago

This is what I assumed should happen, but it doesn't appear to be the case: my reference GTF has ~61k genes (GRCh38, gencode v39) but the output extended_annotation.gtf does not include all the known genes and transcripts (by a large margin: 23k genes). Is there some filtering going on here?
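A quick way to check the superset assumption is to compare the IDs in the two files. The sketch below is a minimal, hypothetical helper (`collect_ids` and `missing_ids` are illustration names, not part of IsoQuant). It assumes standard tab-separated GTF with quoted attribute values in column 9.

```python
import re

# Matches key "value" pairs in the GTF attribute column (column 9).
# Unquoted attributes such as `level 2;` are ignored, which is fine for IDs.
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def collect_ids(gtf_lines, key="gene_id"):
    """Return the set of values of `key` seen across all GTF records."""
    ids = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed GTF record
        for k, v in ATTR_RE.findall(fields[8]):
            if k == key:
                ids.add(v)
                break
    return ids


def missing_ids(reference_lines, output_lines, key="gene_id"):
    """IDs present in the reference but absent from the output."""
    return collect_ids(reference_lines, key) - collect_ids(output_lines, key)
```

Usage would be along the lines of opening both files and calling `missing_ids(ref_file, out_file)`; passing `key="transcript_id"` does the same check at the transcript level. An empty result means the output contains every reference ID.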

andrewprzh commented 7 months ago

Hi @jamestwebber

Yes, this is a known flaw in the current version. It is now fixed and will be out in 3.4 (hopefully soon).

Best, Andrey

andrewprzh commented 6 months ago

Should be fixed now in IsoQuant 3.4

jamestwebber commented 2 months ago

I thought this was fixed, but I'm seeing some instances where the exon information for a gene was not copied over. I wonder if this is related to whether or not reads were assigned to the gene.

jamestwebber commented 2 months ago

I noticed this initially in an unprocessed pseudogene (WASH7P) just because it happens to be very close to the beginning of chr1. So if there's any filtering based on biotype, that could also be involved.

andrewprzh commented 2 months ago

@jamestwebber

There should not be any additional filtering, so this sounds odd. What kind of information is missing? Is it the exon records? Is it possible to take a look at this example?

Thanks, Andrey

jamestwebber commented 2 months ago

Ah! This is probably a false alarm: it looks like the transcript name was not copied over to the exon records, but the exons themselves are present. I was looking for the gene name and didn't see the exons. For example, the first exon in both files:

$ grep 'ENST00000488147.1' ~/reference/GRCh38.gencode.v39.annotation.basic.gtf | head -n 2 
chr1    HAVANA  transcript      14404   29570   .       -       .       gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-201"; level 2; transcript_support_level "NA"; hgnc_id "HGNC:38034"; ont "PGO:0000005"; tag "basic"; tag "Ensembl_canonical"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1    HAVANA  exon    29534   29570   .       -       .       gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-201"; exon_number 1; exon_id "ENSE00001890219.1"; level 2; transcript_support_level "NA"; hgnc_id "HGNC:38034"; ont "PGO:0000005"; tag "basic"; tag "Ensembl_canonical"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
$ grep 'ENST00000488147.1' OUT.extended_annotation.gtf | head -n 2
chr1    HAVANA  transcript      14404   29570   .       -       .       gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; exons "11"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-201"; transcript_support_level "NA"; hgnc_id "HGNC:38034"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1"; 
chr1    HAVANA  exon    29534   29570   .       -       .       gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; exon "1"; exon_id "chr1.40908";

andrewprzh commented 2 months ago

Yes, additional information such as gene names is only copied for gene and transcript records. I can do the same for exons if needed.
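For anyone who needs the names on exon records in the meantime, a rough post-processing sketch follows. It is not IsoQuant code: `propagate_names` is a hypothetical helper, and it assumes each transcript record appears before its exons (as in the grep output above) and that attribute values are quoted.

```python
import re

# Matches key "value" pairs in the GTF attribute column (column 9).
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def propagate_names(gtf_lines, keys=("gene_name", "transcript_name")):
    """Yield GTF lines, appending `keys` from each transcript record to its exons."""
    by_transcript = {}  # transcript_id -> attributes to copy onto its exons
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 9:
            yield line  # pass comments and malformed lines through unchanged
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        tid = attrs.get("transcript_id")
        if fields[2] == "transcript" and tid:
            by_transcript[tid] = {k: attrs[k] for k in keys if k in attrs}
        elif fields[2] == "exon" and tid in by_transcript:
            # Only add attributes the exon record does not already carry.
            extra = "".join(
                f' {k} "{v}";'
                for k, v in by_transcript[tid].items()
                if k not in attrs
            )
            fields[8] = fields[8].rstrip() + extra
            line = "\t".join(fields) + "\n"
        yield line
```

Writing the yielded lines to a new file gives a GTF whose exon records carry `gene_name` and `transcript_name`; whether that is enough for a given viewer is untested here.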

jamestwebber commented 2 months ago

The reason I noticed this is because I was looking at IGV, and it wasn't displaying the exons for WASH7P, only the gene body. I think this is really a bug in how IGV is parsing the GTF (it should be matching on transcript_id), but you will probably update sooner. 😂

andrewprzh commented 2 months ago

Yeah, I thought transcript_id would be enough. Maybe converting to GFF3 and having ID and Parent attributes instead will make it work.
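To make the ID/Parent idea concrete, here is a minimal sketch of how one GTF record could be mapped to a GFF3-style record. `gtf_to_gff3_line` is a hypothetical helper, not IsoQuant code. It keeps only `ID` and `Parent`, drops every other attribute, and does no GFF3 percent-escaping, so a dedicated converter is likely the more robust choice for real files.

```python
import re

# Matches key "value" pairs in the GTF attribute column (column 9).
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def gtf_to_gff3_line(line):
    """Convert one GTF record to a GFF3-style record with ID/Parent attributes."""
    fields = line.rstrip("\n").split("\t")
    if line.startswith("#") or len(fields) < 9:
        return line  # leave comments and malformed lines untouched
    attrs = dict(ATTR_RE.findall(fields[8]))
    gid, tid = attrs.get("gene_id"), attrs.get("transcript_id")
    feature = fields[2]
    if feature == "gene" and gid:
        out = [f"ID={gid}"]
    elif feature == "transcript" and tid and gid:
        out = [f"ID={tid}", f"Parent={gid}"]
    elif tid:
        # Exons (and other sub-features) point at their transcript.
        out = [f"Parent={tid}"]
    else:
        return line  # nothing to link; keep the record as-is
    return "\t".join(fields[:8] + [";".join(out)]) + "\n"
```

With this mapping, an exon is linked to its transcript through `Parent` alone, so a viewer does not need matching name attributes on every record.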

Anyway, will fix exon information.

andrewprzh commented 1 month ago

Exon attributes should now be copied from the reference in IsoQuant 3.6.1.