cgat-developers / cgat-apps

cgat-apps repository
Other
33 stars 14 forks source link

Problem while trying to filter gtf file by longest transcript #71

Open eduardopdev opened 4 years ago

eduardopdev commented 4 years ago

First of all, i downloaded Araport11 from www.arabidopsis.org Then I used cgat-apps to generate a gtf file out of the file i downloaded $ cat Araport11_GFF3_genes_transposons.201606.gff | cgat gff32gtf > Araport.gtf and had no problems.

I saw that input gtf files must be sorted. so i sorted Araport.gtf: cat Araport.gtf | cgat gtf2gtf > Araport_sort.gtf --method=sort --sort-order=gene+transcript

and then, I filtered the sorted file by longest_transcript: cat Araport_sort.gtf | cgat gtf2gtf > Araport_longest.gtf --method=filter --filter-method=longest-transcript

the problem is that gtf2gtf is not always selecting the transcript with the bigger length. On gene AT1G01030, for example, the sum of the length of all exons in transcript AT1G01030.1 is bigger than that of AT1G01030.2, but on Araport_longest.gtf the AT1G01030.2 transcript is selected.

Here is the AT1G01030 gene in the original Araport_sort.gtf:

Chr1 Araport11 three_prime_UTR 11649 11863 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.1"; ID "AT1G01030:three_prime_UTR:1"; Parent "AT1G01030.2,AT1G01030.1"; Name "NGA3:three_prime_UTR:1"; Chr1 Araport11 exon 11649 13173 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.1"; ID "AT1G01030:exon:3"; Parent "AT1G01030.1"; Name "NGA3:exon:3"; Chr1 Araport11 CDS 11864 12940 . - 0 gene_id "AT1G01030"; transcript_id "AT1G01030.1"; ID "AT1G01030:CDS:2"; Parent "AT1G01030.1"; Name "NGA3:CDS:2"; Chr1 Araport11 five_prime_UTR 12941 13173 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.1"; ID "AT1G01030:five_prime_UTR:2"; Parent "AT1G01030.2,AT1G01030.1"; Name "NGA3:five_prime_UTR:2"; Chr1 Araport11 exon 13335 13714 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.1"; ID "AT1G01030:exon:1"; Parent "AT1G01030.2,AT1G01030.1"; Name "NGA3:exon:1"; Chr1 Araport11 five_prime_UTR 13335 13714 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.1"; ID "AT1G01030:five_prime_UTR:1"; Parent "AT1G01030.2,AT1G01030.1"; Name "NGA3:five_prime_UTR:1"; Chr1 Araport11 exon 11649 12354 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:exon:4"; Parent "AT1G01030.2"; Name "NGA3:exon:4"; Chr1 Araport11 three_prime_UTR 11649 11863 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:three_prime_UTR:1"; Parent "AT1G01030.2,AT1G01030.1"; Name "NGA3:three_prime_UTR:1"; Chr1 Araport11 CDS 11864 12354 . - 2 gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:CDS:3"; Parent "AT1G01030.2"; Name "NGA3:CDS:3"; Chr1 Araport11 CDS 12424 12940 . - 0 gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:CDS:1"; Parent "AT1G01030.2"; Name "NGA3:CDS:1"; Chr1 Araport11 exon 12424 13173 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:exon:2"; Parent "AT1G01030.2"; Name "NGA3:exon:2"; Chr1 Araport11 five_prime_UTR 12941 13173 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:five_prime_UTR:2"; Parent "AT1G01030.2,AT1G01030.1"; Name "NGA3:five_prime_UTR:2"; Chr1 Araport11 exon 13335 13714 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:exon:1"; Parent "AT1G01030.2,AT1G01030.1"; Name "NGA3:exon:1"; Chr1 Araport11 five_prime_UTR 13335 13714 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:five_prime_UTR:1"; Parent "AT1G01030.2,AT1G01030.1"; Name "NGA3:five_prime_UTR:1";

And here is the same gene on the Araport_longest.gtf:

Chr1 Araport11 exon 11649 12354 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:exon:4"; Parent "AT1G01030.2"; Name "NGA3:exon:4"; Chr1 Araport11 CDS 11864 12354 . - 2 gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:CDS:3"; Parent "AT1G01030.2"; Name "NGA3:CDS:3"; Chr1 Araport11 CDS 12424 12940 . - 0 gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:CDS:1"; Parent "AT1G01030.2"; Name "NGA3:CDS:1"; Chr1 Araport11 exon 12424 13173 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:exon:2"; Parent "AT1G01030.2"; Name "NGA3:exon:2"; Chr1 Araport11 exon 13335 13714 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:exon:1"; Parent "AT1G01030.2,AT1G01030.1"; Name "NGA3:exon:1";

Acribbs commented 4 years ago

Indeed, I can see that the wrong transcript is being picked up. I will look into this further, I suspect that its to do with this bit of code: https://github.com/cgat-developers/cgat-apps/blob/4ac9f37ea625dda0206075f9d3eced3113c30a17/cgat/tools/gtf2gtf.py#L958

Things are quite busy at the moment, but when I have time I will use this gene and set of transcripts to debug further.

jarobin commented 4 years ago

Hello, I am wondering if there is an update on this issue, as I am also interested in extracting the longest transcript based on summed exon length.

Acribbs commented 4 years ago

Think this is fixed with the pull request here: https://github.com/cgat-developers/cgat-apps/pull/72

Haven't tested it extensively though.