Closed dxu104 closed 10 months ago
Hi @dxu104,
Thanks for this! I will get to you once I solve this issue.
Best, Alejandro
@dxu104,
I just reviewed your issue. gftsort outputs one CDS entry because your gene model has all CDS attribute entries bad formatted:
chr10p ambMex60DD CDS [...] gene_id "AMEX60DD000031"; transcript_id "LOC102279365 [nr]|GLT1D1 [hs]|AMEX60DD301000031.1";
what I mean by "bad formatted" is that they lack the "exon_number" attribute. In theory, all exon/CDS/start_codon/stop_codon features should have that. gtfsort assumes that this exist and that is how builds the most inner layer.
When you fix the GTF with this information, the result is correct:
chr10p ambMex60DD gene 10258638 10502225 1000 - . gene_id "AMEX60DD000031"; gene_name "LOC102279365 [nr]|GLT1D1 [hs]";
chr10p ambMex60DD transcript 10258638 10502225 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279365 [...]
chr10p ambMex60DD exon 10258638 10258703 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279365 [nr]|GLT1D1 [hs]|AMEX60DD301000031.1"; exon_number "1";
chr10p ambMex60DD exon 10306400 10306498 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279365 [nr]|GLT1D1 [hs]|AMEX60DD301000031.1"; exon_number "2";
chr10p ambMex60DD CDS 10306466 10306498 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279365 [nr]|GLT1D1 [hs]|AMEX60DD301000031.1"; exon_number "2";
chr10p ambMex60DD exon 10403547 10403667 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279365 [nr]|GLT1D1 [hs]|AMEX60DD301000031.1"; exon_number "3";
chr10p ambMex60DD CDS 10403547 10403667 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279365 [nr]|GLT1D1 [hs]|AMEX60DD301000031.1"; exon_number "3";
chr10p ambMex60DD exon 10404202 10404245 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279365 [nr]|GLT1D1 [hs]|AMEX60DD301000031.1"; exon_number "4";
chr10p ambMex60DD CDS 10404202 10404245 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279365 [nr]|GLT1D1 [hs]|AMEX60DD301000031.1"; exon_number "4";
chr10p ambMex60DD exon 10502174 10502225 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279365 [nr]|GLT1D1 [hs]|AMEX60DD301000031.1"; exon_number "5";
chr10p ambMex60DD CDS 10502174 10502225 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279365 [nr]|GLT1D1 [hs]|AMEX60DD301000031.1"; exon_number "5";
Hope this helps!
Best, Alejandro
I think the reason you did not find the exon number is because the meaning of the entire command AmexT_v47-AmexG_v6.0-DD.gtf | cut -b 1-100 is: extract the first 100 bytes of each line from the AmexT_v47-AmexG_v6.0-DD.gtf file.
Here are whole entries: chr10p ambMex60DD gene 313039 315424 1000 + . gene_id "AMEX60DD000001"; gene_name "ZFP37 [nr]|ZNF568 [hs]"; chr10p ambMex60DD transcript 313039 315424 1000 + . gene_id "AMEX60DD000001"; transcript_id "ZFP37 [nr]|ZNF568 [hs]|AMEX60DD301000001.1"; gene_name "ZFP37 [nr]|ZNF568 [hs]|AMEX60DD301000001.1"; chr10p ambMex60DD exon 313039 314183 1000 + . gene_id "AMEX60DD000001"; transcript_id "ZFP37 [nr]|ZNF568 [hs]|AMEX60DD301000001.1"; exon_number "1"; chr10p ambMex60DD exon 315024 315424 1000 + . gene_id "AMEX60DD000001"; transcript_id "ZFP37 [nr]|ZNF568 [hs]|AMEX60DD301000001.1"; exon_number "2"; chr10p ambMex60DD transcript 313800 315023 1000 + . gene_id "AMEX60DD000001"; transcript_id "ZNF79 [hs]|AMEX60DD301000001.2"; gene_name "ZNF79 [hs]|AMEX60DD301000001.2"; chr10p ambMex60DD exon 313800 313931 1000 + . gene_id "AMEX60DD000001"; transcript_id "ZNF79 [hs]|AMEX60DD301000001.2"; exon_number "1"; chr10p ambMex60DD exon 314856 315023 1000 + . gene_id "AMEX60DD000001"; transcript_id "ZNF79 [hs]|AMEX60DD301000001.2"; exon_number "2"; chr10p ambMex60DD gene 313679 358550 1000 - . gene_id "AMEX60DD000002"; gene_name "LOC114595135 [nr]|ZNF268 [hs]";
@dxu104,
I downloaded this file: AmexT_v47-AmexG_v6.0-DD.gtf.gz.
cat AmexT_v47-AmexG_v6.0-DD.gtf.gz | less
first lines are:
chr10p ambMex60DD gene 313039 315424 1000 + . gene_id "AMEX60DD000001"; gene_name "ZFP37 [nr]|ZNF568 [hs]";
chr10p ambMex60DD transcript 313039 315424 1000 + . gene_id "AMEX60DD000001"; transcript_id "ZFP37 [nr]|ZNF568 [hs]|AMEX60DD301000001.1"; gene_name "ZFP37 [nr]|ZNF568 [hs]|AMEX60DD301000001.1";
chr10p ambMex60DD exon 313039 314183 1000 + . gene_id "AMEX60DD000001"; transcript_id "ZFP37 [nr]|ZNF568 [hs]|AMEX60DD301000001.1"; exon_number "1";
chr10p ambMex60DD exon 315024 315424 1000 + . gene_id "AMEX60DD000001"; transcript_id "ZFP37 [nr]|ZNF568 [hs]|AMEX60DD301000001.1"; exon_number "2";
chr10p ambMex60DD transcript 313800 315023 1000 + . gene_id "AMEX60DD000001"; transcript_id "ZNF79 [hs]|AMEX60DD301000001.2"; gene_name "ZNF79 [hs]|AMEX60DD301000001.2";
chr10p ambMex60DD exon 313800 313931 1000 + . gene_id "AMEX60DD000001"; transcript_id "ZNF79 [hs]|AMEX60DD301000001.2"; exon_number "1";
chr10p ambMex60DD exon 314856 315023 1000 + . gene_id "AMEX60DD000001"; transcript_id "ZNF79 [hs]|AMEX60DD301000001.2"; exon_number "2";
chr10p ambMex60DD gene 313679 358550 1000 - . gene_id "AMEX60DD000002"; gene_name "LOC114595135 [nr]|ZNF268 [hs]";
Basically the same output you have. Note that in that example (yours) there are 0 "CDS" lines.
Running grep -w "CDS" AmexT_v47-AmexG_v6.0-DD.gtf | head -4
and just extracting the CDS lines, outputs:
chr10p ambMex60DD CDS 313750 315750 1000 - . gene_id "AMEX60DD000002"; transcript_id "LOC114595135 [nr]|ZNF268 [hs]|AMEX60DD201000002.1";
chr10p ambMex60DD CDS 509597 512104 1000 + . gene_id "AMEX60DD000003"; transcript_id "LOC101953204 [nr]|ZNF850 [hs]|AMEX60DD201000003.1";
Note that here no "exon_number" attributes appear, and as I said in my previous message:
In theory, all exon/CDS/start_codon/stop_codon features should have that
In your example, only "exon" lines appear, these lines are good, the problem you are having is due to bad formatted CDS lines (like the above).
Here is an example of a complete CDS line (+ an exon line for comparison) [this is from a random assembly]:
MDVP01000001.1 ensembl exon 65995 66177 . - . gene_id "ENSCPRG00005000208"; gene_version "1"; transcript_id "ENSCPRT00005000271"; transcript_version "1"; exon_number "2"; gene_name "CHPF2";
MDVP01000001.1 ensembl CDS 65995 66177 . - 0 gene_id "ENSCPRG00005000208"; gene_version "1"; transcript_id "ENSCPRT00005000271"; transcript_version "1"; exon_number "2"; gene_name "CHPF2";
Note that both ("exon" and "CDS") lines have the "exon_number" attribute. It is not about the command (because I used just cat
while inspecting it), I think is because the lack of this attribute (which is necessary for gtfsort) [this is actually proved by just fixing those "exon_number" attributes and running gtfsort again, which outputs all the lines, as I put in the above messsage]. Kind of strange, though, because this is the standard format for a GTF file. Maybe has something to do with the person that build this gene model. Nothing that a few lines of code could fix.
Alejandro
@alejandrogzi, thank you for your help. You are correct; the axolotl omics reference GTF has all CDS attribute entries marked as bad. Consequently, I switched to the Zebrafish reference GTF, which does not have any missing information.
However, I encountered an issue after appending a new transcript found by StringTie, as shown below:
But I appended one new transcript found from StringTie, like
chr9p StringTie transcript 145679542 145683546 . - . transcript_id "TCONS_00065193"; gene_id "XLOC_034434"; oId "STRG.28759.1"; cmp_ref "AMEX60DD201053840.1"; class_code "p"; cmp_ref_gene "RNF26_[hs"; tss_id "TSS47892"; chr9p StringTie exon 145679542 145683546 . - . transcript_id "TCONS_00065193"; gene_id "XLOC_034434"; exon_number "1";
After using gtfsort, the appended part is missing again. Here is what I observed:
Before sort:
After sort:
Has anyone experienced a similar situation before? Thank you in advance.
Decehng
@dxu104,
Thanks for your interest!
I can download the reference GTF, would you mind to share your StringTie predicted transcripts file? I will find some time to debug that in order to make your experience with gtfsort nicer.
Alejandro
new_transcripts_from_StringTie.gtf.zip
Thank you for your help. The attached is transcripts to be appended and sorted.
Decheng
Hello @alejandrogzi - I am Decheng's mentor on this project. Thank you for your help working with it. The reference GTF is not of our creation, and it does have issues, so it's helpful to know the feature your program was looking for that is missing.
@dxu104 , the example you gave might be happening because you are appending a stringtie generated transcript to the ensembl output for zebrafish. The chromosome names will not match, as they are named in different ways. So unless gtfsort is able to add a new chromosome to absorb the stringtie transcript, it will not be able to properly include the new construct.
@jhgraber
Hi Joel,
Thank you for your response. Initially, I used the axolotl reference and appended new transcripts from StringTie. However, after processing with the gtfsort tool, these new transcripts were still missing. Subsequently, I attempted the same process with the zebrafish reference, but encountered the same issue.
Hi @jhgraber,
Nice to meet you! I've been talking with @dxu104 about these appending/sorting bugs you guys are having. In theory, gtfsort should handle that without any problems. I have been re-structuring all gtfsort code this morning, the goal is to publish a new release today implementing a parallel algorithm that makes it run x2 faster (~6s for 1.5GB GTF). Once I finish doing that, and before publishing it, I will take a deep look at your problems to see if something is failing with gtfsort.
I told @dxu104 that I will be writting the nf-module, galaxy-tool and docker (probably) of gtfsort soon, to be easily implemented in any pipeline. On the other hand, the Axolotl problem (person who build that gene model did not include "exon_number" in CDS lines) could be easily fix with a script that updates all those lines. With that, gtfsort will sort the file without any problem.
Once I get the solution I will contact you guys and thanks to be interested in my tool!
Best, Alejandro
@alejandrogzi
Here's what I did:
Attached file is to be sorted new_transcripts_from_StringTie 2.gtf.zip
Axolotl GTF before sort:
The contents in the blue box are from the reference axolotl GTF, where I fixed the CDS exon number issue. The contents in the red box are from the stringtie axolotl GTF that I appended.
Axolotl GTF after sort:
The result: The appended stringtie axolotl GTF is missing.
Has anyone experienced a similar situation before? Thank you in advance.
Decehng
Hi @jhgraber and @dxu104,
I have found some workarounds and tips to solve your problems. First, that GTF file needs to be corrected, it does not follow a correct GTF format. Specifically, all the "transcript_id" attributes should only preserve 1 id, not a lot of a ids separated by "|". This is how it looks like:
chr10p ambMex60DD [...] gene_id "AMEX60DD000002"; transcript_id "LOC114595135 [nr]|ZNF268 [hs]|AMEX60DD201000002.1"; gene_name "LOC114595135 [nr]|ZNF268 [hs]|AMEX60DD201000002.1"; [...];
and this is how is expected to look like:
chr10p ambMex60DD [...] gene_id "AMEX60DD000002"; transcript_id "AMEX60DD201000002.1"; gene_name "AMEX60DD201000002.1"; [...];
With 10 lines of code I corrected those lines. However, now the other problem is that all "CDS" lines does not have "exon_numbers". To correct that I used:
./gtfToGenePred AmexT_v47-AmexG_v6.0-DD_fixed.gtf stdout | ./genePredToBed stdin axolotl.bed
to end with a .bed file that looks like this:
chr10p 313038 315424 AMEX60DD301000001.1 0 + 315424 315424 0 2 1145,401, 0,1985,
chr10p 313799 315023 AMEX60DD301000001.2 0 + 315023 315023 0 2 132,168, 0,1056,
chr10p 313678 358550 AMEX60DD201000002.1 0 - 313749 315750 0 2 2660,190, 0,44682,
chr10p 489231 514890 AMEX60DD201000003.1 0 + 509596 512104 0 2 155,6096, 0,19563,
Why? Because then you can use bed2gtf to help you reformat all CDS lines. Note that bed2gtf needs an isoforms file (tab-delimited file with gene to transcript paris). To create the isoforms file and then covert the .bed file to a .gtf file again and sort it I did:
awk -F'"' -v OFS='\t' '{print $1, $2}' <(awk -F'\t' '{print $9}' <(grep -w "transcript" AmexT_v47-AmexG_v6.0-DD_fixed.gtf) | cut -d '"' -f 2,4) > axolotl.isoforms && bed2gtf --bed axolotl.bed --isoforms axolotl.isoforms --output axolotl.gtf && gtfsort -i axolotl.gtf -o axolotl.sorted.gtf
where the isoforms file looks like:
AMEX60DD000001 AMEX60DD301000001.1
AMEX60DD000001 AMEX60DD301000001.2
AMEX60DD000002 AMEX60DD201000002.1
AMEX60DD000003 AMEX60DD201000003.1
AMEX60DD000003 AMEX60DD301000003.2
and the output sorted .gtf file now looks like:
C0000568 bed2gtf gene 109816 110194 . + . gene_id "AMEX60DDU001000001";
C0000568 bed2gtf transcript 109816 110194 . + . gene_id "AMEX60DDU001000001"; transcript_id "AMEX60DDU001000001.1";
C0000568 bed2gtf exon 109816 110194 . + . gene_id "AMEX60DDU001000001"; transcript_id "AMEX60DDU001000001.1"; exon_number "0"; exon_id "AMEX60DDU001000001.1.1";
C0000570 bed2gtf gene 39532 39934 . + . gene_id "AMEX60DDU001000002";
C0000570 bed2gtf transcript 39532 39934 . + . gene_id "AMEX60DDU001000002"; transcript_id "AMEX60DDU001000002.1";
C0000570 bed2gtf exon 39532 39624 . + . gene_id "AMEX60DDU001000002"; transcript_id "AMEX60DDU001000002.1"; exon_number "0"; exon_id "AMEX60DDU001000002.1.1";
C0000570 bed2gtf exon 39699 39934 . + . gene_id "AMEX60DDU001000002"; transcript_id "AMEX60DDU001000002.1"; exon_number "1"; exon_id "AMEX60DDU001000002.1.2";
C0000570 bed2gtf gene 128078 128467 . - . gene_id "AMEX60DDU001000003";
C0000570 bed2gtf transcript 128078 128467 . - . gene_id "AMEX60DDU001000003"; transcript_id "AMEX60DDU001000003.1";
C0000570 bed2gtf exon 128078 128467 . - . gene_id "AMEX60DDU001000003"; transcript_id "AMEX60DDU001000003.1"; exon_number "0"; exon_id "AMEX60DDU001000003.1.1";
At this point, the original GTF file 1) has been corrected, 2) converted to .bed, 3) reconverted to .gtf and 4) sorted. Now, the second part is to append all the StringTie predicted transcripts and sort it. Another problem arises here, the predicted transcript GTF file is also bad formatted. I think this is because you used the original GTF file (that is bad-formatted) as input. What I suggest is to use this corrected GTF file: axolotl.gtf.gz; however, there is a "workaround" to make gtfsort work with that StringTie predictions. First, you need to fix that file to look like this (assuming that this information is from a novel transcript and does not need to be appended to an existent transcript):
C0002765 StringTie transcript 153352 503168 . + . transcript_id "TCONS_00000053"; gene_id "AMEX60DDU001000192"; gene_name "LOC108792107 [nr]|CPT1A [hs]"; oId "STRG.35344.4"; cmp_ref "AMEX60DDU001000192.3"; class_code "j"; tss_id "TSS44";
C0002765 StringTie exon 153352 153518 . + . transcript_id "TCONS_00000053"; gene_id "AMEX60DDU001000192"; exon_number "1";
C0002765 StringTie exon 153605 154337 . + . transcript_id "TCONS_00000053"; gene_id "AMEX60DDU001000192"; exon_number "2";
C0002765 StringTie exon 392201 392376 . + . transcript_id "TCONS_00000053"; gene_id "AMEX60DDU001000192"; exon_number "3";
C0002765 StringTie exon 412065 412204 . + . transcript_id "TCONS_00000053"; gene_id "AMEX60DDU001000192"; exon_number "4";
C0002765 StringTie exon 419995 420169 . + . transcript_id "TCONS_00000053"; gene_id "AMEX60DDU001000192"; exon_number "5";
C0002765 StringTie exon 422863 422964 . + . transcript_id "TCONS_00000053"; gene_id "AMEX60DDU001000192"; exon_number "6";
C0002765 StringTie exon 503024 503168 . + . transcript_id "TCONS_00000053"; gene_id "AMEX60DDU001000192"; exon_number "7";
Note that the "gene_id" attribute follows the same naming convention used in the big GTF file. This is mandatory because the tool (gtfsort) does not regex lines or guesses names, it expects input GTFs following the basic GTF rules (this means that all the lines belonging to the same gene must have the same gene_id, not "gene_id ABC | LOC1231231 | AMEX12312312").
Once all this have been completed, gtfsort will work correctly:
cat StringTie_transcript_fixed.gtf >> axolotl.gtf && gtfsort -i axolotl.gtf -o axolotl.sorted.gtf
here is a snippet of the output:
C0002765 bed2gtf gene 153204 520850 . + . gene_id "AMEX60DDU001000192";
C0002765 bed2gtf transcript 153204 503239 . + . gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.1";
C0002765 bed2gtf exon 153204 154039 . + . gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.1"; exon_number "0"; exon_id "AMEX60DDU001000192.1.1";
C0002765 bed2gtf exon 392201 392376 . + . gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.1"; exon_number "1"; exon_id "AMEX60DDU001000192.1.2";
C0002765 bed2gtf exon 412065 412204 . + . gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.1"; exon_number "2"; exon_id "AMEX60DDU001000192.1.3";
C0002765 bed2gtf exon 419995 420169 . + . gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.1"; exon_number "3"; exon_id "AMEX60DDU001000192.1.4";
C0002765 bed2gtf exon 422863 422964 . + . gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.1"; exon_number "4"; exon_id "AMEX60DDU001000192.1.5";
C0002765 bed2gtf exon 503024 503239 . + . gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.1"; exon_number "5"; exon_id "AMEX60DDU001000192.1.6";
C0002765 bed2gtf transcript 153312 503239 . + . gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.2";
C0002765 bed2gtf exon 153312 154039 . + . gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.2"; exon_number "0"; exon_id "AMEX60DDU001000192.2.1";
C0002765 bed2gtf exon 392201 392376 . + . gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.2"; exon_number "1"; exon_id "AMEX60DDU001000192.2.2";
C0002765 bed2gtf exon 503024 503239 . + . gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.2"; exon_number "2"; exon_id "AMEX60DDU001000192.2.3";
C0002765 bed2gtf transcript 153329 503239 . + . gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.3";
C0002765 bed2gtf exon 153329 154337 . + . gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.3"; exon_number "0"; exon_id "AMEX60DDU001000192.3.1";
C0002765 bed2gtf exon 392201 392376 . + . gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.3"; exon_number "1"; exon_id "AMEX60DDU001000192.3.2";
C0002765 bed2gtf exon 412065 412204 . + . gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.3"; exon_number "2"; exon_id "AMEX60DDU001000192.3.3";
C0002765 bed2gtf exon 419995 420169 . + . gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.3"; exon_number "3"; exon_id "AMEX60DDU001000192.3.4";
C0002765 bed2gtf exon 422863 422964 . + . gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.3"; exon_number "4"; exon_id "AMEX60DDU001000192.3.5";
C0002765 bed2gtf exon 503024 503239 . + . gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.3"; exon_number "5"; exon_id "AMEX60DDU001000192.3.6";
C0002765 StringTie transcript 153352 503168 . + . transcript_id "TCONS_00000053"; gene_id "AMEX60DDU001000192"; gene_name "LOC108792107 [nr]|CPT1A [hs]"; oId "STRG.35344.4"; cmp_ref "AMEX60DDU001000192.3"; class_code "j"; tss_id "TSS44";
C0002765 StringTie exon 153352 153518 . + . transcript_id "TCONS_00000053"; gene_id "AMEX60DDU001000192"; exon_number "1";
C0002765 StringTie exon 153605 154337 . + . transcript_id "TCONS_00000053"; gene_id "AMEX60DDU001000192"; exon_number "2";
C0002765 StringTie exon 392201 392376 . + . transcript_id "TCONS_00000053"; gene_id "AMEX60DDU001000192"; exon_number "3";
C0002765 StringTie exon 412065 412204 . + . transcript_id "TCONS_00000053"; gene_id "AMEX60DDU001000192"; exon_number "4";
C0002765 StringTie exon 419995 420169 . + . transcript_id "TCONS_00000053"; gene_id "AMEX60DDU001000192"; exon_number "5";
C0002765 StringTie exon 422863 422964 . + . transcript_id "TCONS_00000053"; gene_id "AMEX60DDU001000192"; exon_number "6";
C0002765 StringTie exon 503024 503168 . + . transcript_id "TCONS_00000053"; gene_id "AMEX60DDU001000192"; exon_number "7";
C0002765 bed2gtf transcript 219148 520850 . + . gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.4";
C0002765 bed2gtf exon 219148 219361 . + . gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.4"; exon_number "0"; exon_id "AMEX60DDU001000192.4.1";
C0002765 bed2gtf exon 392201 392376 . + . gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.4"; exon_number "1"; exon_id "AMEX60DDU001000192.4.2";
C0002765 bed2gtf exon 412065 412204 . + . gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.4"; exon_number "2"; exon_id "AMEX60DDU001000192.4.3";
C0002765 bed2gtf exon 419995 420169 . + . gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.4"; exon_number "3"; exon_id "AMEX60DDU001000192.4.4";
C0002765 bed2gtf exon 422863 422964 . + . gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.4"; exon_number "4"; exon_id "AMEX60DDU001000192.4.5";
C0002765 bed2gtf exon 503024 503161 . + . gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.4"; exon_number "5"; exon_id "AMEX60DDU001000192.4.6";
C0002765 bed2gtf exon 520812 520850 . + . gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.4"; exon_number "6"; exon_id "AMEX60DDU001000192.4.7";
C0002765 StringTie transcript 219190 503168 . + . transcript_id "TCONS_00000054"; gene_id "AMEX60DDU001000192"; gene_name "LOC115080999 [nr]|CPT1A [hs]"; oId "STRG.35344.5"; cmp_ref "AMEX60DDU001000192.4"; class_code "j"; tss_id "TSS45";
C0002765 StringTie exon 219190 219361 . + . transcript_id "TCONS_00000054"; gene_id "AMEX60DDU001000192"; exon_number "1";
C0002765 StringTie exon 392201 392376 . + . transcript_id "TCONS_00000054"; gene_id "AMEX60DDU001000192"; exon_number "2";
C0002765 StringTie exon 412065 412204 . + . transcript_id "TCONS_00000054"; gene_id "AMEX60DDU001000192"; exon_number "3";
C0002765 StringTie exon 419995 420169 . + . transcript_id "TCONS_00000054"; gene_id "AMEX60DDU001000192"; exon_number "4";
C0002765 StringTie exon 422863 422964 . + . transcript_id "TCONS_00000054"; gene_id "AMEX60DDU001000192"; exon_number "5";
C0002765 StringTie exon 503024 503168 . + . transcript_id "TCONS_00000054"; gene_id "AMEX60DDU001000192"; exon_number "6";
Also note that, for example, predicted transcript lines also are bad-formatted: gene_id attributes should always be at the beginning of the attribute line.
I hope this could help you, guys! Please let me know if there is another doubt or concern.
Best, Alejandro
Alejandro @alejandrogzi - thanks for all of your effort in tracking down these issues. The GTF was constructed not by us, and it has caused more than one headache previously. I appreciate the time it took for you to do this work.
I'd like to know what you are using as your source for defining the format. There are competing definitions available online-- this is a long standing source of frustration in the field, because with different definitions or constraints, the programs written to work with these files frequently assume or require different features. Most of the confusion, of course, comes from column 9.
After looking at the gmod pages on these formats, I'm inclined to move towards defaulting to trying to update our files as GFF3 instead of GTF/GFF2, since the definition is more clearly stated. Is your program designed to work with gff3 as well as gtf?
Thanks again for the efforts.
@alejandrogzi Thank you for your effort. It appears to have sorted successfully. As the gffsort tool currently lacks a container or image for integration with the Nextflow pipeline, I prefer to use my gtfinsert subworkflow (https://github.com/dxu104/gtfInsert/blob/main/nextflow/gtfinsert.nf) at the moment. I look forward to your nf-core module soon. Thanks again for all your efforts.
Hi @jhgraber and @dxu104,
Sorry for the delay response.
I'd like to know what you are using as your source for defining the format. There are competing definitions available online-- this is a long standing source of frustration in the field, because with different definitions or constraints, the programs written to work with these files frequently assume or require different features. Most of the confusion, of course, comes from column 9.
Agree. Here I am taking the Ensembl/GENCODE base convention. gtfsort assumes that - at least - a "gene_id", "transcript_id" and "exon_number" appear in that column.
After looking at the gmod pages on these formats, I'm inclined to move towards defaulting to trying to update our files as GFF3 instead of GTF/GFF2, since the definition is more clearly stated. Is your program designed to work with gff3 as well as gtf?
Sadly, no. I can modify to accept gff files, though. Another tool called "gff3sort" already exists that should work. Is not as efficient as gtfsort (takes x3-4 time) and have some problems with the attributes ordering (this is more detailed in the paper, because I considered gff3sort in the general benchmark).
I look forward to your nf-core module soon. Thanks again for all your efforts.
Thanks to you guys, it is very motivating see people using gtfsort. I would let you know when gtfsort nf-module is available.
Best, Alejandro
Hi alejandrogzi,
Hi Alejandro,
I've been testing the gtfsort tool on the latest axolotl GTF version (AmexT_v47-AmexG_v6.0-DD.gtf) downloaded from Axolotl-omics. However, I encountered an issue with data reduction after sorting.
Issue Description:
Input File: AmexT_v47-AmexG_v6.0-DD.gtf (1,977,265 rows). Output File: reference_gtf_after_gtfsort.gtf (1,425,753 rows).
Before Sort: After Sort:
Below content is what my mentor think about which part CDS is missing:
The gtfsort program is only printing out one CDS entry (in column 3) per transcript. Here’s an example, for two transcripts of gene AMEX60DD000031. The first capture is from the original file, the second from the sorted file. There should be four CDS entries for each transcript, but only the final one is included in each case. (Seems like perhaps all entries are being written into one location, so only the final one persists, perhaps)
[jhgraber@random testGtfsort]$ grep AMEX60DD000031 AmexT_v47-AmexG_v6.0-DD.gtf | cut -b 1-100
chr10p ambMex60DD gene 10258638 10502225 1000 - . gene_id "AMEX60DD000031"; gene_name "LOC102279365
chr10p ambMex60DD transcript 10258638 10502225 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC
chr10p ambMex60DD exon 10258638 10258703 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD exon 10306400 10306498 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD exon 10403547 10403667 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD exon 10404202 10404245 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD exon 10502174 10502225 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD CDS 10306466 10306498 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC1022793
chr10p ambMex60DD CDS 10403547 10403667 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC1022793
chr10p ambMex60DD CDS 10404202 10404245 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC1022793
chr10p ambMex60DD CDS 10502174 10502225 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC1022793
chr10p ambMex60DD transcript 10284305 10502225 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC
chr10p ambMex60DD exon 10284305 10284358 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD exon 10306400 10306498 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD exon 10403547 10403667 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD exon 10404202 10404245 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD exon 10502174 10502225 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD CDS 10284317 10284358 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC1022793
chr10p ambMex60DD CDS 10306400 10306498 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC1022793
chr10p ambMex60DD CDS 10403547 10403667 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC1022793
chr10p ambMex60DD CDS 10404202 10404245 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC1022793
chr10p ambMex60DD CDS 10502174 10502224 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC1022793
[jhgraber@random testGtfsort]$ grep AMEX60DD000031 reference_gtf_after_gtfsort.gtf | cut -b 1-100
chr10p ambMex60DD gene 10258638 10502225 1000 - . gene_id "AMEX60DD000031"; gene_name "LOC102279365
chr10p ambMex60DD transcript 10258638 10502225 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC
chr10p ambMex60DD exon 10258638 10258703 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD exon 10306400 10306498 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD exon 10403547 10403667 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD exon 10404202 10404245 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD exon 10502174 10502225 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD CDS 10502174 10502225 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC1022793
chr10p ambMex60DD transcript 10284305 10502225 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC
chr10p ambMex60DD exon 10284305 10284358 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD exon 10306400 10306498 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD exon 10403547 10403667 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD exon 10404202 10404245 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD exon 10502174 10502225 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD CDS 10502174 10502224 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC1022793