alejandrogzi / gtfsort

a lexicographically-based GTF/GFF sorter
MIT License
28 stars 3 forks source link

After sorting, part CDS entries missing #4

Closed dxu104 closed 10 months ago

dxu104 commented 10 months ago

Hi alejandrogzi,

Hi Alejandro,

I've been testing the gtfsort tool on the latest axolotl GTF version (AmexT_v47-AmexG_v6.0-DD.gtf) downloaded from Axolotl-omics. However, I encountered an issue with data reduction after sorting.

Issue Description:

Input File: AmexT_v47-AmexG_v6.0-DD.gtf (1,977,265 rows). Output File: reference_gtf_after_gtfsort.gtf (1,425,753 rows).

Before Sort: image After Sort: image

Below content is what my mentor think about which part CDS is missing:

The gtfsort program is only printing out one CDS entry (in column 3) per transcript. Here’s an example, for two transcripts of gene AMEX60DD000031. The first capture is from the original file, the second from the sorted file. There should be four CDS entries for each transcript, but only the final one is included in each case. (Seems like perhaps all entries are being written into one location, so only the final one persists, perhaps)

[jhgraber@random testGtfsort]$ grep AMEX60DD000031 AmexT_v47-AmexG_v6.0-DD.gtf | cut -b 1-100

chr10p ambMex60DD gene 10258638 10502225 1000 - . gene_id "AMEX60DD000031"; gene_name "LOC102279365

chr10p ambMex60DD transcript 10258638 10502225 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC

chr10p ambMex60DD exon 10258638 10258703 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279

chr10p ambMex60DD exon 10306400 10306498 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279

chr10p ambMex60DD exon 10403547 10403667 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279

chr10p ambMex60DD exon 10404202 10404245 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279

chr10p ambMex60DD exon 10502174 10502225 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279

chr10p ambMex60DD CDS 10306466 10306498 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC1022793

chr10p ambMex60DD CDS 10403547 10403667 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC1022793

chr10p ambMex60DD CDS 10404202 10404245 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC1022793

chr10p ambMex60DD CDS 10502174 10502225 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC1022793

chr10p ambMex60DD transcript 10284305 10502225 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC

chr10p ambMex60DD exon 10284305 10284358 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279

chr10p ambMex60DD exon 10306400 10306498 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279

chr10p ambMex60DD exon 10403547 10403667 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279

chr10p ambMex60DD exon 10404202 10404245 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279

chr10p ambMex60DD exon 10502174 10502225 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279

chr10p ambMex60DD CDS 10284317 10284358 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC1022793

chr10p ambMex60DD CDS 10306400 10306498 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC1022793

chr10p ambMex60DD CDS 10403547 10403667 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC1022793

chr10p ambMex60DD CDS 10404202 10404245 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC1022793

chr10p ambMex60DD CDS 10502174 10502224 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC1022793

[jhgraber@random testGtfsort]$ grep AMEX60DD000031 reference_gtf_after_gtfsort.gtf | cut -b 1-100

chr10p ambMex60DD gene 10258638 10502225 1000 - . gene_id "AMEX60DD000031"; gene_name "LOC102279365

chr10p ambMex60DD transcript 10258638 10502225 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC

chr10p ambMex60DD exon 10258638 10258703 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279

chr10p ambMex60DD exon 10306400 10306498 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279

chr10p ambMex60DD exon 10403547 10403667 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279

chr10p ambMex60DD exon 10404202 10404245 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279

chr10p ambMex60DD exon 10502174 10502225 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279

chr10p ambMex60DD CDS 10502174 10502225 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC1022793

chr10p ambMex60DD transcript 10284305 10502225 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC

chr10p ambMex60DD exon 10284305 10284358 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279

chr10p ambMex60DD exon 10306400 10306498 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279

chr10p ambMex60DD exon 10403547 10403667 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279

chr10p ambMex60DD exon 10404202 10404245 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279

chr10p ambMex60DD exon 10502174 10502225 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279

chr10p ambMex60DD CDS 10502174 10502224 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC1022793

alejandrogzi commented 10 months ago

Hi @dxu104,

Thanks for this! I will get to you once I solve this issue.

Best, Alejandro

alejandrogzi commented 10 months ago

@dxu104,

I just reviewed your issue. gftsort outputs one CDS entry because your gene model has all CDS attribute entries bad formatted:

chr10p  ambMex60DD      CDS     [...]       gene_id "AMEX60DD000031"; transcript_id "LOC102279365 [nr]|GLT1D1 [hs]|AMEX60DD301000031.1";

what I mean by "bad formatted" is that they lack the "exon_number" attribute. In theory, all exon/CDS/start_codon/stop_codon features should have that. gtfsort assumes that this exist and that is how builds the most inner layer.

When you fix the GTF with this information, the result is correct:

chr10p  ambMex60DD      gene    10258638        10502225        1000    -       .       gene_id "AMEX60DD000031"; gene_name "LOC102279365 [nr]|GLT1D1 [hs]";
chr10p  ambMex60DD      transcript      10258638        10502225        1000    -       .       gene_id "AMEX60DD000031"; transcript_id "LOC102279365 [...]
chr10p  ambMex60DD      exon    10258638        10258703        1000    -       .       gene_id "AMEX60DD000031"; transcript_id "LOC102279365 [nr]|GLT1D1 [hs]|AMEX60DD301000031.1"; exon_number "1";
chr10p  ambMex60DD      exon    10306400        10306498        1000    -       .       gene_id "AMEX60DD000031"; transcript_id "LOC102279365 [nr]|GLT1D1 [hs]|AMEX60DD301000031.1"; exon_number "2";
chr10p  ambMex60DD      CDS     10306466        10306498        1000    -       .       gene_id "AMEX60DD000031"; transcript_id "LOC102279365 [nr]|GLT1D1 [hs]|AMEX60DD301000031.1"; exon_number "2";
chr10p  ambMex60DD      exon    10403547        10403667        1000    -       .       gene_id "AMEX60DD000031"; transcript_id "LOC102279365 [nr]|GLT1D1 [hs]|AMEX60DD301000031.1"; exon_number "3";
chr10p  ambMex60DD      CDS     10403547        10403667        1000    -       .       gene_id "AMEX60DD000031"; transcript_id "LOC102279365 [nr]|GLT1D1 [hs]|AMEX60DD301000031.1"; exon_number "3";
chr10p  ambMex60DD      exon    10404202        10404245        1000    -       .       gene_id "AMEX60DD000031"; transcript_id "LOC102279365 [nr]|GLT1D1 [hs]|AMEX60DD301000031.1"; exon_number "4";
chr10p  ambMex60DD      CDS     10404202        10404245        1000    -       .       gene_id "AMEX60DD000031"; transcript_id "LOC102279365 [nr]|GLT1D1 [hs]|AMEX60DD301000031.1"; exon_number "4";
chr10p  ambMex60DD      exon    10502174        10502225        1000    -       .       gene_id "AMEX60DD000031"; transcript_id "LOC102279365 [nr]|GLT1D1 [hs]|AMEX60DD301000031.1"; exon_number "5";
chr10p  ambMex60DD      CDS     10502174        10502225        1000    -       .       gene_id "AMEX60DD000031"; transcript_id "LOC102279365 [nr]|GLT1D1 [hs]|AMEX60DD301000031.1"; exon_number "5";

Hope this helps!

Best, Alejandro

dxu104 commented 10 months ago

I think the reason you did not find the exon number is because the meaning of the entire command AmexT_v47-AmexG_v6.0-DD.gtf | cut -b 1-100 is: extract the first 100 bytes of each line from the AmexT_v47-AmexG_v6.0-DD.gtf file.

Here are whole entries: chr10p ambMex60DD gene 313039 315424 1000 + . gene_id "AMEX60DD000001"; gene_name "ZFP37 [nr]|ZNF568 [hs]"; chr10p ambMex60DD transcript 313039 315424 1000 + . gene_id "AMEX60DD000001"; transcript_id "ZFP37 [nr]|ZNF568 [hs]|AMEX60DD301000001.1"; gene_name "ZFP37 [nr]|ZNF568 [hs]|AMEX60DD301000001.1"; chr10p ambMex60DD exon 313039 314183 1000 + . gene_id "AMEX60DD000001"; transcript_id "ZFP37 [nr]|ZNF568 [hs]|AMEX60DD301000001.1"; exon_number "1"; chr10p ambMex60DD exon 315024 315424 1000 + . gene_id "AMEX60DD000001"; transcript_id "ZFP37 [nr]|ZNF568 [hs]|AMEX60DD301000001.1"; exon_number "2"; chr10p ambMex60DD transcript 313800 315023 1000 + . gene_id "AMEX60DD000001"; transcript_id "ZNF79 [hs]|AMEX60DD301000001.2"; gene_name "ZNF79 [hs]|AMEX60DD301000001.2"; chr10p ambMex60DD exon 313800 313931 1000 + . gene_id "AMEX60DD000001"; transcript_id "ZNF79 [hs]|AMEX60DD301000001.2"; exon_number "1"; chr10p ambMex60DD exon 314856 315023 1000 + . gene_id "AMEX60DD000001"; transcript_id "ZNF79 [hs]|AMEX60DD301000001.2"; exon_number "2"; chr10p ambMex60DD gene 313679 358550 1000 - . gene_id "AMEX60DD000002"; gene_name "LOC114595135 [nr]|ZNF268 [hs]";

alejandrogzi commented 10 months ago

@dxu104,

I downloaded this file: AmexT_v47-AmexG_v6.0-DD.gtf.gz.

cat AmexT_v47-AmexG_v6.0-DD.gtf.gz | less first lines are:

chr10p  ambMex60DD      gene    313039  315424  1000    +       .       gene_id "AMEX60DD000001"; gene_name "ZFP37 [nr]|ZNF568 [hs]";
chr10p  ambMex60DD      transcript      313039  315424  1000    +       .       gene_id "AMEX60DD000001"; transcript_id "ZFP37 [nr]|ZNF568 [hs]|AMEX60DD301000001.1"; gene_name "ZFP37 [nr]|ZNF568 [hs]|AMEX60DD301000001.1";
chr10p  ambMex60DD      exon    313039  314183  1000    +       .       gene_id "AMEX60DD000001"; transcript_id "ZFP37 [nr]|ZNF568 [hs]|AMEX60DD301000001.1"; exon_number "1";
chr10p  ambMex60DD      exon    315024  315424  1000    +       .       gene_id "AMEX60DD000001"; transcript_id "ZFP37 [nr]|ZNF568 [hs]|AMEX60DD301000001.1"; exon_number "2";
chr10p  ambMex60DD      transcript      313800  315023  1000    +       .       gene_id "AMEX60DD000001"; transcript_id "ZNF79 [hs]|AMEX60DD301000001.2"; gene_name "ZNF79 [hs]|AMEX60DD301000001.2";
chr10p  ambMex60DD      exon    313800  313931  1000    +       .       gene_id "AMEX60DD000001"; transcript_id "ZNF79 [hs]|AMEX60DD301000001.2"; exon_number "1";
chr10p  ambMex60DD      exon    314856  315023  1000    +       .       gene_id "AMEX60DD000001"; transcript_id "ZNF79 [hs]|AMEX60DD301000001.2"; exon_number "2";
chr10p  ambMex60DD      gene    313679  358550  1000    -       .       gene_id "AMEX60DD000002"; gene_name "LOC114595135 [nr]|ZNF268 [hs]";

Basically the same output you have. Note that in that example (yours) there are 0 "CDS" lines.

Running grep -w "CDS" AmexT_v47-AmexG_v6.0-DD.gtf | head -4 and just extracting the CDS lines, outputs:

chr10p  ambMex60DD  CDS 313750  315750  1000    -   .   gene_id "AMEX60DD000002"; transcript_id "LOC114595135 [nr]|ZNF268 [hs]|AMEX60DD201000002.1";
chr10p  ambMex60DD  CDS 509597  512104  1000    +   .   gene_id "AMEX60DD000003"; transcript_id "LOC101953204 [nr]|ZNF850 [hs]|AMEX60DD201000003.1";

Note that here no "exon_number" attributes appear, and as I said in my previous message:

In theory, all exon/CDS/start_codon/stop_codon features should have that

In your example, only "exon" lines appear, these lines are good, the problem you are having is due to bad formatted CDS lines (like the above).

Here is an example of a complete CDS line (+ an exon line for comparison) [this is from a random assembly]:

MDVP01000001.1  ensembl exon    65995   66177   .       -       .       gene_id "ENSCPRG00005000208"; gene_version "1"; transcript_id "ENSCPRT00005000271"; transcript_version "1"; exon_number "2"; gene_name "CHPF2";
MDVP01000001.1  ensembl CDS     65995   66177   .       -       0       gene_id "ENSCPRG00005000208"; gene_version "1"; transcript_id "ENSCPRT00005000271"; transcript_version "1"; exon_number "2"; gene_name "CHPF2"; 

Note that both ("exon" and "CDS") lines have the "exon_number" attribute. It is not about the command (because I used just cat while inspecting it), I think is because the lack of this attribute (which is necessary for gtfsort) [this is actually proved by just fixing those "exon_number" attributes and running gtfsort again, which outputs all the lines, as I put in the above messsage]. Kind of strange, though, because this is the standard format for a GTF file. Maybe has something to do with the person that build this gene model. Nothing that a few lines of code could fix.

Alejandro

dxu104 commented 10 months ago

@alejandrogzi, thank you for your help. You are correct; the axolotl omics reference GTF has all CDS attribute entries marked as bad. Consequently, I switched to the Zebrafish reference GTF, which does not have any missing information.

However, I encountered an issue after appending a new transcript found by StringTie, as shown below:

But I appended one new transcript found from StringTie, like chr9p StringTie transcript 145679542 145683546 . - . transcript_id "TCONS_00065193"; gene_id "XLOC_034434"; oId "STRG.28759.1"; cmp_ref "AMEX60DD201053840.1"; class_code "p"; cmp_ref_gene "RNF26_[hs"; tss_id "TSS47892"; chr9p StringTie exon 145679542 145683546 . - . transcript_id "TCONS_00065193"; gene_id "XLOC_034434"; exon_number "1";

After using gtfsort, the appended part is missing again. Here is what I observed:

Before sort:

image

After sort:

image

Has anyone experienced a similar situation before? Thank you in advance.

Decehng

alejandrogzi commented 10 months ago

@dxu104,

Thanks for your interest!

I can download the reference GTF, would you mind to share your StringTie predicted transcripts file? I will find some time to debug that in order to make your experience with gtfsort nicer.

Alejandro

dxu104 commented 10 months ago

new_transcripts_from_StringTie.gtf.zip

Thank you for your help. The attached is transcripts to be appended and sorted.

Decheng

jhgraber commented 10 months ago

Hello @alejandrogzi - I am Decheng's mentor on this project. Thank you for your help working with it. The reference GTF is not of our creation, and it does have issues, so it's helpful to know the feature your program was looking for that is missing.

@dxu104 , the example you gave might be happening because you are appending a stringtie generated transcript to the ensembl output for zebrafish. The chromosome names will not match, as they are named in different ways. So unless gtfsort is able to add a new chromosome to absorb the stringtie transcript, it will not be able to properly include the new construct.

dxu104 commented 10 months ago

@jhgraber

Hi Joel,

Thank you for your response. Initially, I used the axolotl reference and appended new transcripts from StringTie. However, after processing with the gtfsort tool, these new transcripts were still missing. Subsequently, I attempted the same process with the zebrafish reference, but encountered the same issue.

alejandrogzi commented 10 months ago

Hi @jhgraber,

Nice to meet you! I've been talking with @dxu104 about these appending/sorting bugs you guys are having. In theory, gtfsort should handle that without any problems. I have been re-structuring all gtfsort code this morning, the goal is to publish a new release today implementing a parallel algorithm that makes it run x2 faster (~6s for 1.5GB GTF). Once I finish doing that, and before publishing it, I will take a deep look at your problems to see if something is failing with gtfsort.

I told @dxu104 that I will be writting the nf-module, galaxy-tool and docker (probably) of gtfsort soon, to be easily implemented in any pipeline. On the other hand, the Axolotl problem (person who build that gene model did not include "exon_number" in CDS lines) could be easily fix with a script that updates all those lines. With that, gtfsort will sort the file without any problem.

Once I get the solution I will contact you guys and thanks to be interested in my tool!

Best, Alejandro

dxu104 commented 10 months ago

@alejandrogzi

Here's what I did:

Attached file is to be sorted new_transcripts_from_StringTie 2.gtf.zip

Axolotl GTF before sort: image

The contents in the blue box are from the reference axolotl GTF, where I fixed the CDS exon number issue. The contents in the red box are from the stringtie axolotl GTF that I appended.

Axolotl GTF after sort: image

The result: The appended stringtie axolotl GTF is missing.

Has anyone experienced a similar situation before? Thank you in advance.

Decehng

alejandrogzi commented 10 months ago

Hi @jhgraber and @dxu104,

I have found some workarounds and tips to solve your problems. First, that GTF file needs to be corrected, it does not follow a correct GTF format. Specifically, all the "transcript_id" attributes should only preserve 1 id, not a lot of a ids separated by "|". This is how it looks like:

chr10p  ambMex60DD  [...]  gene_id "AMEX60DD000002"; transcript_id "LOC114595135 [nr]|ZNF268 [hs]|AMEX60DD201000002.1"; gene_name "LOC114595135 [nr]|ZNF268 [hs]|AMEX60DD201000002.1"; [...];

and this is how is expected to look like:

chr10p  ambMex60DD  [...]  gene_id "AMEX60DD000002"; transcript_id "AMEX60DD201000002.1"; gene_name "AMEX60DD201000002.1"; [...];

With 10 lines of code I corrected those lines. However, now the other problem is that all "CDS" lines does not have "exon_numbers". To correct that I used:

./gtfToGenePred AmexT_v47-AmexG_v6.0-DD_fixed.gtf stdout | ./genePredToBed stdin axolotl.bed

to end with a .bed file that looks like this:

chr10p  313038  315424  AMEX60DD301000001.1     0       +       315424  315424  0       2       1145,401,       0,1985,
chr10p  313799  315023  AMEX60DD301000001.2     0       +       315023  315023  0       2       132,168,        0,1056,
chr10p  313678  358550  AMEX60DD201000002.1     0       -       313749  315750  0       2       2660,190,       0,44682,
chr10p  489231  514890  AMEX60DD201000003.1     0       +       509596  512104  0       2       155,6096,       0,19563,

Why? Because then you can use bed2gtf to help you reformat all CDS lines. Note that bed2gtf needs an isoforms file (tab-delimited file with gene to transcript paris). To create the isoforms file and then covert the .bed file to a .gtf file again and sort it I did:

 awk -F'"' -v OFS='\t' '{print $1, $2}' <(awk -F'\t' '{print $9}' <(grep -w "transcript" AmexT_v47-AmexG_v6.0-DD_fixed.gtf) | cut -d '"' -f 2,4) > axolotl.isoforms && bed2gtf --bed axolotl.bed  --isoforms axolotl.isoforms --output axolotl.gtf && gtfsort -i axolotl.gtf -o axolotl.sorted.gtf

where the isoforms file looks like:

AMEX60DD000001  AMEX60DD301000001.1
AMEX60DD000001  AMEX60DD301000001.2
AMEX60DD000002  AMEX60DD201000002.1
AMEX60DD000003  AMEX60DD201000003.1
AMEX60DD000003  AMEX60DD301000003.2

and the output sorted .gtf file now looks like:

C0000568        bed2gtf gene    109816  110194  .       +       .       gene_id "AMEX60DDU001000001";
C0000568        bed2gtf transcript      109816  110194  .       +       .       gene_id "AMEX60DDU001000001"; transcript_id "AMEX60DDU001000001.1";
C0000568        bed2gtf exon    109816  110194  .       +       .       gene_id "AMEX60DDU001000001"; transcript_id "AMEX60DDU001000001.1"; exon_number "0"; exon_id "AMEX60DDU001000001.1.1";
C0000570        bed2gtf gene    39532   39934   .       +       .       gene_id "AMEX60DDU001000002";
C0000570        bed2gtf transcript      39532   39934   .       +       .       gene_id "AMEX60DDU001000002"; transcript_id "AMEX60DDU001000002.1";
C0000570        bed2gtf exon    39532   39624   .       +       .       gene_id "AMEX60DDU001000002"; transcript_id "AMEX60DDU001000002.1"; exon_number "0"; exon_id "AMEX60DDU001000002.1.1";
C0000570        bed2gtf exon    39699   39934   .       +       .       gene_id "AMEX60DDU001000002"; transcript_id "AMEX60DDU001000002.1"; exon_number "1"; exon_id "AMEX60DDU001000002.1.2";
C0000570        bed2gtf gene    128078  128467  .       -       .       gene_id "AMEX60DDU001000003";
C0000570        bed2gtf transcript      128078  128467  .       -       .       gene_id "AMEX60DDU001000003"; transcript_id "AMEX60DDU001000003.1";
C0000570        bed2gtf exon    128078  128467  .       -       .       gene_id "AMEX60DDU001000003"; transcript_id "AMEX60DDU001000003.1"; exon_number "0"; exon_id "AMEX60DDU001000003.1.1";

At this point, the original GTF file 1) has been corrected, 2) converted to .bed, 3) reconverted to .gtf and 4) sorted. Now, the second part is to append all the StringTie predicted transcripts and sort it. Another problem arises here, the predicted transcript GTF file is also bad formatted. I think this is because you used the original GTF file (that is bad-formatted) as input. What I suggest is to use this corrected GTF file: axolotl.gtf.gz; however, there is a "workaround" to make gtfsort work with that StringTie predictions. First, you need to fix that file to look like this (assuming that this information is from a novel transcript and does not need to be appended to an existent transcript):

C0002765        StringTie       transcript      153352  503168  .       +       .       transcript_id "TCONS_00000053"; gene_id "AMEX60DDU001000192"; gene_name "LOC108792107 [nr]|CPT1A [hs]"; oId "STRG.35344.4"; cmp_ref "AMEX60DDU001000192.3"; class_code "j"; tss_id "TSS44";
C0002765        StringTie       exon    153352  153518  .       +       .       transcript_id "TCONS_00000053"; gene_id "AMEX60DDU001000192"; exon_number "1";
C0002765        StringTie       exon    153605  154337  .       +       .       transcript_id "TCONS_00000053"; gene_id "AMEX60DDU001000192"; exon_number "2";
C0002765        StringTie       exon    392201  392376  .       +       .       transcript_id "TCONS_00000053"; gene_id "AMEX60DDU001000192"; exon_number "3";
C0002765        StringTie       exon    412065  412204  .       +       .       transcript_id "TCONS_00000053"; gene_id "AMEX60DDU001000192"; exon_number "4";
C0002765        StringTie       exon    419995  420169  .       +       .       transcript_id "TCONS_00000053"; gene_id "AMEX60DDU001000192"; exon_number "5";
C0002765        StringTie       exon    422863  422964  .       +       .       transcript_id "TCONS_00000053"; gene_id "AMEX60DDU001000192"; exon_number "6";
C0002765        StringTie       exon    503024  503168  .       +       .       transcript_id "TCONS_00000053"; gene_id "AMEX60DDU001000192"; exon_number "7";

Note that the "gene_id" attribute follows the same naming convention used in the big GTF file. This is mandatory because the tool (gtfsort) does not regex lines or guesses names, it expects input GTFs following the basic GTF rules (this means that all the lines belonging to the same gene must have the same gene_id, not "gene_id ABC | LOC1231231 | AMEX12312312").

Once all this have been completed, gtfsort will work correctly:

cat StringTie_transcript_fixed.gtf >> axolotl.gtf && gtfsort -i axolotl.gtf -o axolotl.sorted.gtf

here is a snippet of the output:

C0002765    bed2gtf gene    153204  520850  .   +   .   gene_id "AMEX60DDU001000192";
C0002765    bed2gtf transcript  153204  503239  .   +   .   gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.1";
C0002765    bed2gtf exon    153204  154039  .   +   .   gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.1"; exon_number "0"; exon_id "AMEX60DDU001000192.1.1";
C0002765    bed2gtf exon    392201  392376  .   +   .   gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.1"; exon_number "1"; exon_id "AMEX60DDU001000192.1.2";
C0002765    bed2gtf exon    412065  412204  .   +   .   gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.1"; exon_number "2"; exon_id "AMEX60DDU001000192.1.3";
C0002765    bed2gtf exon    419995  420169  .   +   .   gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.1"; exon_number "3"; exon_id "AMEX60DDU001000192.1.4";
C0002765    bed2gtf exon    422863  422964  .   +   .   gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.1"; exon_number "4"; exon_id "AMEX60DDU001000192.1.5";
C0002765    bed2gtf exon    503024  503239  .   +   .   gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.1"; exon_number "5"; exon_id "AMEX60DDU001000192.1.6";
C0002765    bed2gtf transcript  153312  503239  .   +   .   gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.2";
C0002765    bed2gtf exon    153312  154039  .   +   .   gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.2"; exon_number "0"; exon_id "AMEX60DDU001000192.2.1";
C0002765    bed2gtf exon    392201  392376  .   +   .   gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.2"; exon_number "1"; exon_id "AMEX60DDU001000192.2.2";
C0002765    bed2gtf exon    503024  503239  .   +   .   gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.2"; exon_number "2"; exon_id "AMEX60DDU001000192.2.3";
C0002765    bed2gtf transcript  153329  503239  .   +   .   gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.3";
C0002765    bed2gtf exon    153329  154337  .   +   .   gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.3"; exon_number "0"; exon_id "AMEX60DDU001000192.3.1";
C0002765    bed2gtf exon    392201  392376  .   +   .   gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.3"; exon_number "1"; exon_id "AMEX60DDU001000192.3.2";
C0002765    bed2gtf exon    412065  412204  .   +   .   gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.3"; exon_number "2"; exon_id "AMEX60DDU001000192.3.3";
C0002765    bed2gtf exon    419995  420169  .   +   .   gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.3"; exon_number "3"; exon_id "AMEX60DDU001000192.3.4";
C0002765    bed2gtf exon    422863  422964  .   +   .   gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.3"; exon_number "4"; exon_id "AMEX60DDU001000192.3.5";
C0002765    bed2gtf exon    503024  503239  .   +   .   gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.3"; exon_number "5"; exon_id "AMEX60DDU001000192.3.6";
C0002765    StringTie   transcript  153352  503168  .   +   .   transcript_id "TCONS_00000053"; gene_id "AMEX60DDU001000192"; gene_name "LOC108792107 [nr]|CPT1A [hs]"; oId "STRG.35344.4"; cmp_ref "AMEX60DDU001000192.3"; class_code "j"; tss_id "TSS44";
C0002765    StringTie   exon    153352  153518  .   +   .   transcript_id "TCONS_00000053"; gene_id "AMEX60DDU001000192"; exon_number "1";
C0002765    StringTie   exon    153605  154337  .   +   .   transcript_id "TCONS_00000053"; gene_id "AMEX60DDU001000192"; exon_number "2";
C0002765    StringTie   exon    392201  392376  .   +   .   transcript_id "TCONS_00000053"; gene_id "AMEX60DDU001000192"; exon_number "3";
C0002765    StringTie   exon    412065  412204  .   +   .   transcript_id "TCONS_00000053"; gene_id "AMEX60DDU001000192"; exon_number "4";
C0002765    StringTie   exon    419995  420169  .   +   .   transcript_id "TCONS_00000053"; gene_id "AMEX60DDU001000192"; exon_number "5";
C0002765    StringTie   exon    422863  422964  .   +   .   transcript_id "TCONS_00000053"; gene_id "AMEX60DDU001000192"; exon_number "6";
C0002765    StringTie   exon    503024  503168  .   +   .   transcript_id "TCONS_00000053"; gene_id "AMEX60DDU001000192"; exon_number "7";
C0002765    bed2gtf transcript  219148  520850  .   +   .   gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.4";
C0002765    bed2gtf exon    219148  219361  .   +   .   gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.4"; exon_number "0"; exon_id "AMEX60DDU001000192.4.1";
C0002765    bed2gtf exon    392201  392376  .   +   .   gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.4"; exon_number "1"; exon_id "AMEX60DDU001000192.4.2";
C0002765    bed2gtf exon    412065  412204  .   +   .   gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.4"; exon_number "2"; exon_id "AMEX60DDU001000192.4.3";
C0002765    bed2gtf exon    419995  420169  .   +   .   gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.4"; exon_number "3"; exon_id "AMEX60DDU001000192.4.4";
C0002765    bed2gtf exon    422863  422964  .   +   .   gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.4"; exon_number "4"; exon_id "AMEX60DDU001000192.4.5";
C0002765    bed2gtf exon    503024  503161  .   +   .   gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.4"; exon_number "5"; exon_id "AMEX60DDU001000192.4.6";
C0002765    bed2gtf exon    520812  520850  .   +   .   gene_id "AMEX60DDU001000192"; transcript_id "AMEX60DDU001000192.4"; exon_number "6"; exon_id "AMEX60DDU001000192.4.7";
C0002765    StringTie   transcript  219190  503168  .   +   .   transcript_id "TCONS_00000054"; gene_id "AMEX60DDU001000192"; gene_name "LOC115080999 [nr]|CPT1A [hs]"; oId "STRG.35344.5"; cmp_ref "AMEX60DDU001000192.4"; class_code "j"; tss_id "TSS45";
C0002765    StringTie   exon    219190  219361  .   +   .   transcript_id "TCONS_00000054"; gene_id "AMEX60DDU001000192"; exon_number "1";
C0002765    StringTie   exon    392201  392376  .   +   .   transcript_id "TCONS_00000054"; gene_id "AMEX60DDU001000192"; exon_number "2";
C0002765    StringTie   exon    412065  412204  .   +   .   transcript_id "TCONS_00000054"; gene_id "AMEX60DDU001000192"; exon_number "3";
C0002765    StringTie   exon    419995  420169  .   +   .   transcript_id "TCONS_00000054"; gene_id "AMEX60DDU001000192"; exon_number "4";
C0002765    StringTie   exon    422863  422964  .   +   .   transcript_id "TCONS_00000054"; gene_id "AMEX60DDU001000192"; exon_number "5";
C0002765    StringTie   exon    503024  503168  .   +   .   transcript_id "TCONS_00000054"; gene_id "AMEX60DDU001000192"; exon_number "6";

Also note that, for example, predicted transcript lines also are bad-formatted: gene_id attributes should always be at the beginning of the attribute line.

I hope this could help you, guys! Please let me know if there is another doubt or concern.

Best, Alejandro

jhgraber commented 10 months ago

Alejandro @alejandrogzi - thanks for all of your effort in tracking down these issues. The GTF was constructed not by us, and it has caused more than one headache previously. I appreciate the time it took for you to do this work.

I'd like to know what you are using as your source for defining the format. There are competing definitions available online-- this is a long standing source of frustration in the field, because with different definitions or constraints, the programs written to work with these files frequently assume or require different features. Most of the confusion, of course, comes from column 9.

After looking at the gmod pages on these formats, I'm inclined to move towards defaulting to trying to update our files as GFF3 instead of GTF/GFF2, since the definition is more clearly stated. Is your program designed to work with gff3 as well as gtf?

Thanks again for the efforts.

dxu104 commented 10 months ago

@alejandrogzi Thank you for your effort. It appears to have sorted successfully. As the gffsort tool currently lacks a container or image for integration with the Nextflow pipeline, I prefer to use my gtfinsert subworkflow (https://github.com/dxu104/gtfInsert/blob/main/nextflow/gtfinsert.nf) at the moment. I look forward to your nf-core module soon. Thanks again for all your efforts.

alejandrogzi commented 10 months ago

Hi @jhgraber and @dxu104,

Sorry for the delay response.

I'd like to know what you are using as your source for defining the format. There are competing definitions available online-- this is a long standing source of frustration in the field, because with different definitions or constraints, the programs written to work with these files frequently assume or require different features. Most of the confusion, of course, comes from column 9.

Agree. Here I am taking the Ensembl/GENCODE base convention. gtfsort assumes that - at least - a "gene_id", "transcript_id" and "exon_number" appear in that column.

After looking at the gmod pages on these formats, I'm inclined to move towards defaulting to trying to update our files as GFF3 instead of GTF/GFF2, since the definition is more clearly stated. Is your program designed to work with gff3 as well as gtf?

Sadly, no. I can modify to accept gff files, though. Another tool called "gff3sort" already exists that should work. Is not as efficient as gtfsort (takes x3-4 time) and have some problems with the attributes ordering (this is more detailed in the paper, because I considered gff3sort in the general benchmark).

I look forward to your nf-core module soon. Thanks again for all your efforts.

Thanks to you guys, it is very motivating see people using gtfsort. I would let you know when gtfsort nf-module is available.

Best, Alejandro