UCSC-LoweLab / tRNAscan-SE

A program for detection of tRNA genes
GNU General Public License v3.0

output file format: minus strand entries, extra spaces, header #19

Open darked89 opened 1 year ago

darked89 commented 1 year ago

Hello,

I have a modest proposal for improving the output file format:

minus strand entries

In the tRNAscan-SE output file, minus strand predictions have "tRNA Begin" > "tRNA End". The same goes for intron positions (if the tRNA is spliced, obviously). This is not an issue for the tRNAs themselves (the BED and FASTA files use the correct 1:142656825-142656896 interval format), but the intron coordinates have to be flipped. Would it be easier to use the same BED-like start-end-strand numbering scheme in the output?
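The flipping described above can be sketched in a few lines of Python. This is only an illustration, not part of tRNAscan-SE: the function names `normalize_interval` and `flip_intron` are hypothetical, and the sketch assumes the strand can be inferred from begin > end, as the output file implies. Coordinates stay 1-based and inclusive, so this reorders the values but does not convert them to the 0-based, half-open convention of true BED files.

```python
def normalize_interval(begin: int, end: int) -> tuple[int, int, str]:
    """Return (start, end, strand) with start <= end.

    Assumption: begin > end marks a minus strand entry.
    Coordinates stay 1-based and inclusive; only the order changes.
    """
    strand = "-" if begin > end else "+"
    return min(begin, end), max(begin, end), strand


def flip_intron(intron_begin: int, intron_end: int) -> tuple[int, int]:
    """Order intron coordinates the same way (start <= end)."""
    return min(intron_begin, intron_end), max(intron_begin, intron_end)


# Interval from the example above, written the way a minus strand
# entry appears in the output file (begin > end).
print(normalize_interval(142656896, 142656825))
# (142656825, 142656896, '-')
```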

extra spaces

To convert the output to a still human-readable but easy-to-parse TSV, I do:

tail -n +4 trnascan_out.txt | tr -d ' ' > trnascan_out.tsv
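A rough Python equivalent of that one-liner, for anyone who prefers to do the cleanup inside a script. It is a sketch under two assumptions: that the file starts with three header lines (as `tail -n +4` implies) and that fields are tab-separated and padded with spaces. Unlike `tr -d ' '`, it strips only the padding around each field, so spaces inside a field would survive. The names `clean_rows` and `convert` are made up for this example.

```python
from itertools import islice


def clean_rows(lines, header_lines=3):
    """Yield each data row as a list of fields with padding removed.

    Assumption: the first `header_lines` lines are the header and
    the remaining lines are tab-separated with space padding.
    """
    for line in islice(lines, header_lines, None):
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        yield [field.strip() for field in line.split("\t")]


def convert(in_path, out_path):
    """Write the cleaned rows of in_path to out_path as plain TSV."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for fields in clean_rows(src):
            dst.write("\t".join(fields) + "\n")
```

Calling `convert("trnascan_out.txt", "trnascan_out.tsv")` would then mirror the shell command.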

Since you have a complicated header in the file, I understand the need for the spaces. Which brings me to the next point:

header / TSV

A TSV format with named columns seems to be the default. With # comment lines at the top, it could be even easier to understand than the current format, and certainly easier to parse. For example:

"chrom", "trna_num", "trna_start", "trna_end", "trna_type", "anticodon",
"intr_start", "intr_end", "inf_score", "iso_CM", "iso_score", "note"

To fix the minus strand issue, a "strand" column should be inserted somewhere.
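To make the proposal concrete, here is a minimal Python sketch that writes records with the proposed column names plus a "strand" column. Everything in it is an assumption for illustration: the placement of "strand" after "trna_end", the helper names `add_strand` and `write_tsv`, and the sample record, which is made-up data rather than real tRNAscan-SE output. The strand is inferred from trna_start > trna_end, and both the tRNA and intron coordinates are reordered so that start <= end.

```python
import csv
import io

# Proposed column names, with "strand" added (its position is a guess).
COLUMNS = [
    "chrom", "trna_num", "trna_start", "trna_end", "strand",
    "trna_type", "anticodon", "intr_start", "intr_end",
    "inf_score", "iso_CM", "iso_score", "note",
]


def add_strand(fields):
    """Take 12 fields in the proposed order and return 13 with strand."""
    chrom, num, begin, end, trna_type, anticodon, ib, ie, *rest = fields
    begin, end = int(begin), int(end)
    strand = "-" if begin > end else "+"
    start, stop = min(begin, end), max(begin, end)
    intr_start, intr_end = sorted((int(ib), int(ie)))
    return [chrom, num, start, stop, strand, trna_type, anticodon,
            intr_start, intr_end, *rest]


def write_tsv(records, handle):
    """Write a header line followed by one tab-separated line per record."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for fields in records:
        writer.writerow(add_strand(fields))


# Made-up minus strand record, only to show the shape of the output.
example = [["chr1", "1", "200", "129", "Leu", "CAA",
            "163", "150", "60.1", "Leu", "90.2", ""]]
buffer = io.StringIO()
write_tsv(example, buffer)
print(buffer.getvalue(), end="")
```

A file written this way can be loaded by any TSV reader using the first line as column names.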

These are just my $0.02.

Thank you for developing and maintaining tRNAscan-SE.

Darek Kedra

patriciaplchan commented 1 year ago

Thanks for your suggestion. We will consider adding an optional output file in a more up-to-date TSV format. The current output file format was designed when the first version of tRNAscan-SE was released over 25 years ago. Back then, graphical user interfaces were primitive, and text files served for data visualization and display. Because tRNAscan-SE has been integrated into the genome annotation pipelines at many genome centers, changing the output format would break a lot of existing code. Therefore, we are keeping the current file format.