Gaius-Augustus / GUSHR

Generating UTRs from SHort Reads
GNU General Public License v3.0
9 stars 2 forks source link

Corrupted gene start/end fields #5

Open abs-yy opened 2 years ago

abs-yy commented 2 years ago

Hi, first of all, big thanks for the great software! I've been trying to get UTR predictions using --addUTR in braker1 and got corrupted gene and transcript rows; the start/end positions of the gene/transcripts are not correct. There were 82 cases like this in my ~20,000 gene set. The results of a test run using the test files from the GitHub repository also have this error, so I presumed it was a GUSHR error. I'm not fluent in python so I just filed an issue.

Here's some genes from the test run that have this error. You can see that the gene and transcript rows edited by GUSHR has mistaken start and stop positions.

% perl -lane 'print if @F[3]>@F[4] && @F[2] eq "gene"' ~/temp/utrs.gtf  | cut -f 9 | sort | uniq | grep -f - ~/temp/utrs.gtf
2R  GUSHR   gene    10078919    8467054     .   +   .   g693
2R  GUSHR   transcript  10078919    8467054     .   +   .   g693.t1
2R  AnnotationFinalizer 5'-UTR  8467039     8467054     .   +   .   transcript_id "g693.t1"; gene_id "g693";
2R  AnnotationFinalizer 5'-UTR  10078919    10079011    .   +   .   transcript_id "g693.t1"; gene_id "g693";
2R  AUGUSTUS    start_codon 10079012    10079014    .   +   0   transcript_id "g693.t1"; gene_id "g693";
2R  AUGUSTUS    CDS 10079012    10079182    1   +   0   transcript_id "g693.t1"; gene_id "g693";
2R  AUGUSTUS    CDS 10081507    10081632    1   +   0   transcript_id "g693.t1"; gene_id "g693";
2R  AUGUSTUS    CDS 10082642    10082967    1   +   0   transcript_id "g693.t1"; gene_id "g693";
2R  AUGUSTUS    CDS 10083030    10083134    1   +   1   transcript_id "g693.t1"; gene_id "g693";
2R  AUGUSTUS    CDS 10083262    10083326    1   +   1   transcript_id "g693.t1"; gene_id "g693";
2R  AUGUSTUS    CDS 10083404    10083493    1   +   2   transcript_id "g693.t1"; gene_id "g693";
2R  AUGUSTUS    CDS 10083574    10083938    1   +   2   transcript_id "g693.t1"; gene_id "g693";
2R  AUGUSTUS    stop_codon  10083936    10083938    .   +   0   transcript_id "g693.t1"; gene_id "g693";
2R  AnnotationFinalizer 3'-UTR  10083939    10084210    .   +   .   transcript_id "g693.t1"; gene_id "g693";
2R  GUSHR   gene    10197107    9752095     .   +   .   g769
2R  GUSHR   transcript  10197107    9752095     .   +   .   g769.t1
2R  AnnotationFinalizer 5'-UTR  9752086     9752095     .   +   .   transcript_id "g769.t1"; gene_id "g769";
2R  AnnotationFinalizer 5'-UTR  10197107    10197116    .   +   .   transcript_id "g769.t1"; gene_id "g769";
2R  AnnotationFinalizer 5'-UTR  10648880    10648911    .   +   .   transcript_id "g769.t1"; gene_id "g769";
2R  AUGUSTUS    start_codon 10648912    10648914    .   +   0   transcript_id "g769.t1"; gene_id "g769";
2R  AUGUSTUS    CDS 10648912    10649625    1   +   0   transcript_id "g769.t1"; gene_id "g769";
2R  AUGUSTUS    stop_codon  10649623    10649625    .   +   0   transcript_id "g769.t1"; gene_id "g769";
2R  AnnotationFinalizer 3'-UTR  10649626    10649826    .   +   .   transcript_id "g769.t1"; gene_id "g769";
2R  GUSHR   gene    10005157    9999915 .   -   .   g674
2R  GUSHR   transcript  10005157    9999915 .   -   .   g674.t1
2R  AnnotationFinalizer 3'-UTR  9995034     9996490     .   -   .   transcript_id "g674.t1"; gene_id "g674";
2R  AUGUSTUS    stop_codon  9996491 9996493 .   -   0   transcript_id "g674.t1"; gene_id "g674";
2R  AUGUSTUS    CDS 9996491 9996621 1   -   2   transcript_id "g674.t1"; gene_id "g674";
2R  AUGUSTUS    CDS 9997704 9997889 1   -   2   transcript_id "g674.t1"; gene_id "g674";
2R  AUGUSTUS    CDS 9998271 9998346 1   -   0   transcript_id "g674.t1"; gene_id "g674";
2R  AUGUSTUS    CDS 9998415 9998562 1   -   1   transcript_id "g674.t1"; gene_id "g674";
2R  AUGUSTUS    CDS 9998707 9998802 1   -   1   transcript_id "g674.t1"; gene_id "g674";
2R  AUGUSTUS    CDS 9999425 9999525 1   -   0   transcript_id "g674.t1"; gene_id "g674";
2R  AUGUSTUS    CDS 9999734 9999915 1   -   2   transcript_id "g674.t1"; gene_id "g674";
2R  AUGUSTUS    CDS 10005157    10005370    0.91    -   0   transcript_id "g674.t1"; gene_id "g674";
2R  AUGUSTUS    start_codon 10005368    10005370    .   -   0   transcript_id "g674.t1"; gene_id "g674";
2R  AnnotationFinalizer 5'-UTR  10005371    10005377    .   -   .   transcript_id "g674.t1"; gene_id "g674";
2R  AnnotationFinalizer 5'-UTR  10024051    10024202    .   -   .   transcript_id "g674.t1"; gene_id "g674";
2R  AnnotationFinalizer 5'-UTR  10025850    10026019    .   -   .   transcript_id "g674.t1"; gene_id "g674";

After meddling around with gushr.py, I noticed at line 680 and 684 (and also in 693, 697), the both two values being compared were str variables, not int. Adding int() to these rows and fixing line 551 (A space was added before the tabular) resulted in a perfect test run gushr.py -t example/augustus.gtf -b example/RNAseq.bam -g example/genome.fa -o utrs2 and braker --addUTR run.