Gaius-Augustus / GALBA

GALBA is a pipeline for fully automated prediction of protein coding gene structures with AUGUSTUS in novel eukaryotic genomes for the scenario where high quality proteins from one or several closely related species are available.
Other
121 stars 4 forks source link

Key ERROR in [filter_gtf_by_txid.err] #49

Open Marison-1 opened 4 months ago

Marison-1 commented 4 months ago

Hi, thank you for share this nice tools! I met a error when I using galba `Traceback (most recent call last): File "/public-supool/home/marison/software/Augustus-master/scripts/filter_gtf_by_txid.py", line 49, in gene_data[gene_id][transcript_id].append(row)


KeyError: 'g27699.t3'`

and this is command `/public-supool/home/marison/software/GALBA/scripts/galba.pl --gff3 --threads 20 --prg=miniprot --SCORER_PATH=/public-supool/home/marison/software/miniprot-boundary-scorer/ --MINIPROTHINT_PATH=/public-supool/home/marison/software/miniprothint/ --CDBTOOLS_PATH=/public-supool/home/marison/micromamba/envs/braker3/bin/ --genome=/public-dss/share/gap_lab/MaYuSheng/Medicago/Genome/Assembly/ragtag.scaffold.fasta --AUGUSTUS_CONFIG_PATH=/public-supool/home/marison/software/Augustus-master/CONFIG/ --AUGUSTUS_BIN_PATH=/public-supool/home/marison/software/Augustus-master/bin/ AUGUSTUS_SCRIPTS_PATH=/public-supool/home/marison/software/Augustus-master/scripts/ --species=TLMX --prot_seq=/public-dss/share/gap_lab/MaYuSheng/Medicago/GeneAnno/ProteinSeq/EvidenceProtein2.fasta `

I wonder if there are some details that I hadn't noticed or it's just a small bug in [filter_gtf_by_txid.py]? thank you very much!
KatharinaHoff commented 4 months ago

it would help to have the full call of filter_gtf_by_txid.py (it is in the GALBA.log file), and the related input files that lead to the error message.

Marison-1 commented 4 months ago

it would help to have the full call of filter_gtf_by_txid.py (it is in the GALBA.log file), and the related input files that lead to the error message.

Hi, thank you very much for reply! This is the last report of [galba.log] file which lead to the error # Thu Mar 28 04:27:26 2024: deleting transcripts with CDS features on opposite strands and with overlapping CDS features... /public-supool/home/marison/micromamba/envs/galba/bin/python3 /public-supool/home/marison/software/GALBA/scripts/gtf_sanity_check.py -f /public-dss/share/gap_lab/MaYuSheng/Medicago/GeneAnno/galba/TLMX_Modified/GALBA/augustus.tmp.gtf -o /public-dss/share/gap_lab/MaYuSheng/Medicago/GeneAnno/galba/TLMX_Modified/GALBA/bad2.lst > /public-dss/share/gap_lab/MaYuSheng/Medicago/GeneAnno/galba/TLMX_Modified/GALBA/gtf_sanity_check.log 2> /public-dss/share/gap_lab/MaYuSheng/Medicago/GeneAnno/galba/TLMX_Modified/GALBA/errors/gtf_sanity_check.err /public-supool/home/marison/micromamba/envs/galba/bin/python3 /public-supool/home/marison/software/Augustus-master/scripts/filter_gtf_by_txid.py -g /public-dss/share/gap_lab/MaYuSheng/Medicago/GeneAnno/galba/TLMX_Modified/GALBA/augustus.tmp.gtf -l /public-dss/share/gap_lab/MaYuSheng/Medicago/GeneAnno/galba/TLMX_Modified/GALBA/bad2.lst -o /public-dss/share/gap_lab/MaYuSheng/Medicago/GeneAnno/galba/TLMX_Modified/GALBA/augustus.hints.gtf > /public-dss/share/gap_lab/MaYuSheng/Medicago/GeneAnno/galba/TLMX_Modified/GALBA/filter_gtf_by_txid.log 2> /public-dss/share/gap_lab/MaYuSheng/Medicago/GeneAnno/galba/TLMX_Modified/GALBA/errors/filter_gtf_by_txid.err, and this is the input data file screen shot where [KeyError: 'g27699.t3'] happend: 360截图20240403224148501

furthermore,this is the script named "filter_gtf_by_txid.py" in augustus scripts directory `#!/usr/bin/env python3

author = "Katharina J. Hoff with a little help of ChatGPT" copyright = "Copyright 2022-2023. All rights reserved." credits = "Katharina Hoff" version = "1.0.1" email = "katharina.hoff@uni-greifswald.de" status = "development"

Import the necessary libraries

import csv import re import argparse

parser = argparse.ArgumentParser(description='Remove transcripts form a gtf file') parser.add_argument('-g', '--gtf', required=True, type=str, help="Name of (sorted) gtf file") parser.add_argument('-l', '--exclude_lst', required=True, type=str, help="Text file with transcript IDs") parser.add_argument('-o', '--output_file', required=True, type=str, help="Output file in gtf format") args = parser.parse_args()

Initialize a set to store the IDs of the transcripts to remove

transcripts_to_remove = set()

Read the transcript IDs from the file and add them to the set

with open(args.exclude_lst, "r") as transcript_ids_file: for line in transcript_ids_file: transcripts_to_remove.add(line.strip())

Initialize the dictionary to store the gene and transcript data

gene_data = {}

Open the GTF file and use the csv reader to read through the rows

with open(args.gtf, "r") as gtf_file: reader = csv.reader(gtf_file, delimiter="\t") for row in reader: if row[2] == "CDS" or row[2]=="exon" or row[2]=="intron" or row[2]=="start_codon" or row[2]=="stop_codon":

Extract the transcript ID from the row

  transcript_id = row[8].split("transcript_id ")[1].split(";")[0].strip('"')
  gene_id = row[8].split("gene_id ")[1].split(";")[0].strip('"')
elif row[2]=="transcript":
  transcript_id = row[8].strip()
elif row[2]=="gene":
  gene_id = row[8].strip()
if row[2]=="gene":
  gene_data[gene_id] = {"gene_line": row}
elif row[2]=="transcript":
  gene_data[gene_id][transcript_id]=[row]
else:
  gene_data[gene_id][transcript_id].append(row)

for tx in transcripts_to_remove: match = re.match(r'([^"]*g\d+).t\d+', tx) gid = match.group(1) if gid in gene_data: if tx in gene_data[gid]: del gene_data[gid][tx] if gid in gene_data: if len(gene_data[gid]) == 1: del gene_data[gid]

Open the GTF file and use the csv reader to read through the rows

with open(args.output_file, "w") as safe: for key, value in gene_data.items(): safe.write("\t".join(value['gene_line']) + "\n") del value['gene_line'] for innerkey in value.keys(): for ele in value[innerkey]: safe.write("\t".join(ele) + "\n")

`

the error report indicate that 49 line of above script has a KeyError in my data,I checked the them again and foud that the directory {"gene_data[gene_id]"} needs a key["transcript_id"] which require the column[2] is 'transcript' (elif row[2]=="transcript":), but in my data,[g27699.t3] seems dosen't have a transcript type, which may lead to the error.

If this kind of situation is common, I hope it will be helpful for script modification or it's just a data problem.

Thanks for your reply again!