Open Marison-1 opened 4 months ago
it would help to have the full call of filter_gtf_by_txid.py (it is in the GALBA.log file), and the related input files that lead to the error message.
it would help to have the full call of filter_gtf_by_txid.py (it is in the GALBA.log file), and the related input files that lead to the error message.
Hi, thank you very much for reply! This is the last report of [galba.log] file which lead to the error # Thu Mar 28 04:27:26 2024: deleting transcripts with CDS features on opposite strands and with overlapping CDS features... /public-supool/home/marison/micromamba/envs/galba/bin/python3 /public-supool/home/marison/software/GALBA/scripts/gtf_sanity_check.py -f /public-dss/share/gap_lab/MaYuSheng/Medicago/GeneAnno/galba/TLMX_Modified/GALBA/augustus.tmp.gtf -o /public-dss/share/gap_lab/MaYuSheng/Medicago/GeneAnno/galba/TLMX_Modified/GALBA/bad2.lst > /public-dss/share/gap_lab/MaYuSheng/Medicago/GeneAnno/galba/TLMX_Modified/GALBA/gtf_sanity_check.log 2> /public-dss/share/gap_lab/MaYuSheng/Medicago/GeneAnno/galba/TLMX_Modified/GALBA/errors/gtf_sanity_check.err /public-supool/home/marison/micromamba/envs/galba/bin/python3 /public-supool/home/marison/software/Augustus-master/scripts/filter_gtf_by_txid.py -g /public-dss/share/gap_lab/MaYuSheng/Medicago/GeneAnno/galba/TLMX_Modified/GALBA/augustus.tmp.gtf -l /public-dss/share/gap_lab/MaYuSheng/Medicago/GeneAnno/galba/TLMX_Modified/GALBA/bad2.lst -o /public-dss/share/gap_lab/MaYuSheng/Medicago/GeneAnno/galba/TLMX_Modified/GALBA/augustus.hints.gtf > /public-dss/share/gap_lab/MaYuSheng/Medicago/GeneAnno/galba/TLMX_Modified/GALBA/filter_gtf_by_txid.log 2> /public-dss/share/gap_lab/MaYuSheng/Medicago/GeneAnno/galba/TLMX_Modified/GALBA/errors/filter_gtf_by_txid.err
,
and this is the input data file screen shot where [KeyError: 'g27699.t3'] happend:
furthermore,this is the script named "filter_gtf_by_txid.py" in augustus scripts directory `#!/usr/bin/env python3
author = "Katharina J. Hoff with a little help of ChatGPT" copyright = "Copyright 2022-2023. All rights reserved." credits = "Katharina Hoff" version = "1.0.1" email = "katharina.hoff@uni-greifswald.de" status = "development"
import csv import re import argparse
parser = argparse.ArgumentParser(description='Remove transcripts form a gtf file') parser.add_argument('-g', '--gtf', required=True, type=str, help="Name of (sorted) gtf file") parser.add_argument('-l', '--exclude_lst', required=True, type=str, help="Text file with transcript IDs") parser.add_argument('-o', '--output_file', required=True, type=str, help="Output file in gtf format") args = parser.parse_args()
transcripts_to_remove = set()
with open(args.exclude_lst, "r") as transcript_ids_file: for line in transcript_ids_file: transcripts_to_remove.add(line.strip())
gene_data = {}
with open(args.gtf, "r") as gtf_file: reader = csv.reader(gtf_file, delimiter="\t") for row in reader: if row[2] == "CDS" or row[2]=="exon" or row[2]=="intron" or row[2]=="start_codon" or row[2]=="stop_codon":
transcript_id = row[8].split("transcript_id ")[1].split(";")[0].strip('"')
gene_id = row[8].split("gene_id ")[1].split(";")[0].strip('"')
elif row[2]=="transcript":
transcript_id = row[8].strip()
elif row[2]=="gene":
gene_id = row[8].strip()
if row[2]=="gene":
gene_data[gene_id] = {"gene_line": row}
elif row[2]=="transcript":
gene_data[gene_id][transcript_id]=[row]
else:
gene_data[gene_id][transcript_id].append(row)
for tx in transcripts_to_remove: match = re.match(r'([^"]*g\d+).t\d+', tx) gid = match.group(1) if gid in gene_data: if tx in gene_data[gid]: del gene_data[gid][tx] if gid in gene_data: if len(gene_data[gid]) == 1: del gene_data[gid]
with open(args.output_file, "w") as safe: for key, value in gene_data.items(): safe.write("\t".join(value['gene_line']) + "\n") del value['gene_line'] for innerkey in value.keys(): for ele in value[innerkey]: safe.write("\t".join(ele) + "\n")
`
the error report indicate that 49 line of above script has a KeyError in my data,I checked the them again and foud that the directory {"gene_data[gene_id]"} needs a key["transcript_id"] which require the column[2] is 'transcript' (elif row[2]=="transcript":), but in my data,[g27699.t3] seems dosen't have a transcript type, which may lead to the error.
If this kind of situation is common, I hope it will be helpful for script modification or it's just a data problem.
Thanks for your reply again!
Hi, thank you for share this nice tools! I met a error when I using galba `Traceback (most recent call last): File "/public-supool/home/marison/software/Augustus-master/scripts/filter_gtf_by_txid.py", line 49, in
gene_data[gene_id][transcript_id].append(row)