NBISweden / AGAT

Another Gtf/Gff Analysis Toolkit
GNU General Public License v3.0

URL Escape Characters Converted #250

Open skchronicles opened 2 years ago

skchronicles commented 2 years ago

Describe the bug agat_convert_sp_gff2gtf.pl decodes URL escape characters in the 9th column. In my testing, it decoded the URL escape that stands for the semicolon character (;): after running agat_convert_sp_gff2gtf.pl, occurrences of %3B are converted to ;. As I understand it, these URL encodings exist to prevent problems when parsing the file later.

Is this behavior expected? Here is some documentation from your team. Please see the row about gff3 format. I already have a gff3 file (which is why the URL escape characters exist), but I would expect the same rules to apply to the GTF3 format. Wouldn't you want to avoid inserting a reserved delimiter character (like ';') within the value of a tag? Doing so makes parsing the file more of a headache later. I am not sure whether the GTF3 specification outlines how to handle such edge cases, but retaining the URL escape character seems like the better choice.

I am interested to hear your thoughts.

Before (Rickettsia_rickettsii_str_iowa_gca_000017445.ASM1744v3.49.gff3): contains %3B

Chromosome  ena ncRNA_gene  286157  288917  .   +   .   ID=gene:RrIowa_0339;biotype=rRNA;description=Large Subunit Ribosomal RNA%3B lsuRNA%3B 23S ribosomal RNA;gene_id=RrIowa_0339;logic_name=ena_rna

After (Rickettsia_rickettsii_str_iowa_gca_000017445.ASM1744v3.49.gtf): converted %3B

Chromosome  ena gene    286157  288917  .   +   .   gene_id "RrIowa_0339"; ID "gene:RrIowa_0339"; biotype "rRNA"; description "Large Subunit Ribosomal RNA; lsuRNA; 23S ribosomal RNA"; logic_name "ena_rna"; original_biotype "ncrna_gene";
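
To make the parsing concern concrete, here is a minimal sketch of what a naive parser would do with the converted attribute column above. The helper name `naive_split` is hypothetical and only illustrates a parser that splits on every semicolon without looking at quotes; it is not how AGAT or any particular tool reads GTF.

```python
# Attribute column copied from the "After" example above.
attributes = (
    'gene_id "RrIowa_0339"; ID "gene:RrIowa_0339"; biotype "rRNA"; '
    'description "Large Subunit Ribosomal RNA; lsuRNA; 23S ribosomal RNA"; '
    'logic_name "ena_rna"; original_biotype "ncrna_gene";'
)


def naive_split(column):
    """Split on every semicolon, ignoring quotes (hypothetical helper)."""
    return [field.strip() for field in column.split(";") if field.strip()]


fields = naive_split(attributes)
for field in fields:
    print(field)
```

The column holds six attributes, but the split yields eight fields because the description value is cut at each unescaped semicolon.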


To Reproduce I would just insert that escape sequence (%3B) into an attribute value of a gff3 file you have and then run the following:

# Steps for converting messy gff into properly formatted GTF file
# 1. Pull image from registry and create SIF
# module load singularity 
SINGULARITY_CACHEDIR=$PWD singularity pull \
    docker://quay.io/biocontainers/agat:0.8.0--pl5262hdfd78af_0 

# 2. Run AGAT to do the heavy lifting of gtf conversion
singularity exec -B $PWD \
    agat_0.8.0--pl5262hdfd78af_0.sif agat_convert_sp_gff2gtf.pl \
        --gff input.gff \
        -o converted.gtf
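
For anyone without a suitable file at hand, a minimal input could be generated with a short script like the sketch below. The columns are copied from the "Before" example in this report; the helper name `build_minimal_gff3` is hypothetical, and it is an assumption (not verified here) that a single gene feature is enough to trigger the behavior.

```python
# Columns copied from the "Before" example; the real file is tab-separated.
columns = [
    "Chromosome", "ena", "ncRNA_gene", "286157", "288917", ".", "+", ".",
    "ID=gene:RrIowa_0339;biotype=rRNA;"
    "description=Large Subunit Ribosomal RNA%3B lsuRNA%3B 23S ribosomal RNA;"
    "gene_id=RrIowa_0339;logic_name=ena_rna",
]


def build_minimal_gff3(fields):
    """Return a one-feature GFF3 document as a string (hypothetical helper)."""
    return "##gff-version 3\n" + "\t".join(fields) + "\n"


content = build_minimal_gff3(columns)

# Written as input.gff to match the --gff argument used in the commands above.
with open("input.gff", "w") as handle:
    handle.write(content)
```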

If you would like, I can provide you with the exact gff3 I am using. Please let me know what you think.

Expected behavior I am not sure if this is expected behavior or not based on the specification of gtf3. Maybe there is no guidance, and we live in the wild, wild west.

skchronicles commented 2 years ago

Here is some code to convert semicolons within quotes back into URL escape characters:

tmp = 'gene_id "RrIowa_0339"; ID "gene:RrIowa_0339"; biotype "rRNA"; description "Large Subunit Ribosomal RNA; lsuRNA; 23S ribosomal RNA"; logic_name "ena_rna"; original_biotype "ncrna_gene"'

# Assumes the quote character in the 9th column is a double quote (") character. This is the
# correct character to use based on the specification. More information can be found here:
# https://github.com/NBISweden/GAAS/blob/master/annotation/knowledge/gxf.md#main-points-and-differences-between-gtf-formats
def url_escape_inside_quotes(line, delimiter=';', url_encoding='%3B'):
    quote_count = 0
    inside_quotes = False
    fixed = ''
    for c in line:
        if c == '"':
            # Hit an opening or closing quote:
            # increase the counter and check
            # where we are in the string
            quote_count += 1
            inside_quotes = True

            if quote_count > 1:
                # Reached end border of quote,
                # reset boolean flag and counters
                inside_quotes = False
                quote_count = 0

        if inside_quotes:
            # Replace the reserved delimiter with
            # another character; let's use the
            # URL encoding of the character
            if c == delimiter:
                c = url_encoding

        # Add the existing/converted character 
        fixed += c

    return fixed 

# gene_id "RrIowa_0339"; ID "gene:RrIowa_0339"; biotype "rRNA"; description "Large Subunit Ribosomal RNA%3B lsuRNA%3B 23S ribosomal RNA"; logic_name "ena_rna"; original_biotype "ncrna_gene"
print(url_escape_inside_quotes(tmp))

Juke34 commented 2 years ago

In GFF3, URL escaping rules are used for tags or values containing the following characters: ",=;". Spaces are allowed in this field, but tabs must be replaced with the %09 URL escape.
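
As an illustration of the rule above, here is a minimal Python sketch of escaping and unescaping a GFF3 attribute value. The function names are hypothetical, and the character set is an assumption: it covers the comma, equals sign, semicolon and tab mentioned above, plus the percent sign so that the round trip stays unambiguous. It is not AGAT's implementation.

```python
import re
from urllib.parse import unquote

# Characters escaped in this sketch: , = ; tab and % (assumed set, see lead-in).
_RESERVED = re.compile(r"[,=;\t%]")


def escape_gff3_value(value):
    """Percent-encode reserved characters in an attribute value."""
    return _RESERVED.sub(lambda m: "%{:02X}".format(ord(m.group())), value)


def unescape_gff3_value(value):
    """Decode percent-escapes back to the original characters."""
    return unquote(value)


raw = "Large Subunit Ribosomal RNA; lsuRNA; 23S ribosomal RNA"
escaped = escape_gff3_value(raw)
print(escaped)
print(unescape_gff3_value(escaped) == raw)
```

Spaces are left untouched, so the escaped description matches the form seen in the original gff3 file.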

The code handling this in AGAT is the same for GFF and GTF, so I will try to fix that. GTF does not have any official rule about it. Since GTF quotes textual values, it does not matter whether the character is escaped or not.
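
To illustrate why quoting makes the escape optional, here is a minimal sketch of a quote-aware reader for the GTF attribute column. The function name `parse_gtf_attributes` is hypothetical, and the sketch assumes every value is wrapped in double quotes and contains no embedded double quote; it is not the parser AGAT uses.

```python
import re

# A key, whitespace, then a double-quoted value that may contain semicolons.
_ATTRIBUTE = re.compile(r'(\S+)\s+"([^"]*)"\s*;?')


def parse_gtf_attributes(column):
    """Return (key, value) pairs, keeping semicolons inside quoted values."""
    return [(m.group(1), m.group(2)) for m in _ATTRIBUTE.finditer(column)]


attributes = (
    'gene_id "RrIowa_0339"; ID "gene:RrIowa_0339"; biotype "rRNA"; '
    'description "Large Subunit Ribosomal RNA; lsuRNA; 23S ribosomal RNA"; '
    'logic_name "ena_rna"; original_biotype "ncrna_gene";'
)

pairs = parse_gtf_attributes(attributes)
for key, value in pairs:
    print(key, "->", value)
```

A reader that only splits on semicolons would not have this property, which is the case where keeping %3B in the value is the safer choice.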

skchronicles commented 2 years ago

Okay, that sounds good @Juke34.

Thank you for taking the time to look deeper into this issue. I appreciate it!