daler / gffutils

GFF and GTF file manipulation and interconversion
http://daler.github.io/gffutils
MIT License

Inconsistent behaviour of trailing semicolon #207

Closed dariober closed 1 year ago

dariober commented 1 year ago

Sorry - me again. gffutils (v0.11.1) behaves inconsistently on GFF lines with and without a trailing semicolon. In some cases the trailing semicolon produces an attribute with an empty string as key and None as value; in other cases the empty key is dropped. Here are some examples:

import gffutils

txt="""\
chr1 AUGUSTUS gene 68330 73621 1 - . ID=g1903;
chr1 AUGUSTUS mRNA 68330 73621 1 - . ID=g1903.t1;Parent=g1903;
chr1 Pfam protein_match 73372 73618 1 - . ID=g1903.t1.d1;Parent=g1903.t1;
chr1 Pfam protein_hmm_match 73372 73618 1 - . ID=g1903.t1.d1.1;Parent=g1903.t1.d1;
"""

db = gffutils.create_db(txt.replace(' ', '\t'), ':memory:', from_string=True)
for f in db.all_features():
    print(f.attributes.keys())

dict_keys(['ID'])
dict_keys(['ID', 'Parent'])
dict_keys(['ID', 'Parent'])
dict_keys(['ID', 'Parent'])

txt="""\
chr1 AUGUSTUS gene 68330 73621 1 - . ID=g1903;
chr1 AUGUSTUS mRNA 68330 73621 1 - . ID=g1903.t1;Parent=g1903;
chr1 Pfam protein_match 73372 73618 1 - . ID=g1903.t1.d1;Parent=g1903.t1
chr1 Pfam protein_hmm_match 73372 73618 1 - . ID=g1903.t1.d1.1;Parent=g1903.t1.d1
"""

db = gffutils.create_db(txt.replace(' ', '\t'), ':memory:', from_string=True)
for f in db.all_features():
    print(f.attributes.keys())

dict_keys(['ID', ''])
dict_keys(['ID', 'Parent', ''])
dict_keys(['ID', 'Parent'])
dict_keys(['ID', 'Parent'])

txt="""\
chr1 AUGUSTUS gene 68330 73621 1 - . ID=g1903;
chr1 AUGUSTUS mRNA 68330 73621 1 - . ID=g1903.t1;
chr1 Pfam protein_match 73372 73618 1 - . ID=g1903.t1.d1
chr1 Pfam protein_hmm_match 73372 73618 1 - . ID=g1903.t1.d1.1
"""

db = gffutils.create_db(txt.replace(' ', '\t'), ':memory:', from_string=True)
for f in db.all_features():
    print(f.attributes.keys())

dict_keys(['ID'])
dict_keys(['ID'])
dict_keys(['ID'])
dict_keys(['ID'])

txt="""\
chr1 AUGUSTUS gene 68330 73621 1 - . ID=g1903
chr1 AUGUSTUS mRNA 68330 73621 1 - . ID=g1903.t1
chr1 Pfam protein_match 73372 73618 1 - . ID=g1903.t1.d1;
chr1 Pfam protein_hmm_match 73372 73618 1 - . ID=g1903.t1.d1.1;
"""

db = gffutils.create_db(txt.replace(' ', '\t'), ':memory:', from_string=True)
for f in db.all_features():
    print(f.attributes.keys())

dict_keys(['ID'])
dict_keys(['ID'])
dict_keys(['ID', ''])
dict_keys(['ID', ''])

It's not a big deal, but it would be nice to have consistent handling of the trailing ;. Personally, I don't see the point of an empty-string key with a None value, so I would be happy to always drop them.

I came across this behavior when, in some cases, I inserted a key into the attribute list and the feature printed with two consecutive semicolons, like ID=foo;Parent=bar;;gene_id=spam, which I guess is harmless but looks confusing.

daler commented 1 year ago

@dariober, see #209 for some explanation. After looking at this more closely, it's actually expected behavior, for two reasons.

First, the dialect inference is designed to handle inconsistencies by weighting more highly those lines with more attribute keys. The assumption is that more keys means more information for inferring dialect. See this line in helpers._choose_dialect(). That's why your second example above ends up inferring no trailing semicolon: the first line has only one attribute and its dialect is downweighted, resulting in the detected dialect having db.dialect['trailing semicolon'] = False.

Second, if there's a tie (which happens in the last two examples above, each with 4 lines: two with trailing semicolons and two without), then the dialect falls back to the first one observed as a tiebreaker (this line).

In fact, as demonstrated over in #209, if you add another line to act as a tiebreaker then you can force the dialect one way or the other, so it's actually consistent. Or I guess "internally consistent" would be more accurate.
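
The weighting and tiebreaker rules described above can be modelled with a small sketch. This is not the code in helpers._choose_dialect(); it is a simplified stand-in that assumes each line votes for its own style with a weight equal to its number of attribute keys, and that ties go to the first style observed. The function name infer_trailing_semicolon is hypothetical.

```python
def infer_trailing_semicolon(attribute_strings):
    """Simplified model of the weighted vote (not gffutils' real code).

    Each attribute string votes for its own style (trailing ';' or not),
    weighted by how many keys it carries. Expects at least one string.
    """
    votes = {}  # dicts keep insertion order, so the first observed style is first
    for attrs in attribute_strings:
        has_trailing = attrs.rstrip().endswith(";")
        n_keys = len([field for field in attrs.split(";") if field.strip()])
        votes[has_trailing] = votes.get(has_trailing, 0) + n_keys
    # max() returns the first maximal item, so a tie goes to the first observed style
    return max(votes, key=votes.get)


# Attribute columns of the second and fourth examples from this thread
second = ["ID=g1903;", "ID=g1903.t1;Parent=g1903;",
          "ID=g1903.t1.d1;Parent=g1903.t1", "ID=g1903.t1.d1.1;Parent=g1903.t1.d1"]
fourth = ["ID=g1903", "ID=g1903.t1", "ID=g1903.t1.d1;", "ID=g1903.t1.d1.1;"]

print(infer_trailing_semicolon(second))  # False: weights are 3 (with) vs 4 (without)
print(infer_trailing_semicolon(fourth))  # False: 2 vs 2 tie, first line has no ';'
print(infer_trailing_semicolon(fourth + ["ID=extra;"]))  # True: extra line breaks the tie
```

Under this model, lines whose style disagrees with the winning dialect are the ones that end up with the empty-string key, which matches the outputs reported in the examples.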

I think in this case, everything is behaving as expected and I'm not sure I would want to change anything in the code. Rather, for this example, you might want to force the dialect to have no trailing semicolon (e.g., set db.dialect['trailing semicolon'] = False before printing features).
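
A minimal sketch of that suggestion, assuming db is a gffutils FeatureDB built with create_db as in the examples above. The helper name lines_without_trailing_semicolon is hypothetical, and the thread does not confirm whether overriding the dialect also removes empty-string keys that were already stored at creation time, so no output is claimed here.

```python
def lines_without_trailing_semicolon(db):
    """Override the inferred dialect, then render every feature as text.

    Assumes `db` exposes a `dialect` dict and an `all_features()` iterator,
    as a gffutils FeatureDB does in the examples in this thread.
    """
    # Force the setting regardless of what dialect inference chose
    db.dialect["trailing semicolon"] = False
    return [str(feature) for feature in db.all_features()]


# Usage, with gffutils installed and `txt` defined as in the examples above:
#   import gffutils
#   db = gffutils.create_db(txt.replace(' ', '\t'), ':memory:', from_string=True)
#   for line in lines_without_trailing_semicolon(db):
#       print(line)
```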