Inconsistent behaviour of trailing semicolon

Sorry - me again. gffutils (v0.11.1) behaves inconsistently on GFF lines with or without a trailing semicolon. In some cases the trailing semicolon results in an attribute with the empty string as key and a None value; in other cases the empty key is dropped. Here are some examples:

All lines with trailing semicolon: The attribute list includes only the given keys, ID and Parent here. This is the same as all lines without trailing ;. So far so good:

import gffutils

txt="""\
chr1 AUGUSTUS gene 68330 73621 1 - . ID=g1903;
chr1 AUGUSTUS mRNA 68330 73621 1 - . ID=g1903.t1;Parent=g1903;
chr1 Pfam protein_match 73372 73618 1 - . ID=g1903.t1.d1;Parent=g1903.t1;
chr1 Pfam protein_hmm_match 73372 73618 1 - . ID=g1903.t1.d1.1;Parent=g1903.t1.d1;
"""

db = gffutils.create_db(txt.replace(' ', '\t'), ':memory:', from_string=True)
for f in db.all_features():
    print(f.attributes.keys())

dict_keys(['ID'])
dict_keys(['ID', 'Parent'])
dict_keys(['ID', 'Parent'])
dict_keys(['ID', 'Parent'])

The first two lines end in ; and the last two don't. Now the first two lines get an empty-string key:

txt="""\
chr1 AUGUSTUS gene 68330 73621 1 - . ID=g1903;
chr1 AUGUSTUS mRNA 68330 73621 1 - . ID=g1903.t1;Parent=g1903;
chr1 Pfam protein_match 73372 73618 1 - . ID=g1903.t1.d1;Parent=g1903.t1
chr1 Pfam protein_hmm_match 73372 73618 1 - . ID=g1903.t1.d1.1;Parent=g1903.t1.d1
"""

db = gffutils.create_db(txt.replace(' ', '\t'), ':memory:', from_string=True)
for f in db.all_features():
    print(f.attributes.keys())

dict_keys(['ID', ''])
dict_keys(['ID', 'Parent', ''])
dict_keys(['ID', 'Parent'])
dict_keys(['ID', 'Parent'])

Same as above, but now only the ID attribute is present, and the attribute list no longer includes the empty string as a key:

txt="""\
chr1 AUGUSTUS gene 68330 73621 1 - . ID=g1903;
chr1 AUGUSTUS mRNA 68330 73621 1 - . ID=g1903.t1;
chr1 Pfam protein_match 73372 73618 1 - . ID=g1903.t1.d1
chr1 Pfam protein_hmm_match 73372 73618 1 - . ID=g1903.t1.d1.1
"""

db = gffutils.create_db(txt.replace(' ', '\t'), ':memory:', from_string=True)
for f in db.all_features():
    print(f.attributes.keys())

dict_keys(['ID'])
dict_keys(['ID'])
dict_keys(['ID'])
dict_keys(['ID'])

Same as the previous example, but now only the last two lines have a trailing ; and they get the empty-string attribute:

txt="""\
chr1 AUGUSTUS gene 68330 73621 1 - . ID=g1903
chr1 AUGUSTUS mRNA 68330 73621 1 - . ID=g1903.t1
chr1 Pfam protein_match 73372 73618 1 - . ID=g1903.t1.d1;
chr1 Pfam protein_hmm_match 73372 73618 1 - . ID=g1903.t1.d1.1;
"""

db = gffutils.create_db(txt.replace(' ', '\t'), ':memory:', from_string=True)
for f in db.all_features():
    print(f.attributes.keys())

dict_keys(['ID'])
dict_keys(['ID'])
dict_keys(['ID', ''])
dict_keys(['ID', ''])

It's not a big deal but it would be nice to have a consistent handling of trailing ;. Personally, I don't see the point of an empty-string key with None value so I would be happy to always drop them.

I came across this behavior when in some cases I inserted a key in the attribute list and it printed with two consecutive semicolons, like ID=foo;Parent=bar;;gene_id=spam, which I guess is harmless but looks confusing.
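In the meantime, one way to sidestep the issue on the input side is to normalise the attribute column before handing the text to gffutils. This is only a sketch of a workaround, not something gffutils provides: strip_trailing_semicolons is a made-up helper, and it assumes tab-separated feature lines with exactly nine columns and no attribute value that legitimately ends in a semicolon.

```python
def strip_trailing_semicolons(gff_text):
    """Return gff_text with trailing ';' removed from the attribute column.

    Hypothetical helper for illustration; comment/directive lines and
    anything that is not a 9-column feature line are passed through as-is.
    """
    cleaned = []
    for line in gff_text.splitlines():
        fields = line.split("\t")
        if len(fields) == 9 and not line.startswith("#"):
            # Column 9 holds the attributes; drop whitespace, then any ';' at the end.
            fields[8] = fields[8].rstrip().rstrip(";")
            line = "\t".join(fields)
        cleaned.append(line)
    return "\n".join(cleaned) + "\n"


txt = """\
chr1 AUGUSTUS gene 68330 73621 1 - . ID=g1903;
chr1 AUGUSTUS mRNA 68330 73621 1 - . ID=g1903.t1;Parent=g1903;
chr1 Pfam protein_match 73372 73618 1 - . ID=g1903.t1.d1;Parent=g1903.t1
"""

# Same space-to-tab trick as in the examples above, then normalise.
normalised = strip_trailing_semicolons(txt.replace(" ", "\t"))
print(normalised)
```

The normalised string can then be passed to gffutils.create_db(..., from_string=True) exactly as in the examples above, so every line presents the same style to the dialect inference.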

@dariober, see #209 for some explanation. After looking at this more closely, it's actually expected behavior, for two reasons:

First, the dialect inference is designed to handle inconsistencies by weighting more highly those lines with more attribute keys. The assumption is that more keys means more information for inferring dialect. See this line in helpers._choose_dialect(). That's why your second example above ends up inferring no trailing semicolon: the first line has only one attribute and its dialect is downweighted, resulting in the detected dialect having db.dialect['trailing semicolon'] = False.

Second, if there's a tie (as in the last two examples above, each with four lines: two with trailing semicolons and two without), then the dialect falls back to the first one observed as a tiebreaker (this line).
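To make the two rules concrete, here is a rough sketch of a weighted vote with a first-observed tiebreaker. It is not the actual code in helpers._choose_dialect(): the name infer_trailing_semicolon is made up for illustration, and it assumes each line's weight is simply its number of attribute keys.

```python
def infer_trailing_semicolon(attribute_strings):
    """Weighted vote on whether attribute strings end with ';'.

    Illustrative only: assumes weight == number of keys on the line and
    that a tie goes to whichever setting was observed first.
    """
    votes = {}  # dicts keep insertion order, so the first-seen setting comes first
    for attrs in attribute_strings:
        observed = attrs.endswith(";")
        n_keys = len([part for part in attrs.split(";") if part])
        votes[observed] = votes.get(observed, 0) + n_keys
    # max() returns the first maximal item, so a tie falls back to the
    # setting that was observed first.
    return max(votes, key=votes.get)


second_example = [
    "ID=g1903;",
    "ID=g1903.t1;Parent=g1903;",
    "ID=g1903.t1.d1;Parent=g1903.t1",
    "ID=g1903.t1.d1.1;Parent=g1903.t1.d1",
]
third_example = ["ID=g1903;", "ID=g1903.t1;", "ID=g1903.t1.d1", "ID=g1903.t1.d1.1"]
fourth_example = ["ID=g1903", "ID=g1903.t1", "ID=g1903.t1.d1;", "ID=g1903.t1.d1.1;"]

print(infer_trailing_semicolon(second_example))  # False: 1 + 2 = 3 (trailing) vs 2 + 2 = 4 (none)
print(infer_trailing_semicolon(third_example))   # True: 2 vs 2 tie, trailing style seen first
print(infer_trailing_semicolon(fourth_example))  # False: 2 vs 2 tie, no-trailing style seen first
```

Under those assumptions the sketch reproduces the pattern in the examples: the empty key shows up exactly when the vote lands on "no trailing semicolon".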

In fact, as demonstrated over in #209, if you add another line to act as a tiebreaker then you can force the dialect one way or the other, so it's actually consistent. Or I guess "internally consistent" would be more accurate.
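For completeness, the reason a line that disagrees with the chosen dialect ends up with an empty key can be sketched like this. It is a simplified stand-in for the parser, not gffutils code: split_attributes is a hypothetical helper, and it stores plain strings rather than the value lists gffutils uses.

```python
def split_attributes(attr_str, trailing_semicolon):
    """Split a GFF3-style attribute string under a fixed dialect setting.

    Hypothetical helper for illustration. When the dialect says there is a
    trailing semicolon, it is stripped first; otherwise every ';' is
    treated as a separator, including one at the very end.
    """
    if trailing_semicolon:
        attr_str = attr_str.rstrip(";")
    attrs = {}
    for part in attr_str.split(";"):
        key, sep, value = part.partition("=")
        # A part without '=' (such as the empty string left after a
        # trailing ';') becomes a key with no value.
        attrs[key] = value if sep else None
    return attrs


print(split_attributes("ID=g1903;", trailing_semicolon=True))   # {'ID': 'g1903'}
print(split_attributes("ID=g1903;", trailing_semicolon=False))  # {'ID': 'g1903', '': None}
```

So the same line gives different keys depending only on which dialect was chosen for the file as a whole, which is why the results look inconsistent per line but are consistent per dialect.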

I think in this case, everything is behaving as expected and I'm not sure I would want to change anything in the code. Rather, for this example, you might want to force the dialect to have no trailing semicolon (e.g., set db.dialect['trailing semicolon'] = False before printing features).

daler / gffutils

Inconsistent behaviour of trailing semicolon #207