Closed dariober closed 1 year ago
@dariober, see #209 for some explanation. After looking at this more closely, it's actually expected behavior. Some explanation:
First, the dialect inference is designed to handle inconsistencies by weighting more highly those lines with more attribute keys. The assumption is that more keys means more information for inferring dialect. See this line in helpers._choose_dialect(). That's why your second example above ends up inferring no trailing semicolon: the first line has only one attribute and its dialect is downweighted, resulting in the detected dialect having db.dialect['trailing semicolon'] = False.
Second, if there's a tie (which happens in the last two examples above each with 4 lines, two with trailing semicolons and two without), then the dialect falls back to the first one observed as a tiebreaker (this line).
In fact, as demonstrated over in #209, if you add another line to act as a tiebreaker then you can force the dialect one way or the other, so it's actually consistent. Or I guess "internally consistent" would be more accurate.
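The weighting and the first-seen tiebreak can be illustrated with a small standalone sketch. This is not the code in helpers._choose_dialect(); the function name infer_trailing_semicolon and the exact weighting (one vote per attribute key) are assumptions made purely for illustration.

```python
def infer_trailing_semicolon(attribute_strings):
    """Hypothetical sketch: weighted vote on whether lines end with ';'.

    Each line votes with a weight equal to its number of attribute keys.
    On a tie, the value seen on the first line wins.
    Assumes at least one attribute string is given.
    """
    weights = {}  # insertion order records which value was seen first
    for attrs in attribute_strings:
        stripped = attrs.strip()
        has_trailing = stripped.endswith(";")
        # Count the non-empty fields; this is the line's weight.
        n_keys = len([f for f in stripped.split(";") if f.strip()])
        weights[has_trailing] = weights.get(has_trailing, 0) + n_keys
    # max() returns the first maximal key, so ties go to the first-seen value.
    return max(weights, key=weights.get)


# One key with a trailing ';' is outweighed by two keys without one.
print(infer_trailing_semicolon(["ID=a;", "ID=b;Parent=a"]))  # False

# Equal weights: the first line's style wins.
print(infer_trailing_semicolon(["ID=a;", "ID=b"]))  # True

# Adding one more line breaks the tie the other way.
print(infer_trailing_semicolon(["ID=a;", "ID=b", "ID=c"]))  # False
```

Under these assumptions the result depends only on the summed weights and the input order, which is why adding a single line can flip the outcome.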
I think in this case, everything is behaving as expected and I'm not sure I would want to change anything in the code. Rather, for this example, you might want to force the dialect to have no trailing semicolon (e.g., set db.dialect['trailing semicolon'] = False before printing features).
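To show why forcing the flag gives predictable output, here is a minimal sketch of a renderer that consults a dialect dictionary. The render_attributes function is hypothetical and is not part of gffutils; only the 'trailing semicolon' key is taken from the discussion above.

```python
def render_attributes(attrs, dialect):
    """Hypothetical sketch: join key=value pairs, honoring the dialect flag."""
    text = ";".join(f"{key}={value}" for key, value in attrs.items())
    # Only append the trailing ';' when the dialect asks for it.
    if dialect.get("trailing semicolon") and text:
        text += ";"
    return text


attrs = {"ID": "foo", "Parent": "bar"}
dialect = {"trailing semicolon": True}
print(render_attributes(attrs, dialect))  # ID=foo;Parent=bar;

# Force the dialect before rendering, regardless of what was inferred.
dialect["trailing semicolon"] = False
print(render_attributes(attrs, dialect))  # ID=foo;Parent=bar
```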
Sorry - me again. gffutils (v0.11.1) has inconsistent behavior regarding gff lines with or without a trailing semicolon. In some cases the trailing semicolon results in an attribute with empty string as key and None value, in some other cases the empty key is dropped. Here are some examples:
;
. So far so good:;
the second two don't. The first two lines have an empty string:;
and have the empty string attribute:
It's not a big deal but it would be nice to have consistent handling of the trailing ";". Personally, I don't see the point of an empty-string key with a None value, so I would be happy to always drop them.
I came across this behavior when, in some cases, I inserted a key in the attribute list and it printed with two consecutive semicolons, like ID=foo;Parent=bar;;gene_id=spam, which I guess is harmless but looks confusing.
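As a rough illustration of the "always drop them" suggestion, the sketch below parses an attribute string and skips empty fields, so a key inserted later never produces two consecutive semicolons. The parse_attributes helper is hypothetical, stores plain string values for brevity, and does not reflect how gffutils represents attributes internally.

```python
def parse_attributes(text):
    """Hypothetical sketch: parse 'k=v;k=v;' and skip empty fields."""
    attrs = {}
    for field in text.split(";"):
        field = field.strip()
        if not field:
            continue  # a trailing ';' yields an empty field; drop it
        key, _, value = field.partition("=")
        attrs[key] = value
    return attrs


attrs = parse_attributes("ID=foo;Parent=bar;")
attrs["gene_id"] = "spam"  # insert a key after parsing
print(";".join(f"{k}={v}" for k, v in attrs.items()))
# ID=foo;Parent=bar;gene_id=spam
```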