Open photocyte opened 6 months ago
Here is a commit on my fork of the repo that implements a bandaid, in case it is handy to anyone.. But, it's not clean enough for a PR and I'm unsure of where else in the codebase this issue should be properly fixed, if it is indeed an issue.
https://github.com/photocyte/gffutils/commit/353eb8b6b32b591941d6daa2b21ccbc14912d164
By my reading of the spec, it's not clear on how quotes should be handled. None of the examples throughout that spec have quotes. The section specific to Target says, "If the target_id contains spaces, they must be escaped as hex escape %20.". So it's only referring to the target_id (the first thing in the space-separated Target string).
In your example line, everything is quoted except for Target. gffutils uses the concept of a dialect to figure out which of the many flavors of GTF/GFF a file might be. There's a field for quoted GFF2 values
which for your example line is detected as True
, because it's seeing the other quoted attribute values.
I would argue that this example line is internally inconsistent (all attributes should be quoted or none should be quoted). Based on the spec, I'm not convinced Target is an attribute that should be handled as a special case.
So to truly fix this in gffutils and get round-trip invariability, I think we'd need to keep track of the quote status of each attribute, and store that information either at the db level. Or, in the pathological case, at the feature level if some Target attributes are quoted in the file and some are not).
That would be extra overhead and so would incur a performance hit in the general case. So it could be enabled with a kwarg. I think the main places this would need to be addressed in are in parser._split_keyvals()
and parser._reconstruct()
. And constants.dialect
would need to support a new key, I suppose that order
(as used in _reconstruct
) could be used as a template.
Happy to look at a PR to address this.
Thanks @daler , I agree with your assessement & first pass at an idea at implementation. I'll try to protect some time to work on a PR. But in all honesty, I am on leave while also working on paper revisions (thanks academia!) and my bandaid works in the meantime...s o I may not get to this a PR for ~6+ months, even though I appreciate the round-trip invariance is a pretty important property for me and for other users of gffutils
.
No worries at all, take your time. I'll keep this open for now but label as feature request.
Hi there,
I've loaded a GFF file into memory like this:
If I do this:
The printed Target attribute value gains quotes, when it shouldn't: example input:
example output:
The value of the Target attribute should not have quotes per the GFF3 spec: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md#:~:text=23%20.%20.%20.%20ID%3DMatch1-,%3BTarget%3D,-EST23%201%2021
Should I be exporting the features a different way, or is
Target
gaining quotes a bug within the__repr__
of the gffutilsfeature
?