daler / gffutils

GFF and GTF file manipulation and interconversion
http://daler.github.io/gffutils
MIT License
287 stars 78 forks source link

Value of Target attribute gains quotes when it shouldn't in round trip manipulation #230

Open photocyte opened 6 months ago

photocyte commented 6 months ago

Hi there,

I've loaded a GFF file into memory like this:

 db = gffutils.create_db(args.gff,":memory:", force=True,merge_strategy="create_unique")

If I do this:

for feature in db.all_features():
        feature.seqid = pepid_to_scafid[feature.seqid]
        feature.start = json_data['p2g'][str(feature.start)]
        feature.end = json_data['p2g'][str(feature.end)]
        ## Requires a patched gffutils to work with downstream tools, where the feature.py prints the Target attribute without quotes
        print(feature)

The printed Target attribute value gains quotes, when it shouldn't: example input:

12B1-Scaf17 Gene3D  protein_match   2346889 2347135 6.9e-15 +   .   ID="pep1__G3DSA:1.10.1200.10_40657_40739";date="11-10-2023";Target=pep1 40657 40739;Name="ACP39";status="T";Dbxref="InterPro:IPR036736"

example output:

12B1-Scaf17 Gene3D  protein_match   2346889 2347135 6.9e-15 +   .   ID="pep1-1__G3DSA:1.10.1200.10_40657_40739";date="11-10-2023";Target="pep1 40657 40739";Name="ACP39";status="T";Dbxref="InterPro:IPR036736"

The value of the Target attribute should not have quotes per the GFF3 spec: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md#:~:text=23%20.%20.%20.%20ID%3DMatch1-,%3BTarget%3D,-EST23%201%2021

Should I be exporting the features a different way, or is Target gaining quotes a bug within the __repr__ of the gffutils feature?

photocyte commented 6 months ago

Here is a commit on my fork of the repo that implements a bandaid, in case it is handy to anyone.. But, it's not clean enough for a PR and I'm unsure of where else in the codebase this issue should be properly fixed, if it is indeed an issue.

https://github.com/photocyte/gffutils/commit/353eb8b6b32b591941d6daa2b21ccbc14912d164

daler commented 5 months ago

By my reading of the spec, it's not clear on how quotes should be handled. None of the examples throughout that spec have quotes. The section specific to Target says, "If the target_id contains spaces, they must be escaped as hex escape %20.". So it's only referring to the target_id (the first thing in the space-separated Target string).

In your example line, everything is quoted except for Target. gffutils uses the concept of a dialect to figure out which of the many flavors of GTF/GFF a file might be. There's a field for quoted GFF2 values which for your example line is detected as True, because it's seeing the other quoted attribute values.

I would argue that this example line is internally inconsistent (all attributes should be quoted or none should be quoted). Based on the spec, I'm not convinced Target is an attribute that should be handled as a special case.

So to truly fix this in gffutils and get round-trip invariability, I think we'd need to keep track of the quote status of each attribute, and store that information either at the db level. Or, in the pathological case, at the feature level if some Target attributes are quoted in the file and some are not).

That would be extra overhead and so would incur a performance hit in the general case. So it could be enabled with a kwarg. I think the main places this would need to be addressed in are in parser._split_keyvals() and parser._reconstruct(). And constants.dialect would need to support a new key, I suppose that order (as used in _reconstruct) could be used as a template.

Happy to look at a PR to address this.

photocyte commented 5 months ago

Thanks @daler , I agree with your assessement & first pass at an idea at implementation. I'll try to protect some time to work on a PR. But in all honesty, I am on leave while also working on paper revisions (thanks academia!) and my bandaid works in the meantime...s o I may not get to this a PR for ~6+ months, even though I appreciate the round-trip invariance is a pretty important property for me and for other users of gffutils.

daler commented 5 months ago

No worries at all, take your time. I'll keep this open for now but label as feature request.