genometools / genometools

GenomeTools genome analysis system.
http://genometools.org
Other
296 stars 64 forks source link

GFF3 escaping #198

Open satta opened 10 years ago

satta commented 10 years ago

I was just wondering whether GFF3 escaping is used in the GFF3 input/output streams at all?

[ss34@mib106184i:~/develop/genometools] git grep gt_gff3_escape
src/extended/gff3_escaping.c:static int gt_gff3_escape_hex2prnchar(const char* str, char *out)
src/extended/gff3_escaping.c:static void gt_gff3_escape_controlchar2hex(GtStr *escaped_seq, char ctrlchar)
src/extended/gff3_escaping.c:void gt_gff3_escape(GtStr *escaped_seq, const char *unescaped_seq,
src/extended/gff3_escaping.c:          gt_gff3_escape_controlchar2hex(escaped_seq, *cc);
src/extended/gff3_escaping.c:        int ret = gt_gff3_escape_hex2prnchar(cc, &x);
src/extended/gff3_escaping.c:    gt_gff3_escape(escaped_seq, unescaped_testseq, strlen(unescaped_testseq));
src/extended/gff3_escaping.h:void gt_gff3_escape(GtStr *escaped_seq, const char *unescaped_seq,
src/gth/sa.c:  gt_gff3_escape(sa->gff3_target_attribute, gt_str_get(sa->ref_id),
[ss34@mib106184i:~/develop/genometools] git grep gt_gff3_unescape 
src/extended/gff3_escaping.c:int gt_gff3_unescape(GtStr *unescaped_seq, const char *escaped_seq,
src/extended/gff3_escaping.c:    had_err = gt_gff3_unescape(unescaped_seq, gt_str_get(escaped_seq),
src/extended/gff3_escaping.c:    had_err = gt_gff3_unescape(unescaped, code, 10, err);
src/extended/gff3_escaping.c:  gt_ensure(gt_gff3_unescape(seq, "foo%2", 5, NULL));
src/extended/gff3_escaping.h:int  gt_gff3_unescape(GtStr *unescaped_seq, const char *escaped_seq,
src/extended/gff3_parser.c:    had_err = gt_gff3_unescape(unescaped_target,
src/tools/gt_extracttarget.c:    had_err = gt_gff3_unescape(unescaped_target,

I implemented that quite some time ago, but I always figured it would actually be used somewhere in a common place in the GFF output visitor or GFF parser. The parser only uses it for the target attribute, AFAICS. Am I missing something (like this was not done on purpose) or is this functionality actually missing?

gordon commented 10 years ago

After looking at it it seems that this functionality is actually missing. IMHO the best strategy would be to have everything in unescaped form internally. That is, unescape in the GFF parser and escape in the GFF output visitor. We need a couple of test cases to make sure this is implemented correctly. What do you think?

satta commented 10 years ago

AFAICS it's not that easy when attributes with multiple values are involved. These are supposed to be separated by commas, which must then obviously be escaped when they appear within a value. This is quite common, e.g. when storing product names in gene product annotations:

PyYM_01_v1  chado   polypeptide 546193  547247  .   +   .   ID=PYYM_0113700.1:pep;Derives_from=PYYM_0113700.1;isObsolete=false;product=YIR protein%2C pseudogene,PIR protein%2C pseudogene

(note: fabricated example, but you get the idea)

While this is not a problem for unescaping, it raises a problem for escaping: one needs to know which comma needs to be escaped (as it is part of the value), and which one must stay a literal comma (as it is a multi-value separator). Otherwise either names get split into multiple values or the multi-value is flattened into one single value. A generic way to deal with this would be to add and maintain a data structure which allows to discern the individual entries in the multi-value attribute (i.e. extra pointers into the GtTagValueMap) and additional attribute API functions to create and access them.

Any other ideas?

gordon commented 10 years ago

I see, that would require to extent the GtTagValueMap and the GtFeatureNode API. Right now the complete value (even for multi-value attributes) is returned and the escaping is not touched at all. A change might not be backwards compatible and we never had an actual problem with this. Maybe we should leave it as it is and let the user deal with the splitting and unescaping of multi-value attributes?

satta commented 10 years ago

Well, that would lead to additional overhead, e.g. in the bindings, in which I would really like not to have to worry about escaping/unescaping. But then the bindings would have to either (redundantly) implement the escaping by themselves, or (expensively) call another C function. However, I propose that at least the multi-value property of an attribute can be queried using the API so someone can know whether an attribute value must be processed as a multi-value entry or not. Do you agree? I'd go with something like:

bool gt_feature_node_attribute_has_multiple_values(GtFeatureNode*, const char*);
gordon commented 10 years ago

Is there an actual problem which you are trying to solve? If not, I wouldn't bother with it. Let the person who has an actual problem deal with it. http://c2.com/cgi/wiki?YouArentGonnaNeedIt explains the concept better than I can.

satta commented 10 years ago

Yes, I am currently using the custom visitors/streams in the Ruby bindings to write a ad-hoc tool to discover and fix some specific semantic errors on GFF3 annotations. As some of the attributes contain multi-value fields, each of which contain escaped data with more internal structure (i.e. escaped semicolons, equal signs, ...) I would like to process, I'd have expected the attribute accessor methods in GenomeTools to take care of the escaping/unescaping for me. That would have made it trivial for me to check whether the software which created the GFFs in the first place did its job right. I totally agree, implementing stuff just because someone might need it in the future for completeness' sake is often a waste of time and effort. But in this case I was just surprised because I invested quite some time in getting the escaping library functions right ;)

gordon commented 10 years ago

OK, then it makes total sense ;-) I thought about the problem today and I think the current design is just wrong and we have to change it. It is erroneously assumed that an attribute maps to a single value, but it can in fact map to multiple value. Since the implementation is very space critical the tag value map should be extended in a way that is still memory efficient. My idea is to change the current memory layout to:

tag\0value\0tag\0value1\0\0value2\0\0\0

That is, the map is terminated by three zeros instead of two and two zeros denote an additional value for a tag. That would lead to some API breakages, but is probably the cleanest approach. With this we can also properly unescape the GFF3 internally and escape it again in the output. What do you think?

satta commented 10 years ago

Sounds good! Better space-wise than using pointers into the representation. Do you have suggestions for API extensions allowing multi-value access?

gordon commented 10 years ago

The tag value list I would extend like this:

/* Iterator function used to iterate over tag/value maps. A <tag>/<value> pair
   and user <data> are given as arguments. For tags with multiple values, the
   function is called once for each value. On the first invocation <multi_entry>
   is set to <false> and on each subsequent invocation to <true>. */
typedef void (*GtTagValueMapIteratorFunc)(const char *tag, const char *value,
                                          bool multi_entry, void *data);

to make sure that the GFF3 visitor is still running efficiently (breaks the API).

I would not change the feature node API gt_feature_node_get_attribute and just adjust the semantic to return the first value. I would add the following to APIs:

/* Return the (first) attribute of <feature_node> with the given <name>.
   If no such attribute has been added, <NULL> is returned.
   Sets <is_multi> to <true> if this attribute has multiple values.
   The attributes are stored in column 9 of GFF3 feature lines. */
const char*    gt_feature_node_get_attribute_multi_info(const GtFeatureNode
                                                        *feature_node,
                                                        const char *name,
                                                        bool *is_multi);

/* Return the <n>th attribute value of <feature_node> with the given <name>.
   If no such attribute has been added or it does not have that many values,
   <NULL> is returned.
   The attributes are stored in column 9 of GFF3 feature lines. */
const char*    gt_feature_node_get_multi_attribute(const GtFeatureNode
                                                        *feature_node,
                                                        const char *name,
                                                        GtUword n);

to allow to query multi value attributes. These are just suggestions of course, what do you think of them?

standage commented 10 years ago

Along with these would it not also be helpful to query how many values the attribute has? Grabbing the nth attribute value is only helpful if you know the maximum value of n, am I right?

satta commented 10 years ago

I agree.

gordon commented 10 years ago

OK, I take that into account and code something up.

gordon commented 10 years ago

This is more complicated to implement than I originally thought and I can't type that much right now (finger injury).

satta commented 10 years ago

Ouch! I hope you are getting better soon. Let's just keep this on our list, I can do it as well.

gordon commented 10 years ago

Thanks, should be fine again in a couple of days.

peterjc commented 8 years ago

I think the root cause of https://github.com/sanger-pathogens/gff3toembl/issues/50 is that the genometools GFF3 parser is not unescaping %2C to comma (etc).

As a short term hack, is gt_gff3_escape available in the Python bindings so this can be explicitly applied to the parser output?

satta commented 8 years ago

No, it's not. I can add it but it might take some time... Would the percent-encoding functions in Python's urllib also work as a workaround?

peterjc commented 8 years ago

@satta thanks. I'm looking at using Python's percent-encoding or just special casing the comma as a possible short term workaround in https://github.com/sanger-pathogens/gff3toembl/issues/50 - I'm dealing with two unfamiliar code bases so haven't managed to pin down the best place to insert the change yet.