chapmanb / bcbb

Incubator for useful bioinformatics code, primarily in Python and R
http://bcbio.wordpress.com
604 stars, 243 forks

over url-encoding in attribute fields #86

Open bobbyo opened 10 years ago

bobbyo commented 10 years ago

While adding a Name=value attribute to my data and having GFFOutput.py write it, I found that the value is fully URL-encoded, which differs from the gff3 specification. In my case, an attribute like NAME=jgi.p|Schco3|1037802 ends up URL-encoded as NAME=jgi.p%7CSchco3%7C1037802, which causes problems in our downstream data use. I believe such values should not be escaped according to the gff3 standard. The gff3 standard v1.21 says:

URL escaping rules are used for tags or values containing the following characters: ",=;". Spaces are allowed in this field, but tabs must be replaced with the %09 URL escape.  -- http://www.sequenceontology.org/gff3.shtml 

So the rule seems to be:

  1. an attribute key or value should be fully URL-escaped when it contains any of ",=;"
  2. TAB characters in an attribute key or value should always be escaped, but the presence of a TAB does not trigger full URL encoding of that key or value

The attribute key and value in NAME=jgi.p|Schco3|1037802 do not contain ",=;". Hence this should not be escaped.
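The two rules above can be sketched roughly as follows. This is only an illustration of the reading described in this comment, not the contents of any patch to GFFOutput.py; the function name `escape_attribute` is hypothetical, and the Python 3 import `urllib.parse.quote` is assumed in place of the Python 2 `urllib.quote`.

```python
from urllib.parse import quote

# Characters that, per the spec text quoted above, trigger URL escaping.
RESERVED = ",=;"


def escape_attribute(text):
    """Escape one attribute key or value (hypothetical helper).

    Rule 1: if the text contains any of ",=;", URL-escape the whole string.
    Rule 2: otherwise, replace only TAB characters with %09.
    """
    if any(ch in text for ch in RESERVED):
        return quote(text, safe="")
    return text.replace("\t", "%09")


# Plain quote() escapes "|" even though the spec does not ask for it.
over_encoded = quote("jgi.p|Schco3|1037802")

# The selective routine leaves the same value untouched.
selective = escape_attribute("jgi.p|Schco3|1037802")
```

Under this reading, `over_encoded` reproduces the `%7C` form reported above, while `selective` stays identical to the input.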

Do you agree? Would you like a patch to GFFOutput.py that provides a routine following those rules for escaping values?

chapmanb commented 10 years ago

Bobby; That would be great. I wish the spec had a more consistent and standard quoting approach instead of something custom, hence my use of urllib.quote/unquote. If it's causing issues with downstream tools, it would make sense to clean it up and I'd be happy to accept a patch. Sorry about the issues and thanks for looking at this.

bobbyo commented 10 years ago

Here is a patch; feel free to tighten/modify as you wish.

The gff3 standard seems to make the encoding a bit tough to use: how does one know when URL-encoding-like procedures have been applied? For example, I'm not clear on how you know for certain to URL-decode when reading the gff3 data back in. But this patch does apply the encoding that the gff3 standard seems to be requesting. I confess that in the case that caused me to write the patch, the standard suggests the data should not be encoded, and that is the use case I tested.
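A small sketch of the read-back concern, assuming a reader simply applies `urllib.parse.unquote` (Python 3) to every value; the name `decode_attribute` is hypothetical. Values with no percent sequences pass through unchanged, but a value that was written unescaped and happens to contain text that looks like an escape is altered on decoding.

```python
from urllib.parse import unquote


def decode_attribute(text):
    """Decode one attribute key or value by URL-unquoting it (hypothetical helper)."""
    return unquote(text)


# Never escaped, no "%": decoding is a no-op.
plain = decode_attribute("jgi.p|Schco3|1037802")

# Escaped because it contained "=" and ";": decoding restores it.
restored = decode_attribute("a%3Db%3Bc")

# Written unescaped, but contains a literal "%7C": decoding changes it.
ambiguous = decode_attribute("100%7C")
```

Here `ambiguous` comes back as `100|` rather than the original text, which is the kind of case where the reader cannot tell whether encoding was applied.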

Best, Bobby O


Robert P Otillar, PhD Bioinformatics Analyst Joint Genome Institute Genomic Annotation Division 2800 Mitchell Drive Walnut Creek, CA 94598 Tel: 925-296-5786 Fax: 925-296-5752

RPOtillar@lbl.gov

chapmanb commented 10 years ago

Bobby; Thanks much for looking at this. I didn't see a patch in your reply. Could you send a pull request, or post the patch as a Gist? Thanks again.

bobbyo commented 10 years ago

Sorry; oddly I did see it attached to my earlier email; here it is again. I definitely see it attached to this email, as attachment:

GFFOutput.col9_encoding_fix.patch (2k)

Let me know if it does not come through.

-B


chapmanb commented 10 years ago

Bobby; These e-mails come in as GitHub issue comments, and it looks like they remove attachments so I'm not getting it. You can see them on the issue page:

https://github.com/chapmanb/bcbb/issues/86

A Gist (https://gist.github.com/) with the patch is probably the best approach. Thanks again.