daler / gffutils

GFF and GTF file manipulation and interconversion
http://daler.github.io/gffutils
MIT License

fix #198 #208

Closed daler closed 1 year ago

daler commented 1 year ago

This provides a better solution for #198.

Previously, parsing the following attributes would raise an exception complaining about inconsistency: there are repeated db_xref keys but there's also a comma in the description. How to interpret the comma?

db_xref "GeneID:653635"; db_xref "HGNC:HGNC:38034"; description "WASP family homolog 7, pseudogene";

With this PR, repeated keys now always win over commas in a value, which forces a resolution of the inconsistency. Furthermore, to prevent these attributes, which have no repeated keys but do have a comma in the description:

db_xref "HGNC:HGNC:38034"; description "WASP family homolog 7, pseudogene";

from being parsed into this:

>>> f.attributes['description']
['WASP family homolog 7', ' pseudogene']
# Note leading space ------^
# (this no longer happens with this PR)

then with this PR, if we see a space after any comma in a value, we assume the value is NOT a repeated (multi-value) attribute and don't split it, so we now get this:

>>> f.attributes['description']
['WASP family homolog 7, pseudogene']

That is, we now expect repeated values to have NO space after commas. So this:

db_xref "HGNC:HGNC:38034"; description "WASP family homolog 7,pseudogene";
# This one has no space -------------------------------------^
# We now assume true multi-value attributes have no space 

gets parsed to

>>> f.attributes['description']
['WASP family homolog 7', 'pseudogene']

because the lack of space after the comma means it's interpreted as a multi-value attribute.
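
For illustration, the two rules described above can be sketched as a small standalone function. This is not the code in this PR or gffutils' actual parser: `parse_attributes` is a hypothetical name, and the sketch assumes GTF-style `key "value";` fields with no semicolons or escaped quotes inside values.

```python
from collections import Counter


def parse_attributes(attr_string):
    """Hypothetical sketch of the resolution rules; not gffutils' parser.

    Assumes GTF-style `key "value";` fields and no semicolons inside values.
    """
    pairs = []
    for field in attr_string.strip().split(";"):
        field = field.strip()
        if not field:
            continue
        # The key is everything before the first space; the rest is the value.
        key, _, raw = field.partition(" ")
        pairs.append((key, raw.strip().strip('"')))

    # Rule 1: if any key is repeated, repeated keys win and commas are
    # treated as ordinary characters in every value.
    counts = Counter(key for key, _ in pairs)
    any_repeated = any(n > 1 for n in counts.values())

    attributes = {}
    for key, value in pairs:
        # Rule 2: a comma followed by a space is assumed to be prose,
        # not a separator between multiple values.
        if any_repeated or ", " in value:
            values = [value]
        else:
            values = value.split(",")
        attributes.setdefault(key, []).extend(values)
    return attributes


repeated = parse_attributes(
    'db_xref "GeneID:653635"; db_xref "HGNC:HGNC:38034"; '
    'description "WASP family homolog 7, pseudogene";'
)
with_space = parse_attributes(
    'db_xref "HGNC:HGNC:38034"; description "WASP family homolog 7, pseudogene";'
)
no_space = parse_attributes(
    'db_xref "HGNC:HGNC:38034"; description "WASP family homolog 7,pseudogene";'
)

print(repeated["db_xref"])
print(with_space["description"])
print(no_space["description"])
```

Under these assumptions, the three calls correspond to the three attribute strings discussed in this description: the first keeps both `db_xref` values and leaves the description intact, the second leaves the description unsplit because of the space after the comma, and the third splits it into two values.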