Previously, parsing the following attributes raised an exception complaining about inconsistency: the db_xref key is repeated, but the description value also contains a comma. How should the comma be interpreted?
db_xref "GeneID:653635"; db_xref "HGNC:HGNC:38034"; description "WASP family homolog 7, pseudogene";
With this PR, repeated keys always take precedence over commas in a value, which resolves the inconsistency. Furthermore, we want to prevent these attributes, which have no repeated keys but do have a comma in the description:
db_xref "HGNC:HGNC:38034"; description "WASP family homolog 7, pseudogene";
from being parsed into this:
>>> f.attributes['description']
['WASP family homolog 7', ' pseudogene']
# Note leading space ------^
# (this no longer happens with this PR)
With this PR, if we see a space after any comma in a value, we assume it is NOT a multi-value attribute and don't split it, so we now get this:
>>> f.attributes['description']
['WASP family homolog 7, pseudogene']
That is, we now expect multi-value attributes to have NO space after commas. So this:
db_xref "HGNC:HGNC:38034"; description "WASP family homolog 7,pseudogene";
# This one has no space -------------------------------------^
# We now assume true multi-value attributes have no space
gets parsed to
>>> f.attributes['description']
['WASP family homolog 7', 'pseudogene']
because the lack of space after the comma means it's interpreted as a multi-value attribute.
This provides a better solution for #198.
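The rules described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `split_attribute_values` is a hypothetical helper, and it assumes the attributes have already been tokenized into `(key, value)` pairs with the quotes stripped.

```python
def split_attribute_values(pairs):
    """Illustrative only: apply the comma-handling rules to (key, value) pairs.

    `pairs` is a list of (key, value) tuples in file order. This helper name
    and input shape are assumptions made for this sketch.
    """
    keys = [key for key, _ in pairs]
    # Rule 1: if any key is repeated, repeated keys win and commas are
    # never treated as value separators.
    has_repeated_keys = len(set(keys)) != len(keys)

    parsed = {}
    for key, value in pairs:
        # Rule 2: a space after any comma means the comma is part of the
        # text, so the value is kept whole.
        if has_repeated_keys or ", " in value:
            values = [value]
        else:
            # Rule 3: commas with no following space separate multiple values.
            values = value.split(",")
        parsed.setdefault(key, []).extend(values)
    return parsed


repeated = split_attribute_values([
    ("db_xref", "GeneID:653635"),
    ("db_xref", "HGNC:HGNC:38034"),
    ("description", "WASP family homolog 7, pseudogene"),
])
spaced = split_attribute_values([
    ("db_xref", "HGNC:HGNC:38034"),
    ("description", "WASP family homolog 7, pseudogene"),
])
unspaced = split_attribute_values([
    ("db_xref", "HGNC:HGNC:38034"),
    ("description", "WASP family homolog 7,pseudogene"),
])
```

Here `repeated` and `spaced` both keep the description as a single value, while `unspaced` splits it into two values.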