Whitespace between headers

bradenneal1 commented 4 years ago

The regular expression MESSAGE_REGEX does not allow whitespace (or newlines) between headers. For example, if the test message MESSAGE1 is defined as:

MESSAGE1 = """{1:F01ASDFJK20AXXX0987654321}
{2:I103ASDFJK22XXXXN}
{4: :20:20180101-ABCDEF :23B:GHIJ :32A:180117CAD5432,1 :33B:EUR9999,0 :50K:/123456-75901 SOMEWHERE New York 999999 GR :53B:/20100213012345 :57C://SC200123 :59:/201001020 First Name Last Name a12345bc6d789ef01a23 Nowhere NL :70:test reference test reason payment group: 1234567-ABCDEF :71A:SHA :77B:Test this
-}"""

It does not parse:

>>> import mt103
>>> message = mt103.MT103(MESSAGE1)
>>> message.text
>>>

Redefining the regex to accept whitespace characters between headers:

import re

MESSAGE_REGEX = re.compile(
    r"^"
    r"({1:(?P<basic_header>[^}]+)})?\s*"
    r"({2:(?P<application_header>(I|O)[^}]+)})?\s*"
    r"({3:"
        r"(?P<user_header>"
            r"({113:[A-Z]{4}})?"
            r"({108:[A-Z 0-9]{0,16}})?"
            r"({111:[0-9]{3}})?"
            r"({121:[a-zA-Z0-9]{8}-[a-zA-Z0-9]{4}-4[a-zA-Z0-9]{3}-[89ab][a-zA-Z0-9]{3}-[a-zA-Z0-9]{12}})?\s*"  # NOQA: E501
        r")"
    r"})?"
    r"({4:\s*(?P<text>.+?)\s*-})?\s*"
    r"({5:(?P<trailer>.+)})?"
    r"$",
    re.DOTALL
)

solves the issue:

>>> import mt103
>>> message = mt103.MT103(MESSAGE1)
>>> message.text
:20:20180101-ABCDEF :23B:GHIJ :32A:180117CAD5432,1 :33B:EUR9999,0 :50K:/123456-75901 SOMEWHERE New York 999999 GR :53B:/20100213012345 :57C://SC200123 :59:/201001020 First Name Last Name a12345bc6d789ef01a23 Nowhere NL :70:test reference test reason payment group: 1234567-ABCDEF :71A:SHA :77B:Test this
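The effect can also be checked with `re` alone, without going through the library. The following is a minimal sketch: the pattern is copied from this comment, the message is a shortened version of MESSAGE1 (fewer tags, same layout), and nothing is assumed about how `mt103.MT103` applies the pattern internally.

```python
import re

# Pattern copied from the comment above: \s* is allowed after most blocks.
MESSAGE_REGEX = re.compile(
    r"^"
    r"({1:(?P<basic_header>[^}]+)})?\s*"
    r"({2:(?P<application_header>(I|O)[^}]+)})?\s*"
    r"({3:"
        r"(?P<user_header>"
            r"({113:[A-Z]{4}})?"
            r"({108:[A-Z 0-9]{0,16}})?"
            r"({111:[0-9]{3}})?"
            r"({121:[a-zA-Z0-9]{8}-[a-zA-Z0-9]{4}-4[a-zA-Z0-9]{3}-[89ab][a-zA-Z0-9]{3}-[a-zA-Z0-9]{12}})?\s*"  # NOQA: E501
        r")"
    r"})?"
    r"({4:\s*(?P<text>.+?)\s*-})?\s*"
    r"({5:(?P<trailer>.+)})?"
    r"$",
    re.DOTALL,
)

# Shortened stand-in for MESSAGE1: one block per line, fewer tags.
MESSAGE = (
    "{1:F01ASDFJK20AXXX0987654321}\n"
    "{2:I103ASDFJK22XXXXN}\n"
    "{4: :20:20180101-ABCDEF :23B:GHIJ :71A:SHA :77B:Test this\n"
    "-}"
)

match = MESSAGE_REGEX.match(MESSAGE)
# The text group excludes the whitespace before the closing "-}".
text = match.group("text") if match else None
print(text)
```

If the match succeeds, `text` holds the body of block 4 with the surrounding whitespace stripped by the `\s*` on either side of the group.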

danielquinn commented 4 years ago

I'm not sure about this one. Is there somewhere in the spec that says it's OK to have newlines in these locations and not others? Your suggested changes are simple enough, and making them does indeed mean that you can parse a message with newlines in it, but I'm not clear on whether the mt103 message in question is valid with newlines in it, or whether your suggested placements for newlines represent all the cases where this would be a problem. Do you have a spec I can reference for confirmation?

I ask because the placement of the \s* bits seems strangely arbitrary. You've got one after every section except 5, and they only appear after a header but not between sections.

If this is valid:

{1:F01ASDFJK20AXXX0987654321}
{2:I103ASDFJK22XXXXN}
{4: :20:20180101-ABCDEF :23B:GHIJ :32A:180117CAD5432,1 :33B:EUR9999,0 :50K:/123456-75901 SOMEWHERE New York 999999 GR :53B:/20100213012345 :57C://SC200123 :59:/201001020 First Name Last Name a12345bc6d789ef01a23 Nowhere NL :70:test reference test reason payment group: 1234567-ABCDEF :71A:SHA :77B:Test this
-}

Is this not?

{1:F01ASDFJK20AXXX0987654321}
{2:I103ASDFJK22XXXXN}
{
4: :20:20180101-ABCDEF :23B:GHIJ :32A:180117CAD5432,1 :33B:EUR9999,0 :50K:/123456-75901 SOMEWHERE New York 999999 GR :53B:/20100213012345 :57C://SC200123 :59:/201001020 First Name Last Name a12345bc6d789ef01a23 Nowhere NL :70:test reference test reason payment group: 1234567-ABCDEF :71A:SHA :77B:Test this
-}

Might it be better to just message.replace("\n", "") before parsing it, or is that likely to break things elsewhere? Until I'm certain, I'm not keen on making this change. If you have something I can reference to be sure, that'd go a long way toward helping me figure this out.
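To make the two variants concrete, here is a rough sketch using a reduced stand-in for the proposed pattern (blocks 1, 2 and 4 only, an assumption made to keep the example short) and a shortened message body. It compares a newline between blocks, a newline right after the opening brace, and the same message with every newline stripped first.

```python
import re

# Reduced stand-in for the proposed pattern: blocks 1, 2 and 4 only.
REDUCED = re.compile(
    r"^"
    r"({1:(?P<basic_header>[^}]+)})?\s*"
    r"({2:(?P<application_header>(I|O)[^}]+)})?\s*"
    r"({4:\s*(?P<text>.+?)\s*-})?\s*"
    r"$",
    re.DOTALL,
)

HEADERS = "{1:F01ASDFJK20AXXX0987654321}\n{2:I103ASDFJK22XXXXN}\n"
BODY = ":20:20180101-ABCDEF :23B:GHIJ\n-}"

between_blocks = HEADERS + "{4: " + BODY   # newlines only between blocks
inside_opener = HEADERS + "{\n4: " + BODY  # newline right after the opening brace

between_ok = REDUCED.match(between_blocks) is not None
inside_ok = REDUCED.match(inside_opener) is not None
# Stripping every newline before matching lets the second variant parse,
# at the cost of flattening the text block as well.
stripped_ok = REDUCED.match(inside_opener.replace("\n", "")) is not None

print(between_ok, inside_ok, stripped_ok)
```

Under these assumptions the first variant matches, the second does not, and the stripped version matches again, which is the trade-off the question above is getting at.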

bradenneal commented 4 years ago

I don't have a specification to provide unfortunately.

I was initially using message.replace("\n", ""), but came unstuck when parsing tags which contain more than one component. For example, if the above message were formatted as:

{1:F01ASDFJK20AXXX0987654321}
{2:I103ASDFJK22XXXXN}
{4:
:20:20180101-ABCDEF
:23B:GHIJ
:32A:180117CAD5432,1
:33B:EUR9999,0
:50K:/123456-75901
SOMEWHERE
New York
999999
GR
:53B:/20100213012345
:57C://SC200123
:59:/201001020 
First Name Last Name
a12345bc6d789ef01a23
Nowhere
NL
:70:test reference
test reason
payment group:
1234567-ABCDEF
:71A:SHA
:77B:Test this
-}

Both 50K and 59 tags follow a format of Account, Name1, Name2, Address, City/Postal Code. With the newline characters removed, there is no way to determine where "Account" finishes and "Name1" starts etc. Keeping the newlines (and making the parser newline insensitive) allows message.ordering_customer.split('\n') to identify the individual components.
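A small sketch of the difference, using the 50K value from the message above as a plain string. It assumes the parser hands the field back with its newlines intact, as described; the variable here is an ordinary `str`, not the library attribute.

```python
# Value of tag 50K as it appears in the multi-line message above.
ordering_customer = "/123456-75901\nSOMEWHERE\nNew York\n999999\nGR"

# With the newlines kept, each component can be recovered.
components = ordering_customer.split("\n")
account, *rest = components

# With the newlines removed first, the component boundaries are lost.
flattened = ordering_customer.replace("\n", "")

print(components)
print(flattened)
```

In the flattened string there is nothing left to mark where the account ends and the first name line begins.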

bradenneal commented 4 years ago

> You've got one after every section except 5

That's an oversight on my part. I would consider a message with trailing whitespace still valid (but have simply never seen one).
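A possible way to cover that case is one more `\s*` before the final `$`. The sketch below is only an illustration: both patterns are reduced to blocks 4 and 5, and the trailer content is made up. Note that Python's `$` already tolerates a single newline at the very end of the string, so the difference only shows up with trailing spaces or more than one newline.

```python
import re

# Hypothetical reduced pattern (blocks 4 and 5 only) with \s* after block 5.
WITH_TRAILING = re.compile(
    r"^"
    r"({4:\s*(?P<text>.+?)\s*-})?\s*"
    r"({5:(?P<trailer>.+)})?\s*"
    r"$",
    re.DOTALL,
)

# Same reduced pattern without the extra \s*, as in the proposal above.
WITHOUT_TRAILING = re.compile(
    r"^"
    r"({4:\s*(?P<text>.+?)\s*-})?\s*"
    r"({5:(?P<trailer>.+)})?"
    r"$",
    re.DOTALL,
)

# Made-up trailer, followed by two spaces and a newline.
message = "{4:\n:20:20180101-ABCDEF\n-}\n{5:{CHK:123456789ABC}}  \n"

with_match = WITH_TRAILING.match(message)
without_match = WITHOUT_TRAILING.match(message)

print(with_match is not None, without_match is not None)
```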

danielquinn commented 4 years ago

Alright, I've had a conversation with some more financially-minded people (as opposed to software-minded ones like me), and it looks like line breaks are common in a message, so I'm going to make this change.

Do you perhaps have a few test messages I can use to ensure that everything works as expected? None of the messages I have access to contain line breaks.

danielquinn / mt103

Whitespace between headers #6