HiraokaHyperTools / msgreader


Most of parsed data is broken since version 1.11.* #21

Closed GTCrais closed 2 years ago

GTCrais commented 2 years ago

After issue https://github.com/HiraokaHyperTools/msgreader/issues/19 was fixed, most of the other parsed data broke:

1.9.0 (working correctly)

{
  dataType: 'msg',
  attachments: [
    {
      dataType: 'attachment',
      extension: '.jpg',
      name: 'image001.jpg',
      fileName: 'image001.jpg',
      dataId: 91,
      contentLength: 4295,
      fileNameShort: 'image001.jpg',
      pidContentId: 'image001.jpg@01D1803E.8E7672B0',
      creationTime: 'Thu, 17 Mar 2016 17:22:19 GMT',
      lastModificationTime: 'Thu, 17 Mar 2016 17:22:19 GMT',
      attachmentHidden: true
    },
    {
      dataType: 'attachment',
      name: 'image002.png',
      dataId: 79,
      contentLength: 1426,
      fileNameShort: 'image002.png',
      extension: '.png',
      fileName: 'image002.png',
      pidContentId: 'image002.png@01D18105.46238E90',
      creationTime: 'Fri, 18 Mar 2016 15:00:38 GMT',
      lastModificationTime: 'Fri, 18 Mar 2016 15:00:38 GMT',
      attachmentHidden: true
    }
  ],
  recipients: [
    {
      dataType: 'recipient',
      addressType: 'EX',
      name: '__redacted_name__',
      email: '__redacted_email__',
      smtpAddress: '__redacted_email__',
      recipType: 'to'
    }
  ],
  senderName: '__redacted_name__',
  subject: '__redacted_subject__',
  headers: '__redacted_headers__',
  senderAddressType: 'EX',
  sentRepresentingSmtpAddress: '__redacted_email__',
  senderEmail: '__redacted_email__',
  compressedRtf: Uint8Array(21075) [
     79,  82,   0,   0, 122,  84,   1,   0,  76,  90,  70, 117,
     68, 124, 131,   8,   3,   0,  10,   0, 114,  99, 112, 103,
     49,  50,  53, 130,  50,   3,  67, 104, 116, 109, 108,  49,
      3,  49, 248,  98, 105, 100,   4,   0,   3,  48,   1,   3,
      1, 247,  10, 128,  39,   2, 164,   3, 227,   2,   0,  99,
    104,  10, 192, 115, 101, 248, 116,  48,  32,   7,  19,   2,
    128,  16, 131,   0,  80,   4,  86, 191,   8,  85,   7, 178,
     18,  85,  14,  81,   3,   1,  17,  87,  50,   6,   0,  59,
      6, 195,  18,  85,
    ... 20975 more items
  ],
  lastModifierName: '__redacted_name__',
  senderSmtpAddress: '__redacted_email__',
  inetAcctName: '__redacted_email__',
  body: '__redacted_body__',
  creationTime: 'Fri, 18 Mar 2016 16:00:26 GMT',
  lastModificationTime: 'Fri, 18 Mar 2016 16:00:26 GMT',
  clientSubmitTime: 'Fri, 18 Mar 2016 15:33:22 GMT',
  messageDeliveryTime: 'Fri, 18 Mar 2016 15:33:23 GMT'
}

1.11 and 1.12 (broken)

{
  dataType: 'msg',
  attachments: [
    {
      dataType: 'attachment',
      extension: [Uint8Array],
      name: [Uint8Array],
      fileName: [Uint8Array],
      dataId: 91,
      contentLength: 4295,
      fileNameShort: [Uint8Array],
      pidContentId: [Uint8Array],
      creationTime: 'Thu, 17 Mar 2016 17:22:19 GMT',
      lastModificationTime: 'Thu, 17 Mar 2016 17:22:19 GMT',
      attachmentHidden: true
    },
    {
      dataType: 'attachment',
      name: [Uint8Array],
      dataId: 79,
      contentLength: 1426,
      fileNameShort: [Uint8Array],
      extension: [Uint8Array],
      fileName: [Uint8Array],
      pidContentId: [Uint8Array],
      creationTime: 'Fri, 18 Mar 2016 15:00:38 GMT',
      lastModificationTime: 'Fri, 18 Mar 2016 15:00:38 GMT',
      attachmentHidden: true
    }
  ],
  recipients: [
    {
      dataType: 'recipient',
      addressType: [Uint8Array],
      name: [Uint8Array],
      email: [Uint8Array],
      smtpAddress: [Uint8Array],
      recipType: 'to'
    }
  ],
  senderName: Uint8Array(8) [
    11, 0, 0, 0,
     3, 0, 0, 0
  ],
  messageClass: Uint8Array(8) [
    9, 0, 0, 0,
    3, 0, 0, 0
  ],
  subject: Uint8Array(8) [
    15, 0, 0, 0,
     3, 0, 0, 0
  ],
  headers: Uint8Array(8) [
    247, 6, 0, 0,
      3, 0, 0, 0
  ],
  senderAddressType: Uint8Array(8) [
    3, 0, 0, 0,
    3, 0, 0, 0
  ],
  sentRepresentingSmtpAddress: Uint8Array(8) [
    18, 0, 0, 0,
     3, 0, 0, 0
  ],
  senderEmail: Uint8Array(8) [
    129, 0, 0, 0,
      3, 0, 0, 0
  ],
  compressedRtf: Uint8Array(21075) [
     79,  82,   0,   0, 122,  84,   1,   0,  76,  90,  70, 117,
     68, 124, 131,   8,   3,   0,  10,   0, 114,  99, 112, 103,
     49,  50,  53, 130,  50,   3,  67, 104, 116, 109, 108,  49,
      3,  49, 248,  98, 105, 100,   4,   0,   3,  48,   1,   3,
      1, 247,  10, 128,  39,   2, 164,   3, 227,   2,   0,  99,
    104,  10, 192, 115, 101, 248, 116,  48,  32,   7,  19,   2,
    128,  16, 131,   0,  80,   4,  86, 191,   8,  85,   7, 178,
     18,  85,  14,  81,   3,   1,  17,  87,  50,   6,   0,  59,
      6, 195,  18,  85,
    ... 20975 more items
  ],
  lastModifierName: Uint8Array(8) [
    11, 0, 0, 0,
     3, 0, 0, 0
  ],
  senderSmtpAddress: Uint8Array(8) [
    18, 0, 0, 0,
     3, 0, 0, 0
  ],
  inetAcctName: Uint8Array(8) [
    17, 0, 0, 0,
     3, 0, 0, 0
  ],
  body: Uint8Array(8) [
    156, 34, 0, 0,
      3,  0, 0, 0
  ],
  creationTime: 'Fri, 18 Mar 2016 16:00:26 GMT',
  lastModificationTime: 'Fri, 18 Mar 2016 16:00:26 GMT',
  clientSubmitTime: 'Fri, 18 Mar 2016 15:33:22 GMT',
  messageDeliveryTime: 'Fri, 18 Mar 2016 15:33:23 GMT'
}
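
One possible reading of the 8-byte values above, offered as an assumption rather than something confirmed in this thread: in the .msg property stream described by [MS-OXMSG], the entry for a variable-length property holds a 4-byte little-endian size followed by 4 reserved bytes, while the string itself lives in a separate stream. If that entry is what leaks through here, senderAddressType's bytes would mean a size of 3, which fits 'EX' plus a terminating NUL. The helper name below is hypothetical.

```typescript
// Hypothetical helper: interpret an 8-byte value as two little-endian uint32s.
// Assumption: the first is the size of the property's data, the second a reserved field.
function readSizeAndReserved(raw: Uint8Array): { size: number; reserved: number } {
  if (raw.length !== 8) {
    throw new RangeError(`expected 8 bytes, got ${raw.length}`);
  }
  const view = new DataView(raw.buffer, raw.byteOffset, raw.byteLength);
  return { size: view.getUint32(0, true), reserved: view.getUint32(4, true) };
}

// Values copied from the broken 1.11/1.12 output above.
const senderAddressTypeRaw = new Uint8Array([3, 0, 0, 0, 3, 0, 0, 0]);
const bodyRaw = new Uint8Array([156, 34, 0, 0, 3, 0, 0, 0]);

console.log(readSizeAndReserved(senderAddressTypeRaw)); // { size: 3, reserved: 3 }
console.log(readSizeAndReserved(bodyRaw));              // { size: 8860, reserved: 3 }
```

If this reading is right, the text itself is not lost; only the lookup that should replace the entry with the decoded string is being skipped.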
kenjiuno commented 2 years ago

> 1.11 and 1.12 (broken)

Hmm, does this still happen in 1.11 and later? How did you get that result: with node cli parse ... , or by using msgreader as a Node project dependency?

kenjiuno commented 2 years ago

I have set up a demo site for the latest msgreader, 1.12.0-alpha.2, where you can try the latest decoder.

https://hiraokahypertools.github.io/msgreader_demo/

source: https://github.com/HiraokaHyperTools/msgreader_demo

It works in modern web browsers such as Chrome and Firefox. It is a simple web app using webpack-bundled JavaScript (not WebAssembly).

GTCrais commented 2 years ago

> 1.11 and 1.12 (broken)
>
> Hmm, does this still happen in 1.11 and later? How did you get that result: with node cli parse ... , or by using msgreader as a Node project dependency?

Yes, by using it as a Node project dependency. [Screenshot from the online msgreader demo]

This was the sequence of events, maybe this can help you:

1) 1.9.0 - working fine
2) Refactoring to 1.10.0:
   https://github.com/HiraokaHyperTools/msgreader/commit/752dd4b65fc7f41ea13c0ee0f88983f1f6e6af22
   https://github.com/HiraokaHyperTools/msgreader/commit/8fa9cbecf50908677b31d49d2a55735adb20af8a
   https://github.com/HiraokaHyperTools/msgreader/commit/5530d7d5cdbaafebfbe9ff631323e7cba66c5c92
   https://github.com/HiraokaHyperTools/msgreader/commit/6f14101010bdbade3f9a049eed5f4963f54f0f0a
   https://github.com/HiraokaHyperTools/msgreader/commit/0731ae8057b2a37417ae8a67adb84321a016ee43
   https://github.com/HiraokaHyperTools/msgreader/commit/f474d0233e8684c22e0e3229ca8acd735031d2ff
   https://github.com/HiraokaHyperTools/msgreader/commit/a5535cac633df1631a430db3abd56da5a597a10d
   https://github.com/HiraokaHyperTools/msgreader/commit/441bcd2dcc2c8042a13450349ff696fbbce3ade0
   https://github.com/HiraokaHyperTools/msgreader/commit/84942ec98ab6a40fcf05bda82a97ba97fec0d24a
   https://github.com/HiraokaHyperTools/msgreader/commit/7689c425b5e1d7028e57fd98e010fbb8df889e59
3) Doc updates and docup.js (most likely irrelevant):
   https://github.com/HiraokaHyperTools/msgreader/commit/8e40bc9e7e581d109c8d978ded59cede72b4500a
   https://github.com/HiraokaHyperTools/msgreader/commit/941ec2ca92e8a8803839b6d7570cb88f0b0bca59
   https://github.com/HiraokaHyperTools/msgreader/commit/14571bff320cb22f98a6cedd53fa0f894b0df69a
   https://github.com/HiraokaHyperTools/msgreader/commit/022fa4e23aaefc85088ae1bfae882ef9ffc14190
4) 1.10.0 is released. Everything works, except compressedRtf which is broken (incomplete Uint8Array) - https://github.com/HiraokaHyperTools/msgreader/issues/19
5) 1.11.0 is released, which fixed compressedRtf but broke a lot of the other fields, which now come out as Uint8Array instead of String:
   https://github.com/HiraokaHyperTools/msgreader/commit/9c99db2fc6595aa78e89cefa8a19efea375deb5f
   https://github.com/HiraokaHyperTools/msgreader/commit/dd20240e840ed190e59e4e31cf251d03035eef13
   https://github.com/HiraokaHyperTools/msgreader/commit/43b72ea891c3abe383cf6311a0bbed7404c82029

If I had to guess, this is the culprit: https://github.com/HiraokaHyperTools/msgreader/commit/9c99db2fc6595aa78e89cefa8a19efea375deb5f

More precisely, this line: https://github.com/HiraokaHyperTools/msgreader/commit/9c99db2fc6595aa78e89cefa8a19efea375deb5f#diff-df3542ef2be5d49cbfe3ab8237efe551439478c9dd3f8beec8337940efc39190R109

From my understanding, it moves 001e out of TYPE_MAPPING, so nothing will ever be matched as a string.

Either one of the three 1.11.0 commits broke something on its own, or one of them did so in combination with the 1.10.0 refactoring.
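
To illustrate the suspicion, here is a minimal sketch of a type-code lookup with a binary fallback. The table and function names are hypothetical and are not taken from the msgreader source; the point is only that if '001e' is missing from the lookup, every property of that type falls through to the raw-bytes branch.

```typescript
type FieldKind = "string" | "unicode" | "binary";

// Hypothetical lookup tables keyed by the 4-hex-digit property type code.
const mappingWith001e: Record<string, FieldKind> = {
  "001e": "string",
  "001f": "unicode",
  "0102": "binary",
};
const mappingWithout001e: Record<string, FieldKind> = {
  "001f": "unicode",
  "0102": "binary",
};

// Resolve how a property should be decoded; unknown type codes fall back to raw bytes.
function resolveKind(mapping: Record<string, FieldKind>, typeCode: string): FieldKind {
  return mapping[typeCode.toLowerCase()] ?? "binary";
}

console.log(resolveKind(mappingWith001e, "001E"));    // string
console.log(resolveKind(mappingWithout001e, "001E")); // binary (the value would surface as a Uint8Array)
```

A file whose text properties all use 001f would not be affected by such a gap, which could explain why some .msg files still parse correctly.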

kenjiuno commented 2 years ago

Hi, I would like to examine the msg file you tested. If you don't mind, could you send it to me at ku@digitaldolphins.jp?

I'm surprised that the Uint8Array problem also occurs in the online version.

Unfortunately, I cannot reproduce this yet! I have tested with the 16 test msg files listed at https://github.com/HiraokaHyperTools/msgreader/tree/master/test and none of them produced Uint8Array output for text fields.

A memo.msg
A schedule.msg
attachAndInline.msg
longerDifat.msg
longerFat.msg
msgInMsg.msg
msgInMsgInMsg.msg
sent.msg
sent2.msg
Subject.msg
test1.msg
test2.msg
unicode1.msg
voteItems.msg
voteNo.msg
voteYes.msg

I also tested with TestOuterEmail1.msg and TestOuterEmail2.msg.
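
A check like this can be automated. The following is a rough sketch, assuming the parsed result is a plain object shaped like the dumps earlier in this issue; the function name and the decision to exempt compressedRtf are illustrative choices, not part of msgreader.

```typescript
// Recursively collect the paths of all Uint8Array values in a parsed result.
// compressedRtf is skipped because it is expected to be binary.
function findBinaryFields(value: unknown, path: string = "", found: string[] = []): string[] {
  if (value instanceof Uint8Array) {
    found.push(path);
  } else if (Array.isArray(value)) {
    value.forEach((item, i) => findBinaryFields(item, `${path}[${i}]`, found));
  } else if (typeof value === "object" && value !== null) {
    const record = value as Record<string, unknown>;
    for (const key of Object.keys(record)) {
      if (key === "compressedRtf") continue;
      findBinaryFields(record[key], path ? `${path}.${key}` : key, found);
    }
  }
  return found;
}

// Trimmed-down stand-in for the broken output reported in this issue.
const brokenSample = {
  dataType: "msg",
  recipients: [{ dataType: "recipient", name: new Uint8Array([1, 2]), recipType: "to" }],
  subject: new Uint8Array([15, 0, 0, 0, 3, 0, 0, 0]),
  compressedRtf: new Uint8Array([79, 82, 0, 0]),
};

console.log(findBinaryFields(brokenSample)); // [ 'recipients[0].name', 'subject' ]
```

An empty result for a file would mean that none of its text fields came back as raw bytes.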

A memo.msg:

[Screenshot: 2021-11-05_18h13_14]

MsgReader's test suite covers this kind of regression too.

See the "√ exact match with pre rendered data (except on compressedRtf)" lines in the test output further down.

If A memo.msg doesn't produce the expected JSON data, the test fails:

https://github.com/HiraokaHyperTools/msgreader/blob/967d92628db2d846bfa8b579ae2d46845c26f918/test/A%20memo.json#L1-L14
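
The idea behind such a test can be sketched as follows. This is not the actual msgreader test code; the function names and the sample values are made up for illustration, and the comparison simply serializes both sides to JSON while leaving out compressedRtf.

```typescript
// Serialize a parsed result while leaving out compressedRtf, so that the large
// binary payload does not take part in the comparison.
// Note: this comparison is sensitive to key order.
function toComparableJson(parsed: unknown): string {
  return JSON.stringify(parsed, (key, value) => (key === "compressedRtf" ? undefined : value), 2);
}

// Compare a freshly parsed result with a stored JSON snapshot.
function matchesSnapshot(parsed: unknown, snapshotJson: string): boolean {
  return toComparableJson(parsed) === toComparableJson(JSON.parse(snapshotJson));
}

// Hypothetical snapshot content, standing in for a pre-rendered JSON file.
const snapshot = JSON.stringify({ dataType: "msg", subject: "A memo", senderName: "Someone" });

const goodResult = {
  dataType: "msg",
  subject: "A memo",
  senderName: "Someone",
  compressedRtf: new Uint8Array([1, 2, 3]),
};
const regressedResult = {
  dataType: "msg",
  subject: new Uint8Array([15, 0, 0, 0]),
  senderName: "Someone",
};

console.log(matchesSnapshot(goodResult, snapshot));      // true
console.log(matchesSnapshot(regressedResult, snapshot)); // false
```

A text field that regresses to a Uint8Array no longer serializes to the same JSON, so a comparison of this kind catches it, provided a test file that uses the affected property type exists.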

The tests can be run with yarn:

H:\Proj\msgreader>yarn
yarn install v1.22.4
[1/5] Validating package.json...
[2/5] Resolving packages...
success Already up-to-date.
$ npm run build && npm run test

> @kenjiuno/msgreader@1.12.0-alpha.2 build
> tsc

> @kenjiuno/msgreader@1.12.0-alpha.2 test
> npm run mocha

> @kenjiuno/msgreader@1.12.0-alpha.2 mocha
> set NODE_ENV=test && mocha

  MsgReader
    test1.msg
      √ exact match with pre rendered data (except on compressedRtf)
      √ compare rtf
    test2.msg
      √ exact match with pre rendered data (except on compressedRtf)
      √ verify attachment: A.txt
      √ compare rtf
    msgInMsg.msg
      √ exact match with pre rendered data (except on compressedRtf)
      √ testMsgAttachment0 === testMsgAttachments0
      √ re-parse and verify rebuilt inner testMsgAttachments0
      √ compare rtf
    msgInMsgInMsg.msg
      √ exact match with pre rendered data (except on compressedRtf)
      √ re-parse and verify rebuilt inner testMsgAttachments0
      √ re-parse and verify rebuilt inner testMsgAttachments0AndItsAttachments0
      √ compare rtf
    Subject.msg
      √ exact match with pre rendered data (except on compressedRtf)
      √ compare rtf
    sent.msg
      √ exact match with pre rendered data (except on compressedRtf)
      √ compare rtf
    sent2.msg
      √ exact match with pre rendered data (except on compressedRtf)
      √ compare rtf
    longerFat.msg
      √ re-parse and verify rebuilt inner testMsgAttachments0
    longerDifat.msg
      √ re-parse and verify rebuilt inner testMsgAttachments0
    attachAndInline.msg
      √ exact match with pre rendered data (except on compressedRtf)
      √ compare rtf (46ms)
    voteItems.msg
      √ exact match with pre rendered data (except on compressedRtf)
      √ compare rtf
    voteNo.msg
      √ exact match with pre rendered data (except on compressedRtf)
    voteYes.msg
      √ exact match with pre rendered data (except on compressedRtf)
    A schedule.msg
      √ exact match with pre rendered data (except on compressedRtf)
      √ compare rtf
    A memo.msg
      √ exact match with pre rendered data (except on compressedRtf)
      √ compare rtf

  Burner
    Compare file contents among Burner/Reader
      √ file size 4095
      √ file size 4096
      √ file size 8192
      √ file size 64000
      √ file size 64513
      √ file size 129537
      √ file size 1024 * 8192 (100ms)
      √ file size 1024 * 8192 * 2 (199ms)
      √ file size 1024 * 8192 * 3 (308ms)

  toHexStr
    √ tests

  DataStream
    √ little.readUint32
    √ big.readUint32
    √ little.offset.readUint32
    √ big.offset.readUint32
    √ little.buffer.readUint32
    √ little.buffer.offset.readUint32
    √ little.readUint32Array
    √ little.readInt32Array
    √ little.readUint16Array
    √ little.readInt16Array
    √ little.readUint32Array +offset
    √ little.readInt32Array +offset
    √ little.readUint16Array +offset
    √ little.readInt16Array +offset

  msftUuidStringify
    √ basic

  toHex
    √ toHex1
    √ toHex2
    √ toHex4

  59 passing (1s)

Done in 10.82s.

I want to determine whether this is a problem with the msg file or with MsgReader.

GTCrais commented 2 years ago

Unfortunately I can't provide the exact .msg file which I'm using because it contains confidential data, but I will try to create one which can be used to reproduce the error.

The problem can't be the .msg file because that exact file works fine with 1.9.0 version, and 1.10.0 version (except in this one the compressedRtf is broken). The issue happens only with versions 1.11.0 and above.

Since I'm not familiar with the codebase of this project, I just want to re-confirm that 001e is really supposed to be out of TYPE_MAPPING here - https://github.com/HiraokaHyperTools/msgreader/commit/9c99db2fc6595aa78e89cefa8a19efea375deb5f#diff-df3542ef2be5d49cbfe3ab8237efe551439478c9dd3f8beec8337940efc39190R109 ?

In any case, I will try to find a non-confidential .msg that causes the same error and provide it to you.

kenjiuno commented 2 years ago

Thanks for your patience. Your analysis helped a lot!

1.12.0-alpha.3 is published.

Also you can try msgreader_demo (msgreader@1.12.0-alpha.3). https://hiraokahypertools.github.io/msgreader_demo/

Press F5 in case an older version is cached.

> The problem can't be the .msg file because that exact file works fine with 1.9.0 version, and 1.10.0 version (except in this one the compressedRtf is broken). The issue happens only with versions 1.11.0 and above.
>
> Since I'm not familiar with the codebase of this project, I just want to re-confirm that 001e is really supposed to be out of TYPE_MAPPING here - 9c99db2#diff-df3542ef2be5d49cbfe3ab8237efe551439478c9dd3f8beec8337940efc39190R109 ?

You are right. It is clear that the bug was introduced by that misplacement. Sorry!

0x001E is used for non-Unicode strings (PtypString8). 0x001F is used for Unicode strings (PtypString).

[MS-OXCDATA]: Property Data Types | Microsoft Docs

FYI, my test data files were exported from Outlook 2013. I'm not sure why 0x001F is selected there: perhaps recent versions of Outlook prefer Unicode strings, or perhaps Asian-language versions of Outlook do.
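
The difference between the two string types can be sketched like this. It is an illustration only, not msgreader's decoder: the function name is hypothetical, and 0x001E data is treated as Latin-1 for simplicity, whereas real files may use another code page.

```typescript
// Decode a string property from raw bytes according to its property type code.
// Assumption: 0x001E data is read as Latin-1 here; real files may use another code page.
function decodeStringProperty(typeCode: number, bytes: Uint8Array): string {
  let text = "";
  if (typeCode === 0x001e) {
    // Non-Unicode string: one byte per character.
    for (let i = 0; i < bytes.length; i++) {
      text += String.fromCharCode(bytes[i]);
    }
  } else if (typeCode === 0x001f) {
    // Unicode string: UTF-16LE, two bytes per code unit, low byte first.
    for (let i = 0; i + 1 < bytes.length; i += 2) {
      text += String.fromCharCode(bytes[i] | (bytes[i + 1] << 8));
    }
  } else {
    throw new RangeError(`not a string property type: 0x${typeCode.toString(16)}`);
  }
  // Drop trailing NUL terminators if any are present.
  return text.replace(/\0+$/, "");
}

const ansiBytes = new Uint8Array([0x45, 0x58, 0x00]);          // "EX" followed by NUL
const unicodeBytes = new Uint8Array([0x45, 0x00, 0x58, 0x00]); // "EX" in UTF-16LE

console.log(decodeStringProperty(0x001e, ansiBytes));    // EX
console.log(decodeStringProperty(0x001f, unicodeBytes)); // EX
```

Both inputs decode to the same text, but only if the decoder recognizes both type codes; a file that uses 0x001E needs the first branch.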

GTCrais commented 2 years ago

Perfect, thank you so much! Everything works now!