HiraokaHyperTools / msgreader


Improve performance by selectively decoding required properties. #37

Closed abdulghaffar349 closed 1 year ago

abdulghaffar349 commented 1 year ago

Thank you for developing this incredible package.

I have been using MsgReader and find it to be memory efficient, which is commendable. However, I have noticed that when calling the getFileData function, it attempts to decode all the properties mentioned in the FieldsData. This process can be quite slow, especially when dealing with a large number of properties.

I was wondering if there is a way to decode only the properties that are required. For instance, const senderEmail = msgReader.getProperty(PidTagSenderEmailAddress)

Being able to selectively decode required properties would greatly improve the performance of the package, especially in scenarios where decoding all possible properties is not necessary. Is there any existing functionality or workaround to achieve this?

Thank you once again for your hard work on this package. I appreciate any guidance or suggestions you can provide.

kenjiuno commented 1 year ago

Hi.

Basically, a package like msgreader is a heavy wrapper around an underlying file format. It helps to share how that format works so that we can discuss a better solution.

The .msg file format is based on CFBF (Compound File Binary Format). A recent version of the 7-Zip file manager can open it.


There are many files stored in the root folder, such as __substg1.0_0C1F001F. These files are the actual data behind what this kind of library calls properties.

Pressing F3 on the file __substg1.0_0C1F001F opens it in Notepad for a preview.


This is the basic way to extract a property by hand. The question is how to automate and optimize this kind of work in software.

Implementations of CFBF can be found on npm, for example:
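As a first step toward automating this, the stream name itself can be decoded. In the .msg format, a name such as __substg1.0_0C1F001F is the prefix __substg1.0_ followed by eight hex digits: the first four are the property ID (0C1F is PidTagSenderEmailAddress) and the last four are the property type (001F is a Unicode string). The following is a minimal sketch; parseSubstgName is a hypothetical helper and not part of msgreader's API.

```typescript
// Hypothetical helper for illustration; msgreader does not export this.
interface SubstgName {
  propertyId: string;   // four lowercase hex digits, e.g. "0c1f"
  propertyType: string; // four lowercase hex digits, e.g. "001f"
}

// Matches single-valued property streams only. Names with a suffix
// (used for multi-valued properties) are deliberately not matched here.
const SUBSTG_PATTERN = /^__substg1\.0_([0-9A-Fa-f]{4})([0-9A-Fa-f]{4})$/;

function parseSubstgName(name: string): SubstgName | null {
  const match = SUBSTG_PATTERN.exec(name);
  if (match === null) {
    return null; // not a property stream in the simple form
  }
  return {
    propertyId: match[1].toLowerCase(),
    propertyType: match[2].toLowerCase(),
  };
}
```

For example, parseSubstgName("__substg1.0_0C1F001F") yields propertyId "0c1f" and propertyType "001f", while a name that does not follow the pattern yields null.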

compound-binary-file-js - npm

Although msgreader also includes a Reader class that supports CFBF reading, that implementation has a legacy design and may contain bugs that appear only in rare cases. In fact, I fixed a terrible bug in this forked msgreader this February: https://github.com/HiraokaHyperTools/msgreader/commit/9908c12cca4283fa27e08c488948f275eff8b8b7

If you need more performance when reading msg files, and only for a limited set of cases, direct access to the msg file may be the better choice. Ultimately, the best solution depends on the use case...

abdulghaffar349 commented 1 year ago

Hi,

Firstly, thanks for your timely reply.

I have worked with msg-parser, where we can get and decode a property using the getProperty method. The issue with msg-parser is that it consumes a lot of memory. I tried to parse a 70 MB MSG file and it used around 2 GB, which crashes the Electron application. With msgreader, the same file consumed only around 170 MB of memory, which is a huge difference.

But msg-parser took 7 seconds to parse the same file, while msgreader took around 30 seconds. I think the difference is due to decoding: getFileData parses a lot of properties that are not required for my use case, whereas with msg-parser I can load only the required ones.

Currently, I have little time to meet my deadlines, so I don't want to dig into compound files and how they work. It would be great to find a ready-made solution.

kenjiuno commented 1 year ago

Hi. Thanks for the detailed measurements of CPU and memory usage for msg-parser and msgreader. From a rough, quick review, my conclusion is that the slowness comes from reading data from the CFBF file system (especially the readDataByBlockSmall method) rather than from decoding properties.

Although I'm not sure what your 70 MB test .msg file contains, I was able to produce a roughly 35 MB .msg file by forwarding 124 mail messages from the Outlook app.

In msgreader, the set of predefined properties can be reduced by editing NAME_MAPPING in const.ts.

      // example (use fields as needed)
      NAME_MAPPING: {
        // email specific
        '001a': 'messageClass',
        '0037': 'subject',
      },

There are only 2 in my test. Nevertheless, the high CPU time cost doesn't change much.

I'll look into the cause of the performance problem in more detail.
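To make the idea of a reduced mapping concrete, here is a minimal sketch of how a mapping of that shape could be used to pick out only the wanted property streams by name. It is an illustration only and is not how msgreader is implemented internally; selectWantedStreams and the sample stream names are hypothetical, and the sketch assumes the mapping keys are lowercase four-digit property IDs as in the example above. As noted, filtering like this does not remove the cost of reading from the CFBF container itself.

```typescript
// Hypothetical mapping in the same shape as the NAME_MAPPING example.
const WANTED_FIELDS: Record<string, string> = {
  '001a': 'messageClass',
  '0037': 'subject',
};

// Returns an object mapping field name -> stream name for wanted properties.
function selectWantedStreams(
  streamNames: string[],
  wanted: Record<string, string>,
): Record<string, string> {
  const pattern = /^__substg1\.0_([0-9A-Fa-f]{4})[0-9A-Fa-f]{4}$/;
  const selected: Record<string, string> = {};
  for (const name of streamNames) {
    const match = pattern.exec(name);
    if (match === null) {
      continue; // not a simple property stream
    }
    const propertyId = match[1].toLowerCase();
    if (Object.prototype.hasOwnProperty.call(wanted, propertyId)) {
      selected[wanted[propertyId]] = name; // keep only mapped properties
    }
  }
  return selected;
}

// Sample stream names (made up for illustration).
const sampleStreams = [
  '__substg1.0_001A001F',
  '__substg1.0_0037001F',
  '__substg1.0_0C1F001F',
  '__properties_version1.0',
];
const wantedStreams = selectWantedStreams(sampleStreams, WANTED_FIELDS);
```

Here wantedStreams contains only the messageClass and subject entries; the 0C1F stream is skipped because it is not in the mapping.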

kenjiuno commented 1 year ago

OK, I was able to optimize some of the bottlenecks, and there is a significant performance improvement. Please try @kenjiuno/msgreader@1.20.0-alpha.1

Standalone demo site: https://hiraokahypertools.github.io/msgreader_demo/

kenjiuno commented 1 year ago

The slowness might be caused by the many uses of new Uint8Array(...) in msgreader.

kenjiuno commented 1 year ago

Just a note:

A TypedArray (Uint8Array and the like) is a view; it doesn't hold a memory buffer itself. Instead, an ArrayBuffer holds the actual memory. new Uint8Array(length) automatically allocates a new ArrayBuffer.

The problem is that ArrayBuffer is a Transferable object, which requires synchronization between OS threads (Workers). Creating many ArrayBuffers therefore costs a lot of CPU time. It is fundamentally different from new Object().
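The view/buffer distinction can be seen in a few lines. This is a minimal sketch using only standard TypedArray methods: subarray returns a view on the same ArrayBuffer, while slice (like new Uint8Array(length)) allocates a new one. It only illustrates the distinction; whether the allocations are the actual cause of the slowness in msgreader is the hypothesis stated above, not something this sketch measures.

```typescript
// Allocates a new 16-byte ArrayBuffer behind the scenes.
const whole = new Uint8Array(16);

// subarray: a new view on the SAME ArrayBuffer (no new buffer allocated).
const view = whole.subarray(4, 8);

// slice: copies the bytes into a NEW ArrayBuffer.
const copy = whole.slice(4, 8);

// Write through the original array after both were created.
whole[4] = 42;

// The view observes the write because it shares the memory.
const viewSeesWrite = view[0] === 42;
// The copy was taken before the write and has its own memory.
const copySeesWrite = copy[0] === 42;

const viewSharesBuffer = view.buffer === whole.buffer;
const copySharesBuffer = copy.buffer === whole.buffer;
```

When reading many small blocks out of one large file buffer, preferring views over fresh allocations is one way to reduce the number of ArrayBuffers created.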

abdulghaffar349 commented 1 year ago

Hi @kenjiuno,

I really appreciate your dedication and time spent investigating the issues. Your efforts are truly commendable.

Last night I limited NAME_MAPPING to only the required fields, as you suggested earlier, and it resulted in a significant reduction in time. It's great to see such tangible improvements in performance.

I'll try out the new changes you mentioned. Given the size of the files I'm working with, which can be up to 250 MB, any enhancements in performance will be incredibly valuable.

Once again, thank you for your dedication and hard work.