Closed abdulghaffar349 closed 1 year ago
Hi.
Basically, a package like msgreader is a heavy wrapper around something lower-level. It may be better to share that idea first, so that we can discuss a better solution.
A .msg file is stored in the format known as CFBF (Compound File Binary Format).
We can open it with a recent version of the 7-Zip file manager.
We can see that there are many files stored in the root folder, such as __substg1.0_0C1F001F.
These files are exactly the entities that this kind of library calls properties.
Pressing F3 on the file __substg1.0_0C1F001F opens Notepad with a preview of the selected file.
This is the basic method for extracting the data by hand. The problem is how to automate and optimize this kind of work in software.
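As a rough illustration of what automating that first step could look like, here is a minimal sketch that splits a stream name such as __substg1.0_0C1F001F into its parts. It assumes the MS-OXMSG naming convention, where the prefix is followed by four hex digits of property ID and four hex digits of property type (0x001F being a Unicode string). The function and interface names are made up for this example and are not part of msgreader.

```typescript
// Hypothetical helper, not part of msgreader.
interface SubstgName {
  propertyId: number;   // e.g. 0x0C1F
  propertyType: number; // e.g. 0x001F (Unicode string)
}

// Parse "__substg1.0_IIIITTTT", optionally followed by "-XXXXXXXX",
// the index suffix used for multi-valued properties.
function parseSubstgName(name: string): SubstgName | null {
  const match =
    /^__substg1\.0_([0-9A-F]{4})([0-9A-F]{4})(?:-[0-9A-F]{8})?$/i.exec(name);
  if (match === null) {
    return null; // not a property stream
  }
  return {
    propertyId: parseInt(match[1], 16),
    propertyType: parseInt(match[2], 16),
  };
}
```

With this, the file shown in 7-Zip above parses to property ID 0x0C1F with type 0x001F, while unrelated entries in the container return null and can be skipped.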
Implementations of CFBF can be found on npmjs.
Although this msgreader also includes a Reader class that supports CFBF reading, the implementation has a legacy design and may contain bugs that appear only in rare cases. In fact, I fixed a terrible bug in this forked msgreader this February: https://github.com/HiraokaHyperTools/msgreader/commit/9908c12cca4283fa27e08c488948f275eff8b8b7
If you need more performance when reading msg files, and the code only has to handle a limited set of cases, direct access to the msg file may be the better choice. In the end, the best solution depends on the use case...
Hi,
Firstly, thanks for your timely reply.
I have worked with msg-parser, where we can get and decode a property using the getProperty method. The issue with msg-parser is that it consumes a lot of memory. I tried to parse a 70 MB MSG file and it used around 2 GB, which crashes the Electron application. With msgreader, the same file consumed only around 170 MB of memory, which is a huge difference.
But msg-parser took 7 seconds to parse the same file, while msgreader took around 30 seconds. I think the difference is due to decoding, as getFileData parses a lot of properties that are not required for my use case, whereas with msg-parser I can load only the required ones.
I'm currently short on time because of deadlines, so I don't want to dig into compound files and how they work. It would be great to find a ready-made solution.
Hi
Thanks for the detailed measurements of the CPU and memory usage of msg-parser and this msgreader.
(As a conclusion from my rough, quick review) the slowness comes from reading data from the CFBF file system (especially the readDataByBlockSmall method) rather than from decoding the properties.
Although I'm not sure about the content of the 70 MB .msg file you used, I could produce a .msg file of around 35 MB by forwarding 124 mail messages from the Outlook app.
In this msgreader, the pre-defined properties can be reduced by editing NAME_MAPPING in const.ts.
// example (use fields as needed)
NAME_MAPPING: {
// email specific
'001a': 'messageClass',
'0037': 'subject',
},
There are only 2 entries in my test. Nevertheless, the high CPU time does not change much.
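To show the idea behind a reduced mapping, here is a minimal sketch of decoding only the streams whose property ID appears in a small lookup table. It is not msgreader's actual code: the RawStream shape, decodeUtf16le and decodeWanted are hypothetical names for this example, and it only handles Unicode string properties (type 001F).

```typescript
// Hypothetical shape for an already-extracted CFBF stream.
interface RawStream {
  name: string;      // e.g. "__substg1.0_0037001F"
  bytes: Uint8Array; // raw stream content
}

// Same two entries as the reduced NAME_MAPPING above.
const WANTED: Record<string, string> = {
  "001a": "messageClass",
  "0037": "subject",
};

// Decode UTF-16LE bytes by joining 16-bit code units.
function decodeUtf16le(bytes: Uint8Array): string {
  let text = "";
  for (let i = 0; i + 1 < bytes.length; i += 2) {
    text += String.fromCharCode(bytes[i] | (bytes[i + 1] << 8));
  }
  return text;
}

// Decode only the Unicode string streams listed in the mapping;
// every other stream is skipped without being decoded.
function decodeWanted(
  streams: RawStream[],
  mapping: Record<string, string>,
): Record<string, string> {
  const result: Record<string, string> = {};
  for (const stream of streams) {
    const match = /^__substg1\.0_([0-9A-F]{4})001F$/i.exec(stream.name);
    if (match === null) continue;
    const field = mapping[match[1].toLowerCase()];
    if (field === undefined) continue;
    result[field] = decodeUtf16le(stream.bytes);
  }
  return result;
}
```

A filter like this only saves decoding time. It does not reduce the cost of reading the streams out of the container in the first place.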
I'll look into the cause of the performance problem in more detail.
Ok
I was able to optimize some bottlenecked parts, and there is a significant performance improvement.
Please try @kenjiuno/msgreader@1.20.0-alpha.1
Standalone demo site: https://hiraokahypertools.github.io/msgreader_demo/
The reason for the slowness might be the many uses of new Uint8Array(...) in this msgreader.
Just a note: a TypedArray (Uint8Array or similar) is a view and does not hold the memory buffer itself. Instead, an ArrayBuffer holds the actual memory. new Uint8Array(length) automatically allocates a new ArrayBuffer.
The problem is that ArrayBuffer is a Transferable object that requires synchronization between OS threads (Workers). Thus creating a large number of new ArrayBuffer instances costs a lot of CPU time. It is fundamentally not like new Object().
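To make the view-versus-buffer distinction concrete, here is a small sketch using only standard TypedArray methods. The function name is made up for this example and the code is not taken from msgreader: subarray returns a new view over the same ArrayBuffer, while slice (like new Uint8Array(length)) allocates a fresh one.

```typescript
// Illustrative only: compares a shared view with an allocated copy.
function demoViewVersusCopy() {
  const backing = new ArrayBuffer(16);    // the actual memory
  const whole = new Uint8Array(backing);  // view over all 16 bytes

  const view = whole.subarray(4, 8); // no new ArrayBuffer, shares memory
  const copy = whole.slice(4, 8);    // new ArrayBuffer with copied bytes

  whole[4] = 42; // write through the original view

  return {
    viewSharesBuffer: view.buffer === backing,
    copySharesBuffer: copy.buffer === backing,
    viewFirst: view[0], // sees the write
    copyFirst: copy[0], // copied before the write, unchanged
  };
}
```

Reading a block through subarray avoids one ArrayBuffer allocation per read, at the price that the view stays tied to the lifetime and contents of the original buffer.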
Hi @kenjiuno,
I really appreciate your dedication and time spent investigating the issues. Your efforts are truly commendable.
Last night I limited NAME_MAPPING to only the required fields, as you suggested earlier, and it resulted in a significant reduction in time. It's great to see such tangible improvements in performance.
I'll try out the new changes you mentioned. Given the size of the files I'm working with, which can be up to 250 MB, any enhancements in performance will be incredibly valuable.
Once again, thank you for your dedication and hard work.
Thank you for developing this incredible package.
I have been using MsgReader and find it to be memory efficient, which is commendable. However, I have noticed that when calling the getFileData function, it attempts to decode all the properties mentioned in the FieldsData. This process can be quite slow, especially when dealing with a large number of properties.
I was wondering if there is a way to decode only the properties that are required. For instance,
const senderEmail = msgReader.getProperty(PidTagSenderEmailAddress)
Being able to selectively decode required properties would greatly improve the performance of the package, especially in scenarios where decoding all possible properties is not necessary. Is there any existing functionality or workaround to achieve this?
Thank you once again for your hard work on this package. I appreciate any guidance or suggestions you can provide.