Closed exander77 closed 4 years ago
Hi, thanks for the report.
Regarding Context
, could you please specify some more information including:
x.y.z
Regarding Detailed Description
, are you able to construct messages that allow you to reproduce each of the above errors?
Neither the version nor operating system matters. These are the same bugs I encountered using ProtonBridge half a year ago. Nothing changed. Hence my complete dissatisfaction. I reported here, on support, no change.
But the system is Ubuntu Linux and Linux version 1.0.0 (latest as of today).
@jameshoulahan I already sent a pack of massages to Bridge team in April (April 23rd, 2020) and broken S/MIME messages later on (April 27th, 2020). These were about Bridge, but exactly the same messages fail in Import-Export tool. Nobody is fixing anything.
I rewrote the message parser, partially based on your feedback. The new parser will be used in the upcoming version of the Import-Export app and in Bridge. If you check the commits in this repository, you'll see that things are indeed changing.
If you wish, you can try a beta version of the app with the new parser from here: https://protonmail.com/download/beta/ie/protonmail-import-export-app_1.1.1-1_amd64.deb
There are a few known quirks with the new parser which will be fixed in an upcoming beta deployment of v1.1.2, but I would appreciate any further feedback you can give me. As you said back in April (?), parsing is difficult, and I agree.
One thing I can report is that neither Bridge nor Import-Export tool is losing an attachment from an email which leads me to believe that the bug was discovered and covered up as I have not received an apology email: Dear customers, there was a bug in ProtonBridge which caused a loss of an attachment.
@jameshoulahan I would like to try it, but I would like to request more space on my account, having three copies of emails simultaneously pushed me to 10GB yesterday and I am not ready to remove them yet and start over, I would do Import-Export to another address.
For such requests, you'll have to contact customer support.
@jameshoulahan I did already, but support...
@jameshoulahan They will respond in like a week if they even do.
Wrote them on October 2nd, 2020. This was when I did first run with the new tool. No response yet.
@jameshoulahan Btw, I didn't notice my every email were missing an attachment until I did a comparison last night, so it went completely unnoticed for half a year. It is really hard to spot errors in large data sets without being able to in-depth analysis of them. I had to write several tools in Python to be able to easily compare messages are migrated. I am now working on a comparison of actual parts of the message. I wonder what I will find there.
Let's keep this issue focussed on one thing, namely the parser issues that you described in detailed description
. For attachment issues, you already have #86 .
Let's postpone discussion about specific parser issues until you have tried an import using the rewritten parser. Virtually none of the old code is there anymore; as such, it will only lead to confusion if we begin deep-diving into issues caused by old code.
Thanks for your continued patience and for helping us to improve our product.
@jameshoulahan When will be v1.1.2 available? If it is in a few days I may wait for it, if it is longer, I will try it ASAP if Support grants me some additional space.
All things going according to plan, we intend to deploy tomorrow on tuesday. But I'm sure you can appreciate that I can't give you any concrete guarantees.
I wrote another mail to support, I would really like to try the new version. But I would need more space.
Funny thing, I had like 4,9GB used on my account and now I have 6,9GB after redoing migration using Import-Export tool. The migration in April lost like 2GB of attachments. I am sure all mails were there. So it was the attachments.
Why is everything with ProtonMail so frustrating? Support responding after 5 days, me responding after 5 minutes and another day passed... I could make like 10 different accounts for testing each with 500MB, but I can't use ProtonBridge with them. And I would be probably banned based on Reddit complaints.
@jameshoulahan So, I found a lot of problems with the previous batch and decided to delete them, so I am running v1.1.2 now. So far it seems sad, I am 11% in and I already have 165 failures.
@jameshoulahan So, 23% in and we are at 551 failures. So over the original...
Also, when deleting emails from my previous migration, it seems that my ThunderBird account was wrecked again. I selected all emails, pressed delete. ThunderBird says all emails are gone, but they are not gone on the server, and the number of emails is not changing. So probably an error somewhere.
So, 39%, 1100 failures.
Results with v1.1.2:
66 "failed to import: failed to create draft: too many retries: Invalid sender"
3 "failed to import: failed to import message: Error while parsing message"
3 "failed to import: failed to parse message: failed to collect attachments: can not get encoding for '' (or '')"
14 "failed to import: failed to parse message: failed to collect attachments: mime: invalid media parameter"
29 "failed to import: failed to parse message: failed to create new parser: encoding error: unhandled encoding \"8t\""
7 "failed to import: failed to parse message: failed to create new parser: illegal base64 data at input byte 0"
3 "failed to import: failed to parse message: failed to create new parser: illegal base64 data at input byte 416"
2 "failed to import: failed to parse message: failed to create new parser: illegal base64 data at input byte 448"
4 "failed to import: failed to parse message: failed to create new parser: illegal base64 data at input byte 5"
3 "failed to import: failed to parse message: failed to create new parser: message: malformed MIME header line: --9929ef3793b8ac4f7afe31260fa69e6a--\r\n"
2 "failed to import: failed to parse message: failed to create new parser: message: malformed MIME header line: --be2b4fba2ec876e0d6baea9120d5eabde--\r\n"
3 "failed to import: failed to parse message: failed to create new parser: message: malformed MIME header line: --bed3fc21cebe086ca423bad0f7ed139b0--\r\n"
3 "failed to import: failed to parse message: failed to create new parser: message: malformed MIME header line: Děkujeme za Váš zájem o produkty z internetového obchodu HyperMotoShop.cz. Vaše objednávka byla přijata a bude v nejbližší době vyřízena.\r\n"
230 "failed to import: failed to parse message: failed to create new parser: multipart: boundary is empty"
3 "failed to import: failed to parse message: failed to create new parser: multipart: NextPart: EOF"
2 "failed to import: failed to parse message: failed to create new parser: quotedprintable: invalid bytes after =: \"\""
5 "failed to import: failed to parse message: failed to create new parser: quotedprintable: invalid unescaped byte 0x11 in body"
62 "failed to import: failed to parse message: failed to create new parser: unexpected EOF"
4 "failed to import: failed to parse message: failed to parse message header: can not get encoding for '' (or '')"
8 "failed to import: failed to parse message: failed to parse message header: mail: expected comma"
1745 "failed to import: failed to parse message: failed to parse message header: mail: header could not be parsed"
3 "failed to import: failed to parse message: failed to parse message header: mail: invalid string"
283 "failed to import: failed to parse message: failed to parse message header: mail: missing @ in addr-spec"
6 "failed to import: failed to parse message: failed to parse message header: mail: missing '@' or angle-addr"
136 "failed to import: failed to parse message: failed to parse message header: mail: no address"
4 "failed to import: failed to parse message: failed to parse message header: mail: no angle-addr"
3 "failed to import: failed to parse message: failed to parse message header: mail: trailing dot in atom"
2 "failed to import: failed to parse message: failed to parse message header: mail: unclosed angle-addr"
Just glancing over it... it has the same error types as before and several new ones.
After this import, I also have a ton of messages in my folder which does not return RFC822 content. It seems like some messages were imported and there are records in the ProtonMail database for them, but the content is not actually there.
After this import, I also have a ton of messages in my folder which does not return RFC822 content. It seems like some messages were imported and there are records in the ProtonMail database for them, but the content is not actually there.
This resolved itself, so probably a server issue?
I'm also experiencing similar parsing issues. Specifically, I've noticed this happens consistently for emails with .docx attachments.
"Error":"failed to import: failed to parse message: failed to collect attachments: mime: unexpected content after media subtype"
What will help us the most is if you can find the messages causing the problems and send us raw examples, if possible, to bridge@protonmail.ch. Ideally examples of messages causing the most happening issue(s). Because this is not good to share openly on GitHub, I'm closing this issue. Please, contact our customer support.
@horejsek Reopen the issue! I already sent you a batch of messages in April and the same messages are still causing errors. There are new messages causing problems and I will prepare a batch with them. But the issue is there and it is not fixed in any way. It is worse.
Comparing old and new results, we see that the following errors appear in both the old and the new parser. There is a definite improvement there though (assuming total number of messages imported is the same). | Old | New |
---|---|---|
140 "Invalid sender" | 66 "Invalid sender" | |
3 "Error while parsing message" | 3 "Error while parsing message" | |
11 "malformed MIME header line | 11 "malformed MIME header line | |
249 "mime: invalid media parameter" | 14 "mime: invalid media parameter" | |
55 "NextPart: EOF" | 3 "NextPart: EOF" |
However, it's obvious there are quite a few issues affecting only the new parser. The most significant appear to be the following:
mail: header could not be parsed
mail: missing @ in addr-spec
and mail: no address
While developing the new parser, the focus was on correctness; we no longer reorder headers or silently strip/drop things, and as I mentioned in #85, the new parser currently only imports things it can parse "confidently". Malformed data, in most cases, leads to an error. In our test accounts, the new parser performed very well; there were little to no messages that could not be imported. However, it seems you have quite a few "special" messages.
As @horejsek said, we would greatly appreciate a sample of the messages that are causing problems for the new parser. This would help us determine whether the messages are okay and the errors are caused by bugs in the new parser itself, or whether the messages are indeed malformed in some way, which would encourage us to bump priority of better handling of "non-critical" parsing errors.
Also @exander77, you mentioned you sent us a batch of messages in April. Just letting you know, we did receive them and we incorporated them into our testing suite. A number of those errors were fixed but some still remain due to various issues in the MIME structure of those messages.
@jameshoulahan I would like to prepare another batch. But it is hard with Import-Export tool because it does not log failed messages and does not even log Message-Id. The tool is pretty user-hostile in that regard. As it is I would have to do it by hand to pick message by subject, but there are many messages with the same subject. Or Am I missing something?
Then I modified imapsync, I logged each failed message with the Message-Id and error in the name of the file.
I can say right away that the nonstandard dates are quite common. All messages were migrated from Gmail so I doubt there are broken addresses there.
I just looked into one case I think it is this:
To: "<Some name>" <someaddress@somewhere.com>
I think you are mishandling special characters in the name in this case <
and >
. This was sent from a completely normal email client. I looked into some other messages and it this would be the most likely cause. Also, some clients use "<Beze jména>", "
The parser correctly handles messages with To: "<Some name>" <someaddress@somewhere.com>
; the <
and >
are not the cause of the problems.
Also, some clients use "<Beze jména>", "" for the address which does not have a contact filled.
This could perhaps be the problem. I look forward to your batch of messages.
Regarding the difficulties identifying exactly which messages are causing issues, this is something we can look into improving in an upcoming app design iteration.
The parser correctly handles messages with
To: "<Some name>" <someaddress@somewhere.com>
; the<
and>
are not the cause of the problems.Also, some clients use "<Beze jména>", "" for the address which does not have a contact filled.
This could perhaps be the problem. I look forward to your batch of messages.
Regarding the difficulties identifying exactly which messages are causing issues, this is something we can look into improving in an upcoming app design iteration.
It is like this "<Beze jména>" followed by email or "" followed by email (not to make confusion).
It would be best if you could provide me with an Import-Export tool version with extended logging:
With this, I can provide you with very good input for debugging. (I already wrote a tool for myself to dump messages by Message-Id).
@jameshoulahan Are there sources for Import-Export tool? I would hack some additional logging for me there in the meantime.
The sources for the Import-Export app are in this repository. You can build it with make build-ie
or run a shell with make run-ie
. Please note that go1.14/go1.15 suffers from a bug in the mail.ParseDate
function (see here) so if you build it yourself, use go1.13.
Here is a patch you can apply to pkg/message/parser.go
that dumps any raw messages that couldn't be imported to ~/.ie-errors/<subject><randomInt>.eml
.
diff --git a/pkg/message/parser.go b/pkg/message/parser.go
index aeb9615..f00d359 100644
--- a/pkg/message/parser.go
+++ b/pkg/message/parser.go
@@ -21,9 +21,12 @@ import (
"bytes"
"fmt"
"io"
+ "io/ioutil"
"mime"
"net/mail"
"net/textproto"
+ "os"
+ "path/filepath"
"strings"
"github.com/ProtonMail/proton-bridge/pkg/message/parser"
@@ -38,7 +41,50 @@ import (
func Parse(r io.Reader, key, keyName string) (m *pmapi.Message, mimeBody, plainBody string, attReaders []io.Reader, err error) {
logrus.Trace("Parsing message")
- p, err := parser.New(r)
+ b, err := ioutil.ReadAll(r)
+ if err != nil {
+ return
+ }
+
+ defer func() {
+ if err == nil {
+ return
+ }
+
+ parseErr := err
+
+ homeDir, err := os.UserHomeDir()
+ if err != nil {
+ return
+ }
+
+ errorsDir := filepath.Join(homeDir, ".ie-errors")
+
+ if err := os.MkdirAll(errorsDir, 0700); err != nil {
+ return
+ }
+
+ dumpName := "unknown-subject"
+ if m != nil && m.Subject != "" {
+ dumpName = m.Subject
+ }
+
+ f, err := ioutil.TempFile(errorsDir, dumpName)
+ if err != nil {
+ return
+ }
+
+ if _, err := f.Write(b); err != nil {
+ return
+ }
+
+ logrus.
+ WithField("path", f.Name()).
+ WithField("parseErr", parseErr).
+ Error("Failed to parse message, dumped original")
+ }()
+
+ p, err := parser.New(bytes.NewReader(b))
if err != nil {
err = errors.Wrap(err, "failed to create new parser")
return
Feel free to add any additional logging you may find useful.
Edit: seems copying/pasting the diff messed up spaces/tabs. Here's it as an attachment: patch.txt
@jameshoulahan One more question, can I hack it easily not to put messages on the server at all? Just do the parsing and then do nothing?
I could retest the messages really quickly this way. I dumped them in the mbox during the weekend, so I could retest my whole mailbox locally that way.
Yes, the easiest way is to just hack the method which uploads messages to the server like so: diff.txt
@jameshoulahan Well, that is a significantly faster way how to test it!
@jameshoulahan Where is this message coming from "header could not be parsed"? I don't see it in sources.
It's not from bridge. I've only ever seen it come from mail/message.go, line 130. It's the result of a bad/nonstandard date. According to this, it's the only occurrence of this error string in the entire golang source.
@jameshoulahan Oh, I see, you mentioned it before. So it should be easily handled by removing the trailing differences for Date parsing. I would say that based on the numbers above this is the most pressing issue as almost 2/3 of errors are because of this.
Some strange things I see just by glancing over it:
From: <somebody@somwhere.com >
To: somebody@somebody.com, ; This is quite common.
To: somebody
To: <some random ascii text with spaces>
To: <Beze jména> <somebody@somebody.com> ; This is extremely common as well. Some clients seem to does this as I have it on all emails with certain people.
To: =?UTF-8?B?PEJlemUgam3DqW5hPg==?= <somebody@somebody.com> ; This is encoded <Beze jména>. Same as above.
From: somebody@somebody.com:81 ; Probably port parsing problem?
Content-Transfer-Encoding: x-custom-encoding ; Custom encoding
Based on the RFC822, I would say that all of this is ok. There can be pretty much anything in the "From", "To" etc. fields.
This is some kind of strange header sequence from one Israeli company:
MIME-Version: 1.0
Content-Transfer-Encoding: 8t
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
I am not really sure if this is RFC compliant per se as custom encodings should be prefixed with x-, but as nothing is actually encoded by it as it is overridden by the next "Content-Transfer-Encoding"... I would suggest handling it in the same way as any custom encoding.
Another finds:
To: undisclosed-recipients:
From: Name somebody@somewhere.com
To: somebody@somewhere.com, ; Trailing comma appears in To as well and is very common.
Reply-To: somebody@somewhere.com, ; Trailing comma appears in Reply-To as well and is very common.
To: <postmaster@[10.10.10.10]>
From: "GES-ELECTRONICS <GES-ELECTRONICS"@nos.ges.cz, a.s. <ges@ges.cz>>
Where is "invalid media parameter" coming from?
I have also glanced over the unparsable dates (I tried to find different cases):
2 Apr 2014 10:27:40
6 Oct 2016 13:48:22
Fri, 10 Aug 2018 11:09:34 -0700 (GMT-07:00)
Fri, 13 Apr 2012 11:01:05 UT
Fri, 13 Dec 2019 10:43:36 +0000 (GMT+00:00)
Fri, 14 Dec 2018 09:19:48 -0700 (GMT-07:00)
Fri, 17 May 2019 10:46:35 +0000 (GMT+00:00)
Fri, 19 Apr 2013 14:18:25 +0200 (METDST)
Fri, 28 May 2010 10:43:37 (GMT)
Mon, 10 Dec 2018 05:35:52 -0700 (GMT-07:00)
Mon, 10 Dec 2018 22:33:45 -0700 (GMT-07:00)
Mon, 10 Nov 2014 07:30:25 +0100 (Støední Evropa (bìný èas))
Mon, 12 May 2014 21:06:45
Sat, 2 Oct 10 19:59:17 GMT-0700
Tuesday 02 March 2010 04:02:33 PM
Wed, 02 Jun 2010 08:22:33 (GMT)
Wed, 5 Mar 2014 05:27:02 +0100 +0100
Wed, 5 May 10 05:59:05 GMT-0700
Many thanks for the valuable feedback! I hope to find some time soon to take a closer look at all the examples you've provided.
"invalid media parameter" most likely comes from line 173 of mime/mediatype.go, indicating that something went wrong while parsing the media type. It can be caused by various issues. One known example is when there are non-ascii chars in an attachment name.
@jameshoulahan I think this way it will be much efficient, I will send you just the specific headers that cause problems. And I can easily rerun if you make improvements to see what changed. I improved error messages on some places, to wrap error with header name etc. I may send a pull request. Having a header name which causes problems in the log is extremely useful.
@jameshoulahan After we improve this on my personal emails, I can run it on over around 300GB of emails I have at hand. When I have it patched like this (not actually storing the emails), I can run it on a lot of samples and it is even quite fast.
New:
Cc: =?ISO-8859-2?Q?Somebody_Somewhere?= <somebody@somewhere.com>, <somebody@somewhere.com,Somebody/AAA/BBB/CCC,>
To: somebody%somewhere...com
From: "Mailer Daemon" <>
From: =?windows-1250?Q?Spr=E1vce_syst=E9mu?=
From: Správce systému ; Decoded line above, it is a message sent from Windows.
To: "'somebody@somewhere.com.'"
To: Somebody Somewhere <somebody@somewhere. com> <somebody@somewhere.com> ; "Somebody Somewhere <somebody@somewhere. com>" was actually encoded together.
To: "somebody@somewhere.com." <somebody@somewhere.com.>
@jameshoulahan
Regarding the:
failed to create new parser: multipart: boundary is empty
Seems like this is some parse error and incorrect error reporting, the boundary is not empty:
Content-Type: multipart/alternative;
boundary=----NODEMAILER-?=_19-1330086946289
Could be related to boundary starting with four ----
?
Messages with the error actually seem completely normal.
It basically means all messages produced by NODEMAILER are unimportable.
Regarding the:
failed to create new parser: multipart: expecting a new Part; got line "--001a113dc39a1c56d0053a206f3e--\n"
and similar. Parser has a problem when boundary does not end with:
--001a113dc39a1c56d0053a206f3e--\r\n"
but with:
--001a113dc39a1c56d0053a206f3e--\n"
This by all means a correct message.
Regarding the:
failed to create new parser: unexpected EOF
It seems to be caused by the body or part ending with space/spaces. Wtf? The message is completely misleading as removing space solves the problem. I would assume that ending part with space is completely valid.
I have also glanced over the unparsable dates (I tried to find different cases):
I have been through RFC5332 and it all seems compliant (except rare cases):
date-time = [ day-of-week "," ] date time [CFWS]
day-of-week = ([FWS] day-name) / obs-day-of-week
day-name = "Mon" / "Tue" / "Wed" / "Thu" /
"Fri" / "Sat" / "Sun"
date = day month year
day = ([FWS] 1*2DIGIT FWS) / obs-day
month = "Jan" / "Feb" / "Mar" / "Apr" /
"May" / "Jun" / "Jul" / "Aug" /
"Sep" / "Oct" / "Nov" / "Dec"
year = (FWS 4*DIGIT FWS) / obs-year
time = time-of-day zone
time-of-day = hour ":" minute [ ":" second ]
hour = 2DIGIT / obs-hour
minute = 2DIGIT / obs-minute
second = 2DIGIT / obs-second
zone = (FWS ( "+" / "-" ) 4DIGIT) / obs-zone
obs-day-of-week = [CFWS] day-name [CFWS]
obs-day = [CFWS] 1*2DIGIT [CFWS]
obs-year = [CFWS] 2*DIGIT [CFWS]
obs-hour = [CFWS] 2DIGIT [CFWS]
obs-minute = [CFWS] 2DIGIT [CFWS]
obs-second = [CFWS] 2DIGIT [CFWS]
obs-zone = "UT" / "GMT" / ; Universal Time
; North American UT
; offsets
"EST" / "EDT" / ; Eastern: - 5/ - 4
"CST" / "CDT" / ; Central: - 6/ - 5
"MST" / "MDT" / ; Mountain: - 7/ - 6
"PST" / "PDT" / ; Pacific: - 8/ - 7
;
%d65-73 / ; Military zones - "A"
%d75-90 / ; through "I" and "K"
%d97-105 / ; through "Z", both
%d107-122 ; upper and lower case
FWS = ([*WSP CRLF] 1*WSP) / obs-FWS
; Folding white space
ctext = %d33-39 / ; Printable US-ASCII
%d42-91 / ; characters not including
%d93-126 / ; "(", ")", or "\"
obs-ctext
ccontent = ctext / quoted-pair / comment
comment = "(" *([FWS] ccontent) [FWS] ")"
CFWS = (1*([FWS] comment) [FWS]) / FWS
Other multi-character (usually between 3 and 5) alphabetic time zones have been used in Internet messages. Any such time zone whose meaning is not known SHOULD be considered equivalent to "-0000" unless there is out-of-band information confirming their meaning.
Comments are allowed at the end and that pretty much covers this:
Fri, 10 Aug 2018 11:09:34 -0700 (GMT-07:00)
Fri, 13 Dec 2019 10:43:36 +0000 (GMT+00:00)
Fri, 14 Dec 2018 09:19:48 -0700 (GMT-07:00)
Fri, 17 May 2019 10:46:35 +0000 (GMT+00:00)
Mon, 10 Dec 2018 05:35:52 -0700 (GMT-07:00)
Mon, 10 Dec 2018 22:33:45 -0700 (GMT-07:00)
Fri, 19 Apr 2013 14:18:25 +0200 (METDST)
Mon, 10 Nov 2014 07:30:25 +0100 (Støední Evropa (bì�ný èas))
This is covered by obsolete timezone (and comment):
Mon, 12 May 2014 21:06:45
Fri, 13 Apr 2012 11:01:05 UT
Fri, 28 May 2010 10:43:37 (GMT)
Wed, 02 Jun 2010 08:22:33 (GMT)
Sat, 2 Oct 10 19:59:17 GMT-0700
Wed, 5 May 10 05:59:05 GMT-0700
This is covered by missing day of the week and obsolete timezone:
2 Apr 2014 10:27:40
6 Oct 2016 13:48:22
This is covered by obsolete timezones, unknown timezone:
Tuesday 02 March 2010 04:02:33 PM
I actually found only one non-compliant date:
Wed, 5 Mar 2014 05:27:02 +0100 +0100
So it is pretty much a bug in parsing library. Probably somebody did not read the RFC carefully.
I would note here that for example, this is an RFC5332 compliant date:
(What)2 Apr 2014 10 (the) : 27: 40 HELLOWORLD (is wrong with this?)
Expected Behavior
Migrate all emails without expectation and do not modify/break them in any way.
Current Behavior
Is unable to migrate emails or breaks them.
Possible Solution
Maybe fix some damn bugs or test it?
Steps to Reproduce
Context (Environment)
My emails were either not migrated at all or broken.
Detailed Description
Possible Implementation
Do not fuck with email contents.