Esr225+gal exp add body regex html filter, add regex multiline flag, bug fixes in body regex filter, altered logging body regex filter, syntax error fixing all over the place mainly semicolons

Galantha commented 1 week ago

Galantha: ESR225+GalExp-AddHTMLRegExpBodyFilter

add body regex html filter add regex multiline flag bug fixes in body regex filter altered logging body regex filter syntax fixing all over the place mainly semicolons - the cardinal sin, fixing that which is not broken add gui check box for new filter add gui check box for multi line flag wrapped two of four of the .exec in try { } catch blocks

notes?: unable to get any gui settings to persist?-> might have something to do with temporary addon loading?

thoughts: first time I have done anything like this

Let me know your thoughts?

RealRaven2000 commented 1 week ago

That's a big change set, too much to put into a single merge. I suggest I cherry pick the smaller changes (such as missing semicolons) manually and merge them into the current branch, then we can do a fresh compare. Can you explain the differences of the regex flavors you added?

One thing I find problematic at the moment is that for every flavor we are adding new menu items to the search dropdown, it gets very unwieldy, I would rather have a single regex body command and then add the flavors as options somehow - it may mean to modify the (as always single) regex parameter to transport additional switches. Maybe something like this:

body regex - matches - /regular expression/(options)

Examples:

body regex - matches - /regular expression/(html)
body regex - matches - /regular expression/(lines)

I could potentially imagine an additional settings control like this, to add these options and build a regex more easily:

Also I believe the new regex match function are all very similar, I will put the common code into a merged util function so we only have the mime body iterator code once in case it needs fixing later.

RealRaven2000 commented 1 week ago

Did the first round of changes (mainly formatting, semicolons). Would like to deal with the recursion exception within the function that generates it. Will look at your other PR first

Galantha commented 1 week ago

on this topic specifically: https://github.com/RealRaven2000/FiltaQuilla/issues/269#issuecomment-2454620541

I feel that code brings limited value to the thunder bird user. If causing error, limited reason to keep.

With the overflow error: My understanding regular expression .exec will return the same array every time. Feel free to experiment here: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/exec

if we look at the code:

      if (reg.global) {
        while ((results= reg.exec(msgBody)) !== null) {
          txtResults += `Match[${count}]: ${results[0]}\n`;
          count++;
        }
      }

To me it appears nothing increments the while loop.

reg.exec is creating an array off of msgBody. Nothing alters MsgBody.

it is an assignment, results = reg.exec(msgBody), not a boolean

the test itself is did reg.exec(msgBody) assign a value that is !== to results?

{ some code happens }

in the code that happens, nothing happens that might effect what reg.exec will assign next.

infinite loop. txtResults continues to grow until overflow error occurs.

Galantha commented 1 week ago

Like I said, first time I have done something like this :). Your welcome to do whatever you would like with it.

RealRaven2000 commented 1 week ago

To me it appears nothing increments the while loop.

reg.exec is creating an array off of msgBody. Nothing alters MsgBody.

but the Object reg's state changes. According to to the documentation:

JavaScript RegExp objects are stateful when they have the global or sticky flags set (e.g. /foo/g or /foo/y). They store a lastIndex from the previous match. Using this internally, exec() can be used to iterate over multiple matches in a string of text (with capture groups), as opposed to getting just the matching strings with String.prototype.match().

And:

Finding successive matches If your regular expression uses the g flag, you can use the exec() method multiple times to find successive matches in the same string. When you do so, the search starts at the substring of str specified by the regular expression's lastIndex property (test() will also advance the lastIndex property). Note that the lastIndex property will not be reset when searching a different string, it will start its search at its existing lastIndex.

filtaquilla-4.2.1pre8.zip

just tested it, made a new regular expression that append /gi and it lists multiple results. It also explains why it only crashes sometimes.

Galantha commented 1 week ago

Regular expression flags,

I like this page for a list of javascript flags: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions

you already had case insensitive: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/ignoreCase

I added multi-line to my experimental branch: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/multiline

and I would suggest perhaps adding single line also: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/dotAll

Why did I add multi-line? I am used to working with multi-line. It greatly increases performance by limiting the search area to one line at a time. -> if this is running on the gui thread like I suspect it is, it might cause thunderbird to pause, so this may or may not be important

Galantha commented 1 week ago

edit: without the single line flag, I am not sure it is possible to get past \n, but I could be confused. Notice that in the original function the \n is stripped out.

I specifically did not want to do that with the html version of the function.

I did not want to do this: https://github.com/RealRaven2000/FiltaQuilla/blob/a3554926c0f23908f2e17f67fec8135e4209b0ec/content/filtaquilla-util.js#L595

so either s flag or m flag

Galantha commented 1 week ago

One thing I find problematic at the moment is that for every flavor we are adding new menu items to the search dropdown, it gets very unwieldy, I would rather have a single regex body command and then add the flavors as options somehow - it may mean to modify the (as always single) regex parameter to transport additional switches. Maybe something like this:

I agree.

I love your idea.

I have no idea how to do it.

if your doing a gear like that after the expression, perhaps let the user pick from the following flags: i, m, s, ( u or v but not both, or perhaps just u ) depending on how difficult it is to do that.

Galantha commented 1 week ago

Also I believe the new regex match function are all very similar, I will put the common code into a merged util function so we only have the mime body iterator code once in case it needs fixing later.

Feel free to do as you feel best :)

RealRaven2000 commented 1 week ago

Regular expression flags,

I like this page for a list of javascript flags: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions

you already had case insensitive: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/ignoreCase

According to:

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions#advanced_searching_with_flags

this should be already supported - you can add flags in the regex in the filter editor by using the / syntax, so there is no need for a separate function for multiline:

/myexpression/gm

... I just tested it:

Do we need to do anything with the body, e.g. put back line breaks in this case? Haven't really looked at that code in a while (the original is from Kent, I only added the BodyPart stuff)

Galantha commented 1 week ago

lots of text deleted

sorry, it picked up a post I was writing I did not mean to post, sorry about that!

Regular expression flags, I like this page for a list of javascript flags: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions you already had case insensitive: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/ignoreCase

According to:

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions#advanced_searching_with_flags

this should be already supported - you can add flags in the regex in the filter editor by using the / syntax, so there is no need for a separate function for multiline:
/myexpression/gm
... I just tested it:

Do we need to do anything with the body, e.g. put back line breaks in this case? Haven't really looked at that code in a while (the original is from Kent, I only added the BodyPart stuff)

I tried this, and was not able to get it to work. One moment.

Galantha commented 1 week ago

so, in the filtaquilta extension, it does not work for me. There is some code that looks like it checks for that: https://github.com/Galantha/GalExpermentFiltaQuilla/blob/f3b76cc6854d59a4e37173214ad456d46f17a627/content/filtaquilla.js#L2500

But I tested it, and it did not work for me.

Galantha commented 1 week ago

I was specifically having issues with this, so I added ${searchValue} and ${searchFlags} logging.

I did not want to do the extra work of adding the multi-line flag check box.

https://github.com/Galantha/GalExpermentFiltaQuilla/blob/f3b76cc6854d59a4e37173214ad456d46f17a627/content/filtaquilla-util.js#L597

https://github.com/Galantha/GalExpermentFiltaQuilla/blob/f3b76cc6854d59a4e37173214ad456d46f17a627/content/filtaquilla-util.js#L606C101-L606C116

the flags I put into filtaquilta were not being carried through correctly.

I also tried (?:flag) format with no luck either if I remember right.

Galantha commented 1 week ago

https://github.com/RealRaven2000/FiltaQuilla/pull/312#issuecomment-2455194599

sorry about this post, I did not realize the top of the post was somehow a half baked post I had previously started writing. I deleted that part.

I had only intended to quote you and post the bottom line.

RealRaven2000 commented 1 week ago

Not sure what is happening, did you try the version I posted over at https://github.com/RealRaven2000/FiltaQuilla/issues/302#issuecomment-2455015949

filtaquilla-4.2.1pre8.zip

RealRaven2000 commented 1 week ago

For testing the "reflow" code which removes reflowing line breaks, I am currently trying to force isQuotedPrintable() to return true but it's almost impossible with Thunderbird composer to force it to encode stuff in ASCII, instead it always converts anything I insert to UTF-8, even if I try to insert as HTML source. I tried some newsgroup postings from eternal september but they also appear to use UTF-8.

So I guess removing line breaks is more of an exception, lots of real line breaks will be always in the body. These are removed in html messages in line 606:

          if (p.includes("<html")) {
            // remove html the dirty way
            p = p.replace(/(<style[\w\W]+style>)/g, '').replace(/<[^>]+>/g, '').replace(/(\r\n|\r|\n){2,}/g,"").replace(/(\t){2,}/g,"");
          }

so maybe the /m flag isn't really needed at least with HTML parts. I am trying to make this a little more resilient for mails written with other mail clients right now...

Galantha commented 1 week ago

For testing the "reflow" code which removes reflowing line breaks, I am currently trying to force isQuotedPrintable() to return true but it's almost impossible with Thunderbird composer to force it to encode stuff in ASCII, instead it always converts anything I insert to UTF-8, even if I try to insert as HTML source. I tried some newsgroup postings from eternal september but they also appear to use UTF-8.

So I guess removing line breaks is more of an exception, lots of real line breaks will be always in the body. These are removed in html messages in line 606:
          if (p.includes("<html")) {
            // remove html the dirty way
            p = p.replace(/(<style[\w\W]+style>)/g, '').replace(/<[^>]+>/g, '').replace(/(\r\n|\r|\n){2,}/g,"").replace(/(\t){2,}/g,"");
          }
so maybe the /m flag isn't really needed at least with HTML parts. I am trying to make this a little more resilient for mails written with other mail clients right now...

instead of doing that nastiness, I feel a better solution would be to pass the /s flag: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/dotAll perhaps even silently, and document it is being done somewhere.

I feel this is a better, simpler, and less likely to cause issues then mangling the input.

Galantha commented 1 week ago

Not sure what is happening, did you try the version I posted over at #302 (comment)

filtaquilla-4.2.1pre8.zip

I have not. It may be several days before I load the test platform, re-attach the debugger, and look at this project closely again.

Galantha commented 1 week ago

You may want to consider silently passing the /s flag to every regular expression in the extension.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/dotAll

Without it, my understanding is regular expressions are not able to search past line breaks. This might leave people a bit confused as to why their expressions are mysteriously failing to detect properly.

Galantha commented 1 week ago

For testing the "reflow" code which removes reflowing line breaks, I am currently trying to force isQuotedPrintable() to return true but it's almost impossible with Thunderbird composer to force it to encode stuff in ASCII, instead it always converts anything I insert to UTF-8, even if I try to insert as HTML source. I tried some newsgroup postings from eternal september but they also appear to use UTF-8.

So I guess removing line breaks is more of an exception, lots of real line breaks will be always in the body. These are removed in html messages in line 606:
          if (p.includes("<html")) {
            // remove html the dirty way
            p = p.replace(/(<style[\w\W]+style>)/g, '').replace(/<[^>]+>/g, '').replace(/(\r\n|\r|\n){2,}/g,"").replace(/(\t){2,}/g,"");
          }
so maybe the /m flag isn't really needed at least with HTML parts. I am trying to make this a little more resilient for mails written with other mail clients right now...

That is pretty brutal with the composer.

Might resort to sadness to achieve ASCII. https://stackoverflow.com/questions/94037/convert-character-to-ascii-code-in-javascript

let sadString = String.fromCharCode(115) + String.fromCharCode(97) + String.fromCharCode(100);

Galantha commented 1 week ago

or maybe? (new RegExp(/quoted-printable/)).test(contentType)

works for me in the tester:

let contentType = "quoted-printable";
console.log((new RegExp(/quoted-printable/).test(contentType)));
contentType = "asdf";
console.log(new RegExp((/quoted-printable/).test(contentType)));

https://developer.mozilla.org/en-US/play

RealRaven2000 commented 1 week ago

instead of doing that nastiness, I feel a better solution would be to pass the /s flag: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/dotAll perhaps even silently, and document it is being done somewhere.

I feel this is a better, simpler, and less likely to cause issues then mangling the input.

the input we get from Thunderbird 128 is super mangled anyway I am looking into breaking it up into mime parts. in 115 we had MimeParser.extractMimeMsg(..) but they took it away and give only a nasty raw message which I need to reparse and rebuild.

RealRaven2000 commented 5 days ago

Just a heads up, there is a new version at:

https://github.com/RealRaven2000/FiltaQuilla/issues/313#issuecomment-2468935487

ready to test. you can add multiline switches and force the regex to only read the html part if necessary:

use the wheel to open the panel, enter desired settings and click accept.

Galantha commented 4 days ago

Thank you!

I will get to this when I can!

which will hopefully be later today, but I am a bit overloaded a the moment, sorry!

RealRaven2000 / FiltaQuilla

Esr225+gal exp add body regex html filter, add regex multiline flag, bug fixes in body regex filter, altered logging body regex filter, syntax error fixing all over the place mainly semicolons #312