PDConSec / vsc-print

Completely local print support for VS Code with syntax-colouring and line numbering
Other
70 stars 28 forks source link

incorrect coloring for html comment across multiple lines #63

Closed Shu-Ting-Huang closed 3 years ago

Shu-Ting-Huang commented 4 years ago

I am printing an html file in which there are comments for the form

  1. <!--comment-->

  2. /*comment*/

But only the first line of the comment is correctly colored as comment.

PeterWone commented 4 years ago

This is a shortcoming of the syntax colouring library rather than VS Code Printing.

If you can give me a sample that demonstrates the problem I'll be happy to log a bug with the syntax colouring project.

Shu-Ting-Huang commented 4 years ago

Below is an html sample together with the PDF printout that demonstrate the problem.

https://drive.google.com/drive/folders/1SuTVZryAV9MVRucQ3zmbocVW_Od7WVet?usp=sharing

Neohiro79 commented 3 years ago

Just saw this conversation as a matter of interest. Since I cannot access your HTML Sample, this is just a blind guess here, but you might want to check:

This is the HTML Code:

VSC-Code

This would be the output (and the wrong /* HTML-comment */ won't show up colored): Printout-Style


It might solve your request - in that case this is NOT a bug - this is a wrong HTML comment.


Neohiro79 commented 3 years ago

Sorry, my fault, I have overlooked that you meant "over multiple lines", which also triggers the same no-syntax problem as meantioned above (the wrong HTML comment).

I've made a JSFiddle to first see what happens when using "highlight.js" and if this is a bug related to highlight.js:

http://jsfiddle.net/neohiro79/zbfdhxjy/4/


The output there seems fine to me - except the fact that it won't show HTML comments - even when HTML is set as class.

But when used through the plugin, it shows this (which looks like a bug inbetween somewhere):

VSC Input VS Extension Output:

<img src="https://user-images.githubusercontent.com/18111492/99781471-30a10300-2b18-11eb-9d33-bdd81791bd3d.jpg" alt="" width="50%" /><img src="https://user-images.githubusercontent.com/18111492/99781496-35fe4d80-2b18-11eb-8f21-d21e7d8b98d8.jpg" alt="" width="50%" />

PeterWone commented 3 years ago

There are two modes, manual and auto. I use manual whenever VSCode has a definite langId, and for an HTML file that is the case. Manual mode basically means I don't let highlight.js guess, I explicitly tell it "this is HTML"

But that doesn't explain the behaviour you're seeing so I suppose I'll have to spin it up in the debugger and seen exactly what's being sent to highlight.js

Neohiro79 commented 3 years ago

Peter I've also tried to do it with Prism, but I've ran into similar troubles.

I've already opened a ticket:

https://github.com/PrismJS/prism/issues/2651

You can see my tryouts in this JSFiddle here (now it works, but it is a quirky workaroud, since comments in HTML will require a plugin, which in my opinion is not the best way of having to deal with):

http://jsfiddle.net/neohiro79/zbfdhxjy/3/

Again, see my comments in here:

https://github.com/PrismJS/prism/issues/2651

At least the multi-line comments worked for Javascript and CSS.

Neohiro79 commented 3 years ago

Digging into the rabbithole ...

Even this nice looking approach

https://asvd.github.io/microlight/

won't work as expected:

http://jsfiddle.net/neohiro79/zbfdhxjy/5/

Issue has been submitted, but I guess he might not answer, seems like a dead git repo to me:

https://github.com/asvd/microlight/issues/12

... to be continued ...

PeterWone commented 3 years ago

Without wishing to rain on your parade, highlight.js is probably as good as it gets. Stackoverflow just switched to using it.

This is why, despite otherwise being a bit perfectionist about vscode-printing, I think we should let the open-source division of problems thing work for us. Gather your evidence and just log a case with the highlight.js people. We can't do everything and they know their codebase.

Neohiro79 commented 3 years ago

Finally I got some feedback for rainbow.js:

This works PERFECTLY NOW:

http://jsfiddle.net/neohiro79/zbfdhxjy/6/

If you want to include all you can use this command here for a build: gulp build --languages=all

That might also address potential performance issues ... I know you like highlight.js but it might be worth a try ...

Neohiro79 commented 3 years ago

To bundle Craig told me this:

The best bet is using gulp build to create a custom bundle and passing whatever languages you want in a list. You can also use the bundler from the rainbow website, but it is using a slightly out of date version of rainbow (long story) http://rainbowco.de

Neohiro79 commented 3 years ago

So Peter, as thought, don't expect 'help' from the guys from highlight.js, as seen in the comment here:

https://github.com/highlightjs/highlight.js/issues/2884

It seems to me they have the very same attitude than we're well aware of the superior VSC forum feedback in our case.

Since I've made that research for various reasons so comprehensive, I'll personally stick with rainbow.js since it not only seems the most reliable code syntax highlighter on the market right now, it also has the nicest and most helpful developer feedback I've encountered so far!

Hope you can find a solution which finally works - give it a try!

Neohiro79 commented 3 years ago

The answer came promptly after telling them that it obviously works in rainbow.js, so this is how it "only-works" in highlight.js:

https://jsfiddle.net/neohiro79/zbfdhxjy/12/

That said, I still prefer the rainbow.js solution, since you do not need to "mask" anything to make it work, since I don't get the whole "I am concerned about security issue" in this specific regard:

https://github.com/ccampbell/rainbow/issues/249


@EDIT: since I've had a really good chat with josh about this whole security issue, I see the problem now which might occur when you're using unmasked code.


joshgoebel commented 3 years ago

So Peter, as thought, don't expect 'help' from the guys from highlight.js, as seen in the comment here:

@Neohiro79 FYI: This is the attitude I picked up on that very slightly miffed me. :-)

Just to explain here we do not highlight unescaped HTML code because it is a HUGE security risk and a very bad idea. This is a major feature, not a bug. If you want HTML code to be treated as "text" and highlighted, it must be properly escaped. Otherwise google "html injection", "javascript injection", etc...

IE, correct:

<code>
&lt;code&gt;this is a code tag to be highlithed&lt;/code&gt;
</code>

Incorrect:

<code>
<code>hey code here!!!</code>
</code>

I personally believe Rainbow is buggy in this regard or at the very least encouraging/enabling people to make some very poor security choices. I've explained this as clearly as I can to the author and we'll see what they decide. :-) With a very small change to the library they could close this security hole and prevent people from shooting themselves in their own foot.

joshgoebel commented 3 years ago

@PeterWone I'm not sure if this affects your plugin or not... if you are asking us [Highlight.js] to highlight code inside a block you need that code to be properly HTML escaped (which should be pretty trivial to do with JS). The block should not include RAW unescaped HTML - which we are NOT going to highlight.

Neohiro79 commented 3 years ago

@josh :-) big hug for all your effort !!!

@Neohiro79 FYI: This is the attitude I picked up on that very slightly miffed me. :-)

Yeah, I got the point, that's what I thought ... but you also opened up for us 'beginners' now, which makes me happy :-)

PeterWone commented 3 years ago

So in essence if I escape HTML the issue vanishes and we could consider using your library.

Bring back Windows Phone


From: Josh Goebel notifications@github.com Sent: Sunday, November 22, 2020 4:08:48 PM To: PeterWone/vsc-print vsc-print@noreply.github.com Cc: Peter Wone peter.wone@outlook.com; Mention mention@noreply.github.com Subject: Re: [PeterWone/vsc-print] incorrect coloring for html comment across multiple lines (#63)

@PeterWonehttps://github.com/PeterWone I'm not sure if this affects your plugin or not... if you are asking us to highlight code inside a block you need that code to be properly HTML escaped (which should be pretty trivial to do with JS). The block should not include RAW unescaped HTML - which we are not going to highlight.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/PeterWone/vsc-print/issues/63#issuecomment-731704232, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABJ6QOCPIAOYZJPDIFCY52DSRCTHBANCNFSM4QPKYFVA.

joshgoebel commented 3 years ago

That is my current understanding, yes. Highlight.js (currently) preserves HTML - it's passed thru silently - while the actual TEXT content is highlighted... and the HTML remains, doing whatever it would do... this is legacy behavior and in the future we'll likely drop it completely but we can't do that until version 11. If you want all the content to be treated as text (and highlighted) then just escape it.

There is no reason for your block to include anything other than text (ie, properly escaped HTML which is "text" to the browser/runtime) unless you purposely WANT to use HTML to do something:

<code>
<span style="color:red !important;">var x = 5;</span>
var y = 9;
<code>

For example you can do tricks like this... and we'll highlight the text but preserve the HTML so line 1 will always be RED. Some people use this to highlight/emphasize parts of their code, etc...

Neohiro79 commented 3 years ago

Oh, that explains why some of them won't show the html tags but the content, they used innerText only to get the markup of that specific container right?

joshgoebel commented 3 years ago

Oh, that explains why some of them won't show the html tags but the content, they used innerText only to get the markup of that specific container right?

Yes, or textContent. (which is honestly the correct behavior IMHO - vs innerHTML)

Neohiro79 commented 3 years ago

So the issue with rainbow.js is that they use innerHTML instead of textContent to "get all the code" they want to highlight and inject it back into the given container - so basically if the HTML is malicious in first place it will keep beeing malicious with the additional highlighting, correct?

And the fact that they change the HTML during the process won't change the fact they read from innerHTML the whole code ... ?

Because in that case it makes sense to me to definately warn the user once any input of code is not masked properly in a very clear way (and not in the console) that you can see the mistake and directly put a link to one of those converters to finally get the result someone might expect, just as you proposed ... just my two cents...

joshgoebel commented 3 years ago

So the issue with rainbow.js is that they use innerHTML instead of textContent to "get all the code"

Yes. You want "the text content in a code block" not "all the HTML code in a code block".

so basically if the HTML is malicious in first place it will keep beeing malicious with the additional highlighting, correct?

Well, no.. :-) It's quite likely Rainbow would actually "fix it"... by turning it into harmless highlighted code (albeit in weird ways)... but it's too late... the HTML is loaded BEFORE Rainbow sees it. The JS runs BEFORE Rainbow sees it. Plus you can't guarantee Rainbow will run anyways - perhaps the server (or internet) has a glitch and your page loads without Rainbow.

<pre><code>
<script>
// this code runs BEFORE Rainbow
// this code could even hack Rainbow and prevent it from loading if it wanted to
</script>
</code></pre>
Neohiro79 commented 3 years ago

Yeah, got it.

But this also means (to me) that weather Rainbow.js nor Highlight.js can ever prevent anyone from injecting "bad javascript" (of course) which will anyway beeing executed BEFORE any of those methods through the browser engine, weather innerHTML nor textContent nor any masking techniques beeing applied by any javascript library afterwards, will ever come into play.

Meaning that the main problem I still see is that it can only "warn" a user AFTER potential malicious code was injected already, or in other terms, it is much more important to warn users BEFOREHAND that they shall never use untested or unknown php code which might be used in a contact form or a comment form of any plugin since they do not know weather or not this code is beeing properly masked BEFORE it even hits the html page - which in turn will execute the script BEFORE EVEN ANY highlighter script can detect something via innerHTML or innerText - or am I wrong?

joshgoebel commented 3 years ago

@Neohiro79 Right, we can't STOP an injection attack like this "in the wild"... but if someone is building their project and testing things out and they do this accidentally we can yell at them very loudly then - so they fix it early - hopefully before their code every gets out "in the wild"... or even if it's in the wild we can still warm them if they are using raw HTML even innocently. Perhaps they can fix it before they are exploited maliciously.

And we can use textContent (not innerHTML) so people never even get the idea of using unescaped HTML.

Neohiro79 commented 3 years ago

Yeah, got you.

Neohiro79 commented 3 years ago

@josh I'll work on a html warning page and provide it to you once I finished which can be used to be injected into the document or as a popup for all to be aware of once they use HTML code in plain form - since I really think this is a problem since not everyone is aware of this shit.

You can sent me the text you might think or come up with, I'll see what I can do, to make it strikingly clear for others.

joshgoebel commented 3 years ago

We don't need anything from you. We have an issue to track this, but we can't do anything about it until 2021 anyways - because we can't break the existing behavior in v10 (which passes HTML thru). If you wanted to drop some text on that issue you could, but we'll likely think about this carefully when we get around to doing it. So you can save the time.

Neohiro79 commented 3 years ago

Sure, if there's no help needed I won't help of course. It was just a proposal with a good intension after all.

PeterWone commented 3 years ago

OK so looking at the code (it's been a while) I have a function getSourceCode() which ultimately either loads the content of a file or grabs the buffer of the active editor. Two strings come back as an array, some source code and a type. It could contain anthing and it could be toxic.

I give highlight.js a crack at it

  let sourceCode = await getSourceCode();
  let renderedCode = "";
  try {
    renderedCode = hljs.highlight(sourceCode[0], sourceCode[1]).value;
  }
  catch (err) {
    renderedCode = hljs.highlightAuto(sourceCode[1]).value;
  }

and as you can see if I cannot map the language to something highlight.js can digest then I let highlight.js off the leash and let it make up its own mind.

This does have the vulnerability we discussed, both for accidentally toxic code and for handling the work of others.

@joshgoebel do I understand correctly that for HTML you want me to escape all angle brackets to to &lt; and &gt; before passing the HTML to highlight.js?

I can't do this for unsaved editor buffers because I don't know what they are, these will always be processed in auto mode.

joshgoebel commented 3 years ago

This does have the vulnerability we discussed, both for accidentally toxic code and for handling the work of others.

I was never asking you to do anything magic, just upgrade to v10, which fixes many of the issues. :-) If our library still has regex issues, then yeah it's possible you'll find them - but that's on us, not you. :-)

do I understand correctly that for HTML you want me to escape all angle brackets to to < and > before passing the HTML to highlight.js?

That's only if your HTML is already on HTML page inside a block, etc... if you're using the API functions directly (highlight, highlightAuto, etc) then you pass them regular text/code whatever... there is zero need to do any type of escaping inside JS. And we escape the output.

Escaping matters if you are using highlightBlock.

PeterWone commented 3 years ago

@joshgoebel Well then... v10 here we come.

PeterWone commented 3 years ago

Well, regression tests all pass. I'm pushing out 0.8.3

Neohiro79 commented 3 years ago

Thanks again @Josh (whoever you are :-))

PeterWone commented 3 years ago

@Neohiro79 0.8.3 should appear soon. Let me know if you encounter any glitches.

Neohiro79 commented 3 years ago

Seems to still have some highlight-troubles when it comes to multiline comments :-( still-problems-with-multiline-comments

PeterWone commented 3 years ago

It's not the end of the world. We can look into rainbow etc but I don't want to lose the auto detect capability because I need it for unsaved files

joshgoebel commented 3 years ago

What trouble? I’m pretty sure that we highlight regular HTML code including comments very well.

Neohiro79 commented 3 years ago

@joshgoebel It looks like the multiline comments won't get highlighted properly, the first part is green <!-- the last part is grey --> as far as I can see in the screenshot.

PeterWone commented 3 years ago

As well as being a different colour, the first line of the multiline comment in Neo's screen grab is italicised whereas subsequent lines of the comment are not italic.

It's this eye for detail that makes Neo one of my most valued contributors. I might never have noticed on my own.

Neohiro79 commented 3 years ago

Yeah, you're right @peter, even me didn't saw that one coming ;-)

... indeed it's italic, the first <-- three chars

Neohiro79 commented 3 years ago

Thanx for the compliment by the way, I give my best to annoy everyone, including myself from time to time :-)

joshgoebel commented 3 years ago
Screen Shot 2020-11-22 at 2 27 49 PM

Pretty sure that's not our bug, we know how to highlight basic XML pretty well. :)

joshgoebel commented 3 years ago

Perhaps you're using auto detect and not the XML highlighter?

Neohiro79 commented 3 years ago

Here is an example of XML highlighter usage:

http://jsfiddle.net/neohiro79/zbfdhxjy/31/

Neohiro79 commented 3 years ago

@josh Is it correct to use the vbscript-html.min.js include for proper html highlighting or is there another one - because that's the only one I've found on the CDN ... ?

Neohiro79 commented 3 years ago

You can also control the size precisely if you throw out a few large languages that no one is probably using. Using all 190 is probably overkill.

Also, which subset of precompiled language scripts would you recommend out of all 190 different ones?

joshgoebel commented 3 years ago

@josh Is it correct to use the vbscript-html.min.js include for proper html highlighting or is there another one - because that's the only one I've found on the CDN ... ?

html (an alias of xml)... But using vbscript-html wouldn't mess up like what you are showing. It just uses XML internally plus allows for VBScript tags (if they are around).

Also, which subset of precompiled language scripts would you recommend out of all 190 different ones?

We have a "common" set, but now way to get at that easily via Node.js (that should probably be added). If you really want to know you could download the web version and grep it for "registerLanguage" and you'd see the list. But there are lots of things common doesn't include... if you're going to make a sublist you should probably eyeball it.

If you're using the web version you get the common set by default.

joshgoebel commented 3 years ago

Here is an example of XML highlighter usage:

Yes looks perfect to me, unless I'm missing something?

Neohiro79 commented 3 years ago

We have a "common" set, but now way to get at that easily via Node.js (that should probably be added). If you really want to know you could download the web version and grep it for "registerLanguage" and you'd see the list. But there are lots of things common doesn't include... if you're going to make a sublist you should probably eyeball it.

@ josh So you mean we have to basically cherrypick the languages from the github source

https://github.com/highlightjs/highlight.js/tree/master/src/languages

and then bundle them together to an individual VSC-build for the most used languages in this very editor?

That's the source I've found and used:

https://cdnjs.com/libraries/highlight.js

Neohiro79 commented 3 years ago

By the way html is not in this repo, only htmlbars.js and the previous mentioned xml which would be the correct one to use for html, css and javascript in this case then?


html-lib


From:

https://github.com/highlightjs/highlight.js/tree/master/src/languages

PeterWone commented 3 years ago

That's why I ship all of it and use the vs code language identifier with a fallback to auto.

It is possible for extensions to register new language identifiers and there is no reason to expect them to be consistent with highlight.js but auto means that oddball things like Swift and Kotlin work.

@joshgoebel indulge my curiosity, do you support APL?