Closed Shu-Ting-Huang closed 3 years ago
This is a shortcoming of the syntax colouring library rather than VS Code Printing.
If you can give me a sample that demonstrates the problem I'll be happy to log a bug with the syntax colouring project.
Below is an html sample together with the PDF printout that demonstrate the problem.
https://drive.google.com/drive/folders/1SuTVZryAV9MVRucQ3zmbocVW_Od7WVet?usp=sharing
Just saw this conversation as a matter of interest. Since I cannot access your HTML Sample, this is just a blind guess here, but you might want to check:
This is the HTML Code:
This would be the output (and the wrong /* HTML-comment */ won't show up colored):
It might solve your request - in that case this is NOT a bug - this is a wrong HTML comment.
Sorry, my fault, I have overlooked that you meant "over multiple lines", which also triggers the same no-syntax problem as meantioned above (the wrong HTML comment).
I've made a JSFiddle to first see what happens when using "highlight.js" and if this is a bug related to highlight.js:
The output there seems fine to me - except the fact that it won't show HTML comments - even when HTML is set as class.
But when used through the plugin, it shows this (which looks like a bug inbetween somewhere):
VSC Input VS Extension Output:
<img src="https://user-images.githubusercontent.com/18111492/99781471-30a10300-2b18-11eb-9d33-bdd81791bd3d.jpg" alt="" width="50%" /><img src="https://user-images.githubusercontent.com/18111492/99781496-35fe4d80-2b18-11eb-8f21-d21e7d8b98d8.jpg" alt="" width="50%" />
There are two modes, manual and auto. I use manual whenever VSCode has a definite langId, and for an HTML file that is the case. Manual mode basically means I don't let highlight.js guess, I explicitly tell it "this is HTML"
But that doesn't explain the behaviour you're seeing so I suppose I'll have to spin it up in the debugger and seen exactly what's being sent to highlight.js
Peter I've also tried to do it with Prism, but I've ran into similar troubles.
I've already opened a ticket:
You can see my tryouts in this JSFiddle here (now it works, but it is a quirky workaroud, since comments in HTML will require a plugin, which in my opinion is not the best way of having to deal with):
Again, see my comments in here:
At least the multi-line comments worked for Javascript and CSS.
Digging into the rabbithole ...
Even this nice looking approach
won't work as expected:
Issue has been submitted, but I guess he might not answer, seems like a dead git repo to me:
... to be continued ...
Without wishing to rain on your parade, highlight.js is probably as good as it gets. Stackoverflow just switched to using it.
This is why, despite otherwise being a bit perfectionist about vscode-printing, I think we should let the open-source division of problems thing work for us. Gather your evidence and just log a case with the highlight.js people. We can't do everything and they know their codebase.
Finally I got some feedback for rainbow.js:
This works PERFECTLY NOW:
If you want to include all you can use this command here for a build:
gulp build --languages=all
That might also address potential performance issues ... I know you like highlight.js but it might be worth a try ...
To bundle Craig told me this:
The best bet is using gulp build to create a custom bundle and passing whatever languages you want in a list. You can also use the bundler from the rainbow website, but it is using a slightly out of date version of rainbow (long story) http://rainbowco.de
So Peter, as thought, don't expect 'help' from the guys from highlight.js, as seen in the comment here:
https://github.com/highlightjs/highlight.js/issues/2884
It seems to me they have the very same attitude than we're well aware of the superior VSC forum feedback in our case.
Since I've made that research for various reasons so comprehensive, I'll personally stick with rainbow.js since it not only seems the most reliable code syntax highlighter on the market right now, it also has the nicest and most helpful developer feedback I've encountered so far!
Hope you can find a solution which finally works - give it a try!
The answer came promptly after telling them that it obviously works in rainbow.js, so this is how it "only-works" in highlight.js:
That said, I still prefer the rainbow.js solution, since you do not need to "mask" anything to make it work, since I don't get the whole "I am concerned about security issue" in this specific regard:
@EDIT: since I've had a really good chat with josh about this whole security issue, I see the problem now which might occur when you're using unmasked code.
So Peter, as thought, don't expect 'help' from the guys from highlight.js, as seen in the comment here:
@Neohiro79 FYI: This is the attitude I picked up on that very slightly miffed me. :-)
Just to explain here we do not highlight unescaped HTML code because it is a HUGE security risk and a very bad idea. This is a major feature, not a bug. If you want HTML code to be treated as "text" and highlighted, it must be properly escaped. Otherwise google "html injection", "javascript injection", etc...
IE, correct:
<code>
<code>this is a code tag to be highlithed</code>
</code>
Incorrect:
<code>
<code>hey code here!!!</code>
</code>
I personally believe Rainbow is buggy in this regard or at the very least encouraging/enabling people to make some very poor security choices. I've explained this as clearly as I can to the author and we'll see what they decide. :-) With a very small change to the library they could close this security hole and prevent people from shooting themselves in their own foot.
@PeterWone I'm not sure if this affects your plugin or not... if you are asking us [Highlight.js] to highlight code inside a block you need that code to be properly HTML escaped (which should be pretty trivial to do with JS). The block should not include RAW unescaped HTML - which we are NOT going to highlight.
@josh :-) big hug for all your effort !!!
@Neohiro79 FYI: This is the attitude I picked up on that very slightly miffed me. :-)
Yeah, I got the point, that's what I thought ... but you also opened up for us 'beginners' now, which makes me happy :-)
So in essence if I escape HTML the issue vanishes and we could consider using your library.
Bring back Windows Phone
From: Josh Goebel notifications@github.com Sent: Sunday, November 22, 2020 4:08:48 PM To: PeterWone/vsc-print vsc-print@noreply.github.com Cc: Peter Wone peter.wone@outlook.com; Mention mention@noreply.github.com Subject: Re: [PeterWone/vsc-print] incorrect coloring for html comment across multiple lines (#63)
@PeterWonehttps://github.com/PeterWone I'm not sure if this affects your plugin or not... if you are asking us to highlight code inside a block you need that code to be properly HTML escaped (which should be pretty trivial to do with JS). The block should not include RAW unescaped HTML - which we are not going to highlight.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/PeterWone/vsc-print/issues/63#issuecomment-731704232, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABJ6QOCPIAOYZJPDIFCY52DSRCTHBANCNFSM4QPKYFVA.
That is my current understanding, yes. Highlight.js (currently) preserves HTML - it's passed thru silently - while the actual TEXT content is highlighted... and the HTML remains, doing whatever it would do... this is legacy behavior and in the future we'll likely drop it completely but we can't do that until version 11. If you want all the content to be treated as text (and highlighted) then just escape it.
There is no reason for your block to include anything other than text (ie, properly escaped HTML which is "text" to the browser/runtime) unless you purposely WANT to use HTML to do something:
<code>
<span style="color:red !important;">var x = 5;</span>
var y = 9;
<code>
For example you can do tricks like this... and we'll highlight the text but preserve the HTML so line 1 will always be RED. Some people use this to highlight/emphasize parts of their code, etc...
Oh, that explains why some of them won't show the html
tags but the content, they used innerText only to get the markup of that specific container right?
Oh, that explains why some of them won't show the html tags but the content, they used innerText only to get the markup of that specific container right?
Yes, or textContent
. (which is honestly the correct behavior IMHO - vs innerHTML
)
So the issue with rainbow.js is that they use innerHTML
instead of textContent
to "get all the code" they want to highlight and inject it back into the given container - so basically if the HTML is malicious in first place it will keep beeing malicious with the additional highlighting, correct?
And the fact that they change the HTML during the process won't change the fact they read from innerHTML
the whole code ... ?
Because in that case it makes sense to me to definately warn the user once any input of code is not masked properly in a very clear way (and not in the console) that you can see the mistake and directly put a link to one of those converters to finally get the result someone might expect, just as you proposed ... just my two cents...
So the issue with rainbow.js is that they use innerHTML instead of textContent to "get all the code"
Yes. You want "the text content in a code block" not "all the HTML code in a code block".
so basically if the HTML is malicious in first place it will keep beeing malicious with the additional highlighting, correct?
Well, no.. :-) It's quite likely Rainbow would actually "fix it"... by turning it into harmless highlighted code (albeit in weird ways)... but it's too late... the HTML is loaded BEFORE Rainbow sees it. The JS runs BEFORE Rainbow sees it. Plus you can't guarantee Rainbow will run anyways - perhaps the server (or internet) has a glitch and your page loads without Rainbow.
<pre><code>
<script>
// this code runs BEFORE Rainbow
// this code could even hack Rainbow and prevent it from loading if it wanted to
</script>
</code></pre>
Yeah, got it.
But this also means (to me) that weather Rainbow.js nor Highlight.js can ever prevent anyone from injecting "bad javascript" (of course) which will anyway beeing executed BEFORE any of those methods through the browser engine, weather innerHTML nor textContent nor any masking techniques beeing applied by any javascript library afterwards, will ever come into play.
Meaning that the main problem I still see is that it can only "warn" a user AFTER potential malicious code was injected already, or in other terms, it is much more important to warn users BEFOREHAND that they shall never use untested or unknown php code which might be used in a contact form or a comment form of any plugin since they do not know weather or not this code is beeing properly masked BEFORE it even hits the html page - which in turn will execute the script BEFORE EVEN ANY highlighter script can detect something via innerHTML or innerText - or am I wrong?
@Neohiro79 Right, we can't STOP an injection attack like this "in the wild"... but if someone is building their project and testing things out and they do this accidentally we can yell at them very loudly then - so they fix it early - hopefully before their code every gets out "in the wild"... or even if it's in the wild we can still warm them if they are using raw HTML even innocently. Perhaps they can fix it before they are exploited maliciously.
And we can use textContent
(not innerHTML) so people never even get the idea of using unescaped HTML.
Yeah, got you.
@josh I'll work on a html warning page and provide it to you once I finished which can be used to be injected into the document or as a popup for all to be aware of once they use HTML code in plain form - since I really think this is a problem since not everyone is aware of this shit.
You can sent me the text you might think or come up with, I'll see what I can do, to make it strikingly clear for others.
We don't need anything from you. We have an issue to track this, but we can't do anything about it until 2021 anyways - because we can't break the existing behavior in v10 (which passes HTML thru). If you wanted to drop some text on that issue you could, but we'll likely think about this carefully when we get around to doing it. So you can save the time.
Sure, if there's no help needed I won't help of course. It was just a proposal with a good intension after all.
OK so looking at the code (it's been a while) I have a function getSourceCode()
which ultimately either loads the content of a file or grabs the buffer of the active editor. Two strings come back as an array, some source code and a type. It could contain anthing and it could be toxic.
I give highlight.js
a crack at it
let sourceCode = await getSourceCode();
let renderedCode = "";
try {
renderedCode = hljs.highlight(sourceCode[0], sourceCode[1]).value;
}
catch (err) {
renderedCode = hljs.highlightAuto(sourceCode[1]).value;
}
and as you can see if I cannot map the language to something highlight.js
can digest then I let highlight.js
off the leash and let it make up its own mind.
This does have the vulnerability we discussed, both for accidentally toxic code and for handling the work of others.
@joshgoebel do I understand correctly that for HTML you want me to escape all angle brackets to to <
and >
before passing the HTML to highlight.js
?
I can't do this for unsaved editor buffers because I don't know what they are, these will always be processed in auto mode.
This does have the vulnerability we discussed, both for accidentally toxic code and for handling the work of others.
I was never asking you to do anything magic, just upgrade to v10, which fixes many of the issues. :-) If our library still has regex issues, then yeah it's possible you'll find them - but that's on us, not you. :-)
do I understand correctly that for HTML you want me to escape all angle brackets to to < and > before passing the HTML to highlight.js?
That's only if your HTML is already on HTML page inside a block, etc... if you're using the API functions directly (highlight, highlightAuto, etc) then you pass them regular text/code whatever... there is zero need to do any type of escaping inside JS. And we escape the output.
Escaping matters if you are using highlightBlock
.
@joshgoebel Well then... v10 here we come.
Well, regression tests all pass. I'm pushing out 0.8.3
Thanks again @Josh (whoever you are :-))
@Neohiro79 0.8.3 should appear soon. Let me know if you encounter any glitches.
Seems to still have some highlight-troubles
when it comes to multiline comments :-(
It's not the end of the world. We can look into rainbow etc but I don't want to lose the auto detect capability because I need it for unsaved files
What trouble? I’m pretty sure that we highlight regular HTML code including comments very well.
@joshgoebel It looks like the multiline comments won't get highlighted properly,
the first part is green <!--
the last part is grey -->
as far as I can see in the screenshot.
As well as being a different colour, the first line of the multiline comment in Neo's screen grab is italicised whereas subsequent lines of the comment are not italic.
It's this eye for detail that makes Neo one of my most valued contributors. I might never have noticed on my own.
Yeah, you're right @peter, even me didn't saw that one coming ;-)
... indeed it's italic, the first <--
three chars
Thanx for the compliment by the way, I give my best to annoy everyone, including myself from time to time :-)
Pretty sure that's not our bug, we know how to highlight basic XML pretty well. :)
Perhaps you're using auto detect and not the XML highlighter?
Here is an example of XML highlighter usage:
@josh Is it correct to use the vbscript-html.min.js
include for proper html highlighting or is there another one - because that's the only one I've found on the CDN ... ?
You can also control the size precisely if you throw out a few large languages that no one is probably using. Using all 190 is probably overkill.
Also, which subset of precompiled language scripts would you recommend out of all 190 different ones?
@josh Is it correct to use the vbscript-html.min.js include for proper html highlighting or is there another one - because that's the only one I've found on the CDN ... ?
html
(an alias of xml
)... But using vbscript-html
wouldn't mess up like what you are showing. It just uses XML internally plus allows for VBScript tags (if they are around).
Also, which subset of precompiled language scripts would you recommend out of all 190 different ones?
We have a "common" set, but now way to get at that easily via Node.js (that should probably be added). If you really want to know you could download the web version and grep it for "registerLanguage" and you'd see the list. But there are lots of things common doesn't include... if you're going to make a sublist you should probably eyeball it.
If you're using the web version you get the common set by default.
Here is an example of XML highlighter usage:
Yes looks perfect to me, unless I'm missing something?
We have a "common" set, but now way to get at that easily via Node.js (that should probably be added). If you really want to know you could download the web version and grep it for "registerLanguage" and you'd see the list. But there are lots of things common doesn't include... if you're going to make a sublist you should probably eyeball it.
@ josh So you mean we have to basically cherrypick the languages from the github source
https://github.com/highlightjs/highlight.js/tree/master/src/languages
and then bundle them together to an individual VSC-build
for the most used languages in this very editor?
That's the source I've found and used:
By the way html
is not in this repo, only htmlbars.js
and the previous mentioned xml
which would be the correct one to use for html, css and javascript in this case then?
From:
https://github.com/highlightjs/highlight.js/tree/master/src/languages
That's why I ship all of it and use the vs code language identifier with a fallback to auto.
It is possible for extensions to register new language identifiers and there is no reason to expect them to be consistent with highlight.js but auto means that oddball things like Swift and Kotlin work.
@joshgoebel indulge my curiosity, do you support APL?
I am printing an html file in which there are comments for the form
<!--comment-->
/*comment*/
But only the first line of the comment is correctly colored as comment.