deathau / markdownload

A Firefox and Google Chrome extension to clip websites and download them into a readable markdown file.
Apache License 2.0
2.9k stars 225 forks

new version makes Medium code blocks unreadable #272

Open Smitty010 opened 11 months ago

Smitty010 commented 11 months ago

It looks like there is a new version. I think there was an attempt to fix the issue with code blocks not rendering properly. However, I would say that the new solution is worse than what it used to do. Here's an example. Look at https://medium.com/mlearning-ai/powerhouse-in-your-pocket-how-tiny-llms-are-redefining-the-ai-landscape-fdf17718bc79.

Here's the first code block as it appears in the story: (image)

Here's how it appears in the markdown: (image)

I suspect that the point was to remove the surrounding "```" and let the markdown editor render it. Notice that you lost any comments.

The other, bigger problem is that you lose whitespace. For example, from the same story, this: (image)

gets turned into this (here I've removed the block markers to let it render): (image)

Without the CSS of the file, all of the lines start in column 0. Given the importance of whitespace in Python, that's not very helpful.

I'm sure this can't be an easy problem to solve or it would have happened already. I preferred the old rendering as we didn't have all of the html tags.

deathau commented 11 months ago

Hmm... There were some fixes for issues with rendering HTML in code blocks, and it looks like they've had some side effects. I'll look into it.

deathau commented 11 months ago

It looks like Medium code blocks are just terrible... Even without syntax highlighting applied, they're wrapped in a <pre><span> (no <code> in sight) and use <br> for newlines (therefore negating the whole point of even using a <pre>). Also, everything has random bundle-generated class names. In short, it's just horrendous in terms of machine readability. I couldn't follow the original example, because Medium wants me to pay, but I did find a simple example here: https://blog.medium.com/code-blocks-with-syntax-highlighting-53343df53c4f in which this code block (image) looks like this in the source:

<pre class="mt mu mv mw mx my mz ni bo nj ba bj"><span id="e759" class="nk ne ew mz b bf nl nm l nn nh" data-selectable-paragraph=""><span class="hljs-comment">// highlighted code is easier to read</span><br><span class="hljs-keyword">function</span> <span class="hljs-title.function">newCodeBlock</span>() {<br>  <span class="hljs-keyword">return</span> “jazzy!”;<br>}</span></pre>

I can see from the classes that it's using highlight.js, but I can't really find anything to help with this case.
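
One way to handle markup shaped like the sample above is to turn each <br> back into a newline, drop the remaining tags, and decode the entities before fencing the result. What follows is only a minimal sketch under those assumptions: preHtmlToFencedCode is a hypothetical helper, not a MarkDownload or highlight.js function, and the regex-based tag stripping is a simplification that a real implementation would probably replace with DOM parsing.

```javascript
// Hypothetical helper (not part of MarkDownload): converts the HTML of a
// Medium-style <pre> block into a fenced Markdown code block.
const FENCE = "`".repeat(3);

const ENTITIES = {
  "&lt;": "<",
  "&gt;": ">",
  "&quot;": '"',
  "&#39;": "'",
  "&amp;": "&",
};

function preHtmlToFencedCode(preHtml, language = "") {
  const text = preHtml
    // Medium uses <br> instead of real newlines inside <pre>.
    .replace(/<br\s*\/?>/gi, "\n")
    // Drop <pre>, <span> and any other remaining tags, keeping their text.
    .replace(/<[^>]+>/g, "")
    // Decode entities in a single pass so "&amp;lt;" becomes "&lt;", not "<".
    .replace(/&(?:lt|gt|quot|#39|amp);/g, (entity) => ENTITIES[entity])
    // Trim trailing blank lines; leading indentation is left untouched.
    .replace(/\n+$/, "");
  return FENCE + language + "\n" + text + "\n" + FENCE;
}
```

Because the text is never re-indented or trimmed on the left, comments and leading whitespace survive, which is the part that matters for Python snippets.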

Smitty010 commented 11 months ago

It's too bad there isn't a way to effectively highlight the text, copy it and then do a paste. That's how I fix the problems on medium articles (highlight the code block and copy it to markdown). I don't care about the code highlighting in the article. I can just mark the block with the code type and Obsidian does its own highlighting.

I've thought about possible solutions and I can see it's likely a difficult problem.

Sadly, medium is one of my primary sources and so I see the code block problem a lot since a majority of what I clip are programming articles. Wish I could say it wasn't a big deal, but, for me, it is.

In reply to Gordon Pedersen's comment of Sun, Dec 10, 2023 at 4:29 PM (quoted above).

Smitty010 commented 11 months ago

I took the example I gave above and simply pasted the entire article into an empty markdown file in Obsidian. Here's what I got for the code block (raw text; note that there were no code block marks ``` around it):

# THIS IS FOR THE BACKBONE ONLY  
with gr.Blocks(theme='ParityError/Interstellar') as demo:   
    #TITLE SECTION  
    with gr.Row():  
        with gr.Column(scale=12):  
            gr.HTML  #TITLE  
            gr.Markdown #TEXT and DESRIPTION  
        gr.Image #your image LOGO  
   # chat and parameters settings  
    with gr.Row():  
        with gr.Column(scale=4):  #CHATBOT AND USER INPUT  
            chatbot = gr.Chatbot #Chatbot box  
            with gr.Row(): # ROW for user input and button  
                with gr.Column(scale=14):  
                    msg = gr.Textbox  # User input  
                submitBtn = gr.Button #Submit button  

        with gr.Column(min_width=50,scale=1): #PARAMETERS SECTION  
                with gr.Tab(label="Parameter Setting"):  
                    gr.Markdown("# Parameters")  
                    top_p = gr.Slider  
                    temperature = gr.Slider  
                    max_length_tokens = gr.Slider  
                    rep_pen = gr.Slider  

                clear = gr.Button("🗑️ Clear All Messages", variant='secondary')  

    # HERE we have to create the fucntions to be called by the Push Buttons  
    def user(user_message, history):  

    def bot(history,t,p,m,r):  

    # Clicking the submitBtn will call the generation with Parameters in the slides  
    submitBtn.click #all parameters here  
    clear.click #actions to clear all  

#MAIN CALL SECTION      
demo.queue()  #required to yield the streams from the text generation  
demo.launch(inbrowser=True)

which renders as: (image)

Still a bit wonky (comments sometimes render as headers or lose the #), but acceptable. Even better, I was also able to select it, type three backticks, and get the correct code block (with comments, indents, etc.). I then marked the block as Python and got all of the highlighting. I could live with having to mark the code blocks if the text was there.

I have no idea where all of the HTML-to-text transformations take place (in Chrome, or in the Obsidian paste). I'm not a browser/UI guy.

As I said, probably not an easy fix. Sorry

deathau commented 11 months ago

Thanks for looking into this. From my understanding, Obsidian does the conversion on paste. It seems to be completely ignoring the <pre> tag and therefore not treating it as a code block at all (but it's still keeping the spacing — interesting)

This extension is a little bit different as it runs the document through a "readability" engine first, which strips out unnecessary stuff like headers and footers, but also strips some of the classes and styling before it converts to Markdown.

To that end, if you select all of the text inside of a code block in Medium, then right-click -> Markdownload -> Copy selection as markdown, it seems to behave similarly to Obsidian. But if you select the code block itself (like would happen if you also select text around it), it keeps all the HTML.

I'll look into it further (and if anyone else has any ideas, let me know)

WayneDing commented 11 months ago

Mark

Smitty010 commented 10 months ago

I generated the following Templater template:

<%*
const curFile = await app.workspace.activeLeaf.view.file;  // get current file
let contents = await app.vault.read(curFile); // get file contents

// Strip the leftover HTML and decode entities. &amp; is decoded last so that
// text such as "&amp;lt;" becomes "&lt;" rather than "<".
contents = contents
            .replaceAll(/<span.*?>/g, "")
            .replaceAll("</span>", "")
            .replaceAll("<br>", "\n")
            .replaceAll("<strong>", "**")
            .replaceAll("</strong>", "**")
            .replaceAll("<em>", "*")
            .replaceAll("</em>", "*")
            .replaceAll("&lt;", "<")
            .replaceAll("&gt;", ">")
            .replaceAll("&amp;", "&");

await app.vault.modify(curFile, contents); // replace content with new content
-%>

This seems to clean up a lot of the blocks. It obviously has some limitations:

  1. It's pretty stupid in that it doesn't really look for blocks but simply replaces some embedded HTML tags and entities with "appropriate" conversions. I wouldn't use it on an article about HTML.
  2. It doesn't help with comments that begin with a "#" character, as they don't make it into the block at all.
  3. I've only tried this on Medium articles.
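
Limitation 1 could be narrowed by applying the same kind of substitutions only inside fenced code blocks, so that prose elsewhere in the note is left alone. This is a rough sketch and not something from the thread: cleanHtmlInFencedBlocks is a hypothetical function name, and it assumes that the clipped HTML sits between triple-backtick fences and that no fence appears inside a block.

```javascript
// Hypothetical sketch: clean leftover HTML only inside fenced code blocks.
const FENCE = "`".repeat(3);

// Matches an opening fence (with optional info string), the block body,
// and the closing fence. The body is matched lazily so blocks stay separate.
const FENCED_BLOCK = new RegExp(
  FENCE + "([^\\n]*)\\n([\\s\\S]*?)\\n" + FENCE,
  "g"
);

function cleanHtmlInFencedBlocks(markdown) {
  return markdown.replace(FENCED_BLOCK, (match, info, body) => {
    const cleaned = body
      .replace(/<br\s*\/?>/gi, "\n")    // <br> back to real newlines
      .replace(/<\/?span[^>]*>/gi, "")  // drop highlighting spans
      .replace(/&lt;/g, "<")
      .replace(/&gt;/g, ">")
      .replace(/&amp;/g, "&");          // decoded last to avoid double-decoding
    return FENCE + info + "\n" + cleaned + "\n" + FENCE;
  });
}
```

Inside the template above, a call such as contents = cleanHtmlInFencedBlocks(contents) would stand in for the replaceAll chain. Text outside the fences, including any HTML an article about HTML might legitimately contain, passes through unchanged.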