Converter is adding ">" before all bullets, as well as extra lines

nixsee commented 4 years ago

Thanks again for making this. It's a major improvement over my previous workflow of OneNote -> Evernote -> Notion or Joplin. Though I've found what I consider to be a couple of bugs, and I hope you can help!

Here's a few screenshots. Original:

After search and replace for "> ", to get rid of the quote blocks:

After removing extra lines:

The search and replace isn't a huge deal, though it becomes a trickier regex job when there are links and other formatting in the OneNote content that use < >.

The extra lines are much more of a nuisance, as they mess up the formatting of the bullet lists.
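A minimal sketch of that cleanup in Python, assuming the unwanted markers only ever appear at the very start of a line. The function name `strip_blockquote_markers` is made up for illustration; anchoring the pattern to the line start is what keeps `<` and `>` inside links from being touched.

```python
import re

# Match blockquote markers only at the start of a line:
# one or more ">" characters, each optionally followed by a single space.
_QUOTE_PREFIX = re.compile(r"^(?:> ?)+", flags=re.MULTILINE)


def strip_blockquote_markers(markdown: str) -> str:
    """Remove leading "> " markers while leaving any other ">" alone."""
    return _QUOTE_PREFIX.sub("", markdown)


if __name__ == "__main__":
    sample = "> - first\n>\n>   - nested <https://example.com>\nplain a > b\n"
    print(strip_blockquote_markers(sample))
```

Only one space is consumed after each marker, so the indentation of nested bullets survives.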

Anything that can be done about this?

SjoerdV commented 4 years ago

So your question is about what's in the content of the files. This is something this script does not manage.

That said you could:

check if the problem is already in the Word files generated by the OneNote Publish routine, by commenting out the Remove-Item statement (line 203)
play around with the Docx2Md conversion that 'pandoc.exe' handles (script line 132); maybe you could do better than the GFM output I used, but there are a lot of options

johnkyle4 commented 4 years ago

In my testing of this issue (per Sjoerd's suggestions) the Word docs come out perfectly. Unordered lists have bullets, ordered lists have numbers. So it's time to play around with pandoc!

Thank you @SjoerdV for your answers and guidance.

johnkyle4 commented 4 years ago

I played around with the various Pandoc output format options and found that gfm and commonmark don't fix the bullet problem, but the others do (except I didn't try markdown_phpextra.)

The markdown option writes bullets as - and also doesn't write any raw HTML (like <table><tr><td>wtf</td></tr></table>) which is what I'm after.

None of them removed the blank lines between list items that the OP mentioned, but I can live with that.

https://pandoc.org/MANUAL.html#options

-t FORMAT, -w FORMAT, --to=FORMAT, --write=FORMAT Specify output format. FORMAT can be: [list trimmed to markdown options]

commonmark (CommonMark Markdown)

gfm (GitHub-Flavored Markdown), or the deprecated and less accurate markdown_github; use markdown_github only if you need extensions not supported in gfm.

markdown (Pandoc’s Markdown)

markdown_mmd (MultiMarkdown)

markdown_phpextra (PHP Markdown Extra)

markdown_strict (original unextended Markdown)
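To compare these writers side by side, one option is to build one pandoc command per format and diff the results. The following is a rough sketch, not the script's actual code: `pandoc_args`, the file names and the output naming scheme are hypothetical, and only the `-f`, `-t`, `-o` and `--wrap=none` options are taken from pandoc itself.

```python
from pathlib import Path
from typing import List

# The Markdown writers listed in the pandoc manual excerpt above.
MARKDOWN_WRITERS = (
    "commonmark",
    "gfm",
    "markdown",
    "markdown_mmd",
    "markdown_phpextra",
    "markdown_strict",
)


def pandoc_args(docx_path: str, writer: str, out_dir: str) -> List[str]:
    """Build (but do not run) a pandoc command for one output format."""
    if writer not in MARKDOWN_WRITERS:
        raise ValueError(f"unknown Markdown writer: {writer}")
    # Hypothetical naming scheme: Notes.docx -> Notes.<writer>.md
    out_path = Path(out_dir) / f"{Path(docx_path).stem}.{writer}.md"
    return ["pandoc", "-f", "docx", "-t", writer,
            "--wrap=none", "-o", str(out_path), docx_path]


if __name__ == "__main__":
    for fmt in MARKDOWN_WRITERS:
        # To actually convert, pass the list to subprocess.run(args, check=True).
        print(" ".join(pandoc_args("Notes.docx", fmt, "out")))
```

Building the argument list separately from running it makes it easy to check the command before touching any files.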

nixsee commented 4 years ago

Thanks for the responses and tinkering! Same experience for me - no issues with Word, so it's a markdown conversion issue. I'll fiddle around with those options and see if I can get something that works for me. Maybe it's even possible to modify the Pandoc options, or talk to Pandoc about getting something modified.

One unrelated suggestion - it might be worth making it explicit in the instructions (both in the GitHub README and in the script prompts) that OneNote needs to be opened as administrator. I was going crazy not being able to run the modified script until I realized it was an admin thing. I've now set my OneNote to auto-run as Admin, because I am bound to forget again.

nixsee commented 4 years ago

The single vs double spacing seems to be related to "compact lists", described here

I assume that each bullet point in the intermediate docx file is treated like a paragraph, since it would have a paragraph mark (^p) at the end of each line, thus it automatically creates a loose, double-spaced, list.
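As a post-processing alternative to changing the conversion itself, a loose list can be tightened after the fact by dropping blank lines that sit directly between two list items. This is only a sketch under simplifying assumptions: the name `tighten_lists` is made up, each item is assumed to fit on one line, and fenced code blocks or thematic breaks such as `* * *` are not treated specially.

```python
import re

# Start of a list item: optional indentation, then a bullet (-, *, +)
# or an ordered marker such as "1." or "1)", followed by whitespace.
_LIST_ITEM = re.compile(r"^\s*(?:[-*+]|\d+[.)])\s+")


def tighten_lists(markdown: str) -> str:
    """Drop blank lines that sit directly between two list items."""
    lines = markdown.split("\n")
    kept = []
    for index, line in enumerate(lines):
        if line.strip() == "":
            previous = kept[-1] if kept else ""
            # Next non-blank line after this one, or "" at end of input.
            following = next((l for l in lines[index + 1:] if l.strip()), "")
            if _LIST_ITEM.match(previous) and _LIST_ITEM.match(following):
                continue  # blank line between two items: skip it
        kept.append(line)
    return "\n".join(kept)


if __name__ == "__main__":
    loose = "Intro\n\n- one\n\n- two\n\n  - nested\n\nOutro\n"
    print(tighten_lists(loose))
```

Blank lines around ordinary paragraphs are left alone, so paragraph breaks outside lists survive.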

I came across this post that describes how to get single spacing when going from docx to md, but I'm not smart enough to get it to work. Perhaps one of you is able to?

My issues are:

The pandoc syntax in that post differs greatly from our script, so not sure exactly how to integrate it
I'm not sure where to save the lua file (but ended up using a full filepath reference to find it), but it doesn't seem to work anyway.

But since I am not getting errors anymore, perhaps:

Maybe the lua code doesn't even work? I've also found this script that was built off the original, but it doesn't work for me either.

Here's the line I've used to at least not generate errors:

pandoc.exe -f docx -t markdown -L C:\Users...\plaintext.lua -i $fullexportpath -o "$($fullexportpathwithoutextension).md" --wrap=none --atx-headers --extract-media="$($fullexportdirpath)"

I had even less success with the Haskell script, which seems to need you to install and use Haskell, which I can't figure out.

SjoerdV commented 4 years ago

Hi @nixsee, about the 'running as admin' thing: that really depends on your individual security settings, as I was able to do everything in 'normal' mode once both UAC and the PowerShell execution policy were tweaked. So thanks for your suggestion, but I'll leave the documentation as it is right now, as it's more of a Windows configuration thing not related to the script.

SjoerdV commented 4 years ago

Hi @nixsee. Great you figured that all out, and thanks for sharing as this is now a valuable resource on this repository. I know the pandoc output is not absolutely fantastic but it works and it adheres to the GFM markup. I would just leave the exported files and update them only when you need something 'neat' again. At least the notes are now plaintext, searchable with no vendor lock-in, so you can do anything with them at any time.

Of course you will not be sharing the raw markdown output with anyone; you can always convert it to PDF with the nice VS Code 'manuth.markdown-converter' extension, and you will probably be using the additional markdown syntax provided by 'jebbs.markdown-extended' to really make nice documents. I would focus my efforts on getting good at Markdown and making beautiful content. Start with the notebook setup I mentioned in the Recommendations section of the README.

For instance, the 'Admonition' extended syntax from Markdown Extended provides really cool and professional-looking sections, which are great for exporting to PDF. Or if you intend to host your Markdown files (with GitHub Pages using a Jekyll server), get into learning that syntax as well. Lots to do!

nixsee commented 4 years ago

Thanks very much! This has been a godsend and I'm just being picky ;)

As it turns out, I just discovered that I can do a global find/replace for the double spacing by pressing ctrl+enter in the find/replace text boxes to enter a line break (I had tried shift and alt, but not ctrl!).

Not perfect, but combining this with changing to the pandoc "markdown" converter, as suggested by @johnkyle4, I'm 99% of the way there! Now I just have to sort through my mountains of notes and turn them into something useful, which means I'm actually 1% of the way there...

Thanks again.

nixsee commented 4 years ago

Better yet, using "^\h*\R" as a regex in Notepad++ (though I'm sure you can do something similar in VS Code) will clear all blank lines and can be done at the folder level. It also gets rid of the lines between paragraphs, but it's good enough for me.

The pandoc "markdown" converter adds the "\" escape in front of many symbols, making it hard to search for things and annoying to look at, but it renders properly and, as @johnkyle4 said, it doesn't use any HTML code for tables etc. Good enough.
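For anyone scripting this instead of using an editor, here is a rough Python equivalent. Python's `re` module does not support `\h` or `\R`, so this sketch spells them out as `[ \t]*` and `\r?\n`; the function name `remove_blank_lines` is made up for illustration.

```python
import re

# Rough equivalent of the Notepad++ pattern ^\h*\R:
# optional spaces/tabs at the start of a line, followed by a line break.
_BLANK_LINE = re.compile(r"^[ \t]*\r?\n", flags=re.MULTILINE)


def remove_blank_lines(text: str) -> str:
    """Delete every line that is empty or contains only spaces and tabs."""
    return _BLANK_LINE.sub("", text)


if __name__ == "__main__":
    print(remove_blank_lines("- one\n\n- two\n  \t\n\nPara\n"))
    # For a whole folder, one could loop over Path(folder).rglob("*.md")
    # and rewrite each file, ideally on a copy of the export first.
```

Like the editor version, this removes the blank lines between paragraphs too.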
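If the escapes get in the way of searching, one possible workaround is to search a copy of the text with the backslashes stripped. This is a sketch only: `unescape_for_search` is a made-up helper, and its output should not be written back over the notes, since removing escapes can change how the Markdown renders.

```python
import re

# A backslash followed by one ASCII punctuation character,
# which is what Markdown treats as a backslash escape.
_ESCAPED_PUNCT = re.compile(r"\\([!-/:-@\[-`{-~])")


def unescape_for_search(markdown: str) -> str:
    """Return a copy with Markdown backslash escapes removed, for searching only."""
    return _ESCAPED_PUNCT.sub(r"\1", markdown)


if __name__ == "__main__":
    print(unescape_for_search(r"snake\_case \[x\] 5 \* 3"))
```

Backslashes in front of letters or digits are left untouched, because those are not Markdown escapes.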

SjoerdV commented 4 years ago

Cool stuff with the regex indeed! VS Code has a very good find & replace function, including regex, so you're good! I'll go ahead and close this if it's alright by you. Lots of luck with your 'text cleaning'.

SjoerdV / ConvertOneNote2MarkDown

Converter is adding ">" before all bullets, as well as extra lines #8