FinalsClub / karmaworld

KarmaNotes.org v3.0
GNU Affero General Public License v3.0
7 stars 6 forks source link

html2text creates ill formatted markdown when user has space delimited formatting #356

Closed JHilker closed 9 years ago

JHilker commented 10 years ago

Related to https://github.com/FinalsClub/karmaworld/issues/332

Hack was introduced into global.css to make code blocks generated by marked.js look like normal text. https://github.com/FinalsClub/karmaworld/pull/355

Our branch of html2text currently creates a full line break for every p tag in the google doc instead of ever using soft line breaks. This decision was made to increase the likelihood of good formatting for users who never use newlines in between sections in their notes.

However, in cases where whitespace is used to format these lines, markdown interprets this as a code block due to the new line above it. In these cases we should not have a newline above the space formatted line, and should instead use a soft line break (two spaces.)

This would require either checking the nested contents of everything included in the tag we are parsing, or possibly doing a regex search at the end. Neither of these scenarios are ideal.

I think this is for the most part a corner case we shouldn't stress over unless it becomes clear that it is a common issue amongst our users.

AndrewMagliozzi commented 10 years ago

I'm surprised there isn't an option to pass HTML2Text upon conversion to remove code-highlighting. Or, is this an issue with the client-side rendering framework?

On Wed, Mar 12, 2014 at 4:34 PM, Jacob Hilker notifications@github.comwrote:

Related to #332 https://github.com/FinalsClub/karmaworld/issues/332

Hack was introduced into global.css to make code blocks generated by marked.js look like normal text. #355https://github.com/FinalsClub/karmaworld/pull/355

Our branch of html2text currently creates a full line break for every p tag in the google doc instead of ever using soft line breaks. This decision was made to increase the likelihood of good formatting for users who never use newlines in between sections in their notes.

However, in cases where whitespace is used to format these lines, markdown interprets this as a code block due to the new line above it. In these cases we should not have a newline above the space formatted line, and should instead use a soft line break (two spaces.)

This would require either checking the nested contents of everything included in the tag we are parsing, or possibly doing a regex search at the end. Neither of these scenarios are ideal.

I think this is for the most part a corner case we shouldn't stress over unless it becomes clear that it is a common issue amongst our users.

Reply to this email directly or view it on GitHubhttps://github.com/FinalsClub/karmaworld/issues/356 .

JHilker commented 10 years ago

The issue lies with how HTML2Text is adding new lines for each p tag in the document to format it. In cases where the next p tag contains leading white space, this becomes a code tag because of that whitespace. It's not deliberately trying to make them. Originally HTML2Text was designed to only do these hard breaks only when there was an empty line in the document, and did soft breaks otherwise. This however caused serious formatting issues if users didn't add empty lines to their notes at all. So I elected to always use hard breaks, since lack of whitespace is probably a far more common use case then a user using tabs/spaces for format their indentation instead of bullets.

The client-side rendering framework is correct in interpreting markdown like that as code blocks, as that is valid markdown syntax. I wouldn't want to modify how it is working, since it's technically correct. (Though IIRC there is a way to tell it how to highlight code blocks via passing in a custom method)

Quick example of what happens now

Example Input

Note1
Note2
    IndentedNote

Current HTML2Text output. IndentedNote is considered a code block

Note1

Note2

    IndentedNote

Desired HTML2Text output. IndentedNote is soft breaked to the next line

Note1

Note2  
     IndentedNote

Thinking about it now, it may also be possible to going back to softbreaks only, and fixing the formatting issues there instead. They mostly occurred when the source document didn't have a new line between the end of a list, and the next p tag. So maybe if we force a hard break at the end of every list, that might work? This ~might~ be the easiest solution if we decide to go forward with resolving this ticket, but needs further investigation and testing.

AndrewMagliozzi commented 10 years ago

Thanks for the heads up, Jacob. Did forcing the heard break after a list work? I don't think we'll have to worry about code blocks at all.

On Sun, Mar 16, 2014 at 2:56 PM, Jacob Hilker notifications@github.comwrote:

The issue lies with how HTML2Text is adding new lines for each p tag in the document to format it. In cases where the next p tag contains leading white space, this becomes a code tag because of that whitespace. It's not deliberately trying to make them. Originally HTML2Text was designed to only do these hard breaks only when there was an empty line in the document, and did soft breaks otherwise. This however caused serious formatting issues if users didn't add empty lines to their notes at all. So I elected to always use hard breaks, since lack of whitespace is probably a far more common use case then a user using tabs/spaces for format their indentation instead of bullets.

The client-side rendering framework is correct in interpreting markdown like that as code blocks, as that is valid markdown syntax. I wouldn't want to modify how it is working, since it's technically correct. (Though IIRC there is a way to tell it how to highlight code blocks via passing in a custom method)

Quick example of what happens now

Example Input

Note1 Note2 IndentedNote

Current HTML2Text output. IndentedNote is considered a code block

Note1

Note2

IndentedNote

Desired HTML2Text output. IndentedNote is soft breaked to the next line

Note1

Note2   IndentedNote

Thinking about it now, it may also be possible to going back to softbreaks only, and fixing the formatting issues there instead. They mostly occurred when the source document didn't have a new line between the end of a list, and the next p tag. So maybe if we force a hard break at the end of every list, that might work? This ~might~ be the easiest solution if we decide to go forward with resolving this ticket, but needs further investigation and testing.

Reply to this email directly or view it on GitHubhttps://github.com/FinalsClub/karmaworld/issues/356#issuecomment-37766331 .

JHilker commented 10 years ago

Parens inside code blocks are not unescaped. This leads to output such as (something).

JHilker commented 10 years ago

Also, a paragraph that is tabbed in will result in a code block, this is a common use case, and makes this issue much more important to fix I think.

AndrewMagliozzi commented 10 years ago

good catch @charlesconnell

On Tue, Mar 18, 2014 at 5:19 PM, Jacob Hilker notifications@github.comwrote:

Also, a paragraph that is tabbed in will result in a code block, this is a common use case, and makes this issue much more important to fix I think.

Reply to this email directly or view it on GitHubhttps://github.com/FinalsClub/karmaworld/issues/356#issuecomment-37989863 .

btbonval commented 10 years ago

@AndrewMagliozzi With @JHilker doing other work, do we know where this ticket ended up?

Does this ticket represent the current state of dealing with space delimited formats?

AndrewMagliozzi commented 10 years ago

I believe this is still an open issue.

On Jun 12, 2014, at 11:51 PM, Bryan Bonvallet notifications@github.com wrote:

@AndrewMagliozzi With @JHilker doing other work, do we know where this ticket ended up?

Does this ticket represent the current state of dealing with space delimited formats?

— Reply to this email directly or view it on GitHub.

btbonval commented 9 years ago

This issue seems related to #227 , which was closed.

With the move to HTML as a primary source in #272 , the markdown engine might end up resolving or at least changing some of the problems noted in this issue.

I think we can close this and raise a new issue if a related problem arises with the changes being made.

btbonval commented 9 years ago

Markdown is no longer primary, but vestigial. There might be html2text issues that remain, but they will no longer be related to markdown. HTML is the primary note storage format now.