Cobertos / md2notion

A better Notion.so Markdown importer
MIT License

Math equation is broken #31

Open shizidushu opened 3 years ago

shizidushu commented 3 years ago
### Linear Models and Least Squares

Given a vector of inputs $X^T=(X_1, X_2, \ldots, X_p)$, we predict output $Y$ via the model
$$
\hat{Y} = \hat{\beta}_0 + \sum_{j=1}^p X_j \hat{\beta}_j
$$
The term $\hat{\beta}_0$ is the intercept, also known as the *bias* in machine learning. Often it is convenient to include the constant variable 1 in $X$, include $\hat{\beta_0}$ in the vector of coefficients $\hat{\beta}$, and then write the linear model in vector form as an inner product
$$
\hat{Y} = X^T \hat{\beta}
$$
where $X^T$ denotes vector or matrix transpose ($X$ being a column vector). Here we are modeling a single output, so $\hat{Y}$ is a scalar; in general $\hat{Y}$ can be a $K$-vector, in which case $\beta$ would be a $p \times K$ matrix of coefficients. In the $(p+1)$-dimensional input-output space, $(X, \hat{Y})$ represents a hyperplane. If the constant is included in $X$, then the hyperplane includes the origin and is a subspace; if not; it is an affine set cutting the $Y$-axis at the point $(0, \hat{\beta}_0)$. From now on we assume that the intercept is included in $\hat{\beta}$.

In Typora: [image]

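import re
from notion.block import PageBlock
from md2notion.NotionPyRenderer import NotionPyRenderer, addLatexExtension
from md2notion.upload import convert, uploadBlock

# `page` is an existing notion-py page; `pattern` is the `$$` regex defined later in this thread.
with open('temp.md', "r", encoding="utf-8") as mdFile:
    newPage = page.children.add_new(PageBlock, title=mdFile.name)

    # Put the `$$` fences on their own lines before converting.
    txt = mdFile.read()
    txt_list = re.split(pattern, txt)
    for i, string in enumerate(txt_list):
        if string == '':
            txt_list[i] = '\n'
    new_txt = ''.join(txt_list)

    rendered = convert(new_txt, addLatexExtension(NotionPyRenderer))
    for blockDescriptor in rendered:
        uploadBlock(blockDescriptor, newPage, mdFile.name)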

The equation is broken: [image]

Cobertos commented 3 years ago

Hmm, it looks like the _0 ... m_ span in the equation (from the underscore in \hat{\beta}_0 to the one in \sum_) was interpreted as Markdown italics by notion-py?

That's the only difference I see between your equation and the below:

\hat{Y} = \hat{\beta}_0 + \sum_{j=1}^p X_j \hat{\beta}_j
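
To illustrate the suspicion above, here is a minimal sketch that applies a naive emphasis rule to the equation source. The NAIVE_ITALIC regex and the first_italic_span helper are hypothetical stand-ins for an inline Markdown parser, not notion-py's actual code; they only show why the text between two underscores is a candidate for italics.

```python
import re

# Hypothetical, simplified emphasis rule: anything between two underscores.
# This is an illustration only, not notion-py's real parser.
NAIVE_ITALIC = re.compile(r'_(.+?)_')

latex = r'\hat{Y} = \hat{\beta}_0 + \sum_{j=1}^p X_j \hat{\beta}_j'

def first_italic_span(text):
    """Return the first span the naive rule would treat as italic, or None."""
    match = NAIVE_ITALIC.search(text)
    return match.group(1) if match else None

print(first_italic_span(latex))  # 0 + \sum
```

The matched span starts right after \hat{\beta}_ and ends at the underscore of \sum_, which lines up with the _0 ... m_ region mentioned above.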

Cobertos commented 3 years ago

This line should be setting title_plaintext like CodeBlock does, instead of title. That should fix it

shizidushu commented 3 years ago

This line should be setting title_plaintext like CodeBlock does, instead of title. That should fix it

@Cobertos Thanks. I got it working.


from mistletoe.block_token import BlockToken
from mistletoe.html_renderer import HTMLRenderer
from mistletoe import span_token
from mistletoe.block_token import tokenize

from md2notion.NotionPyRenderer import NotionPyRenderer

from notion.block import EquationBlock, field_map

class CustomEquationBlock(EquationBlock):

    latex = field_map(
        ["properties", "title_plaintext"],
        python_to_api=lambda x: [[x]],
        api_to_python=lambda x: x[0][0],
    )

    _type = "equation"

class CustomNotionPyRenderer(NotionPyRenderer):

    def render_block_equation(self, token):
        def blockFunc(blockStr):
            return {
                'type': CustomEquationBlock,
                'title_plaintext': blockStr #.replace('\\', '\\\\')
            }
        return self.renderMultipleToStringAndCombine(token.children, blockFunc)

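import re
# Matches a `$$` fence: up to 3 leading spaces, two or more `$`, optional spaces,
# then any non-space text that follows on the same line.
pattern = re.compile(r'( {0,3})((?:\$){2,}) *(\S*)')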

class Document(BlockToken):
    def __init__(self, lines):
        if isinstance(lines, str):
            lines = lines.splitlines(keepends=True)
        else:
            txt = lines.read()
            txt_list = re.split(pattern, txt)
            for i, string in enumerate(txt_list):
                if string == '':
                    txt_list[i] = '\n'
            lines = ''.join(txt_list)
            lines = lines.splitlines(keepends=True)
        lines = [line if line.endswith('\n') else '{}\n'.format(line) for line in lines]
        self.footnotes = {}
        global _root_node
        _root_node = self
        span_token._root_node = self
        self.children = tokenize(lines)
        span_token._root_node = None
        _root_node = None

def markdown(iterable, renderer=HTMLRenderer):
    """
    Output HTML with default settings.
    Enables inline and block-level HTML tags.
    """
    with renderer() as renderer:
        return renderer.render(Document(iterable))

def convert(mdFile, notionPyRendererCls=NotionPyRenderer):
    """
    Converts a mdFile into an array of NotionBlock descriptors
    @param {file|string} mdFile The file handle to a markdown file, or a markdown string
    @param {NotionPyRenderer} notionPyRendererCls Class inheritting from the renderer
    incase you want to render the Markdown => Notion.so differently
    """
    return markdown(mdFile, notionPyRendererCls)
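
As a rough illustration of what the split-and-join preprocessing in Document.__init__ does to the input, the sketch below runs the same pattern on a small sample. The pad_dollar_fences helper and the sample string are assumptions made up for this example; only the regex comes from the snippet above.

```python
import re

# Same pattern as in the snippet above: a `$$` fence with up to three leading spaces.
pattern = re.compile(r'( {0,3})((?:\$){2,}) *(\S*)')

def pad_dollar_fences(text):
    """Mimic the preprocessing above: split on the fence, turn empty pieces into newlines."""
    pieces = re.split(pattern, text)
    return ''.join('\n' if piece == '' else piece for piece in pieces)

# A paragraph immediately followed by a block equation, with no blank line between.
sample = 'text\n$$\nx = 1\n$$\n'
print(repr(pad_dollar_fences(sample)))
```

On this sample, each $$ ends up with a blank line on both sides, so blank lines also appear just inside the fence.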

shizidushu commented 3 years ago

The InlineEquation has the same problem. @Cobertos Can you have a look?

shizidushu commented 3 years ago

I commented out this line: https://github.com/miyuchina/mistletoe/blob/2cfe7446b975685f98837f9e40aaabcc0e270a79/mistletoe/core_tokens.py#L63 After that, the InlineEquation works.

Cobertos commented 3 years ago

I'll leave this issue open until the fix lands in the library itself. I'll need to do that soon.

As for the inline equations, notion-py is the one that actually handles uploading inline equations to Notion, added in this PR. This is because it does some special conversions to convert to Notion's expected format.

Looking at that PR, notion-py's inline equations appear to be delimited with double $$, not a single $. That differs from your example, so I'm not sure whether it is working for you?

In md2notion, though, emphasis is handled by re-emitting the Markdown itself, since notion-py parses it later. That causes problems in your case because the _ is converted to '*'. I will look into whether mistletoe can carry over the exact emphasis marker. That should at least preserve your _ and let notion-py handle the rest.
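
A minimal sketch of that failure mode, assuming a simplified "parse emphasis, then re-emit it" step. The UNDERSCORE_EMPHASIS regex and the reemit_emphasis helper are hypothetical and are not md2notion's or mistletoe's actual code:

```python
import re

# Hypothetical stand-in for "parse emphasis, then re-emit it as Markdown".
# Not md2notion's or mistletoe's actual code.
UNDERSCORE_EMPHASIS = re.compile(r'_(.+?)_')

def reemit_emphasis(text, marker='*'):
    """Re-emit every underscore-delimited span using `marker` instead."""
    return UNDERSCORE_EMPHASIS.sub(lambda m: marker + m.group(1) + marker, text)

inline = r'$\hat{\beta}_0 + \sum_j X_j$'

# Re-emitting with '*' rewrites the LaTeX subscripts.
print(reemit_emphasis(inline))
# Re-emitting with the original marker leaves the text untouched.
print(reemit_emphasis(inline, marker='_'))
```

Carrying the original marker through, as in the second call, is the behavior the paragraph above hopes to get from mistletoe.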

shizidushu commented 3 years ago

There is no problem with the single $; it is already handled correctly somewhere.

Another problem worth mentioning: if there is no blank line before the block equation, the block equation is treated as part of a TextBlock. To avoid this, I add \n before and after the double $$ and then trim the equation block string.

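import itertools

# Surround every `$$` fence line with blank lines so that mistletoe does not
# absorb the block equation into the neighboring paragraph.
new_lines = []
for i, line in enumerate(lines):
    new_line = [None, line, None]
    if line == '$$\n':
        # Add a blank line before the fence unless the previous line is already blank.
        if i > 0 and lines[i - 1][0] != '\n':
            new_line[0] = '\n'
        # Add a blank line after the fence unless the next line is already blank.
        if i < len(lines) - 1 and lines[i + 1][0] != '\n':
            new_line[2] = '\n'
    new_lines.append(new_line)
# Flatten the triples and drop the unused None placeholders.
new_lines = [part for part in itertools.chain(*new_lines) if part is not None]
lines = ''.join(new_lines).splitlines(keepends=True)
lines = [line if line.endswith('\n') else '{}\n'.format(line) for line in lines]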

I hope the package can handle this too, perhaps more intelligently.

Cobertos commented 3 years ago

I'd have to play with it more, but I think requiring 2 line breaks between blocks is part of Markdown itself.

(Sent by email in reply to shizidushu's comment above, dated Tue, Mar 16, 2021, 7:38 AM.)

Cobertos commented 3 years ago

title_plaintext is now in master, and I also added two tests. I still need to push a package update.

To summarize the current state of the fixes and questions related to equation blocks:

The InlineEquation has the same problem.

I added a test that now tests for this. What gets passed to notion-py should be well-formed. Looks like notion-py is parsing the inline Markdown again, so that is most likely where the issue arises.

I don't see an easy fix for notion-py on this though...

There is no problem related to the single $, it has been handled well somewhere.

Whoops, yes, I was mistaken. This works correctly. Single $'s are converted to $$ in the output.

There is another problem worth mentioning: if there is no blank line before the block equation, the block equation will be treated as part of TextBlock.

Hmm, I am seeing this issue. Ideally we would support this sort of case because it's similar to how CommonMark's specification describes code fences. "A fenced code block may interrupt a paragraph, and does not require a blank line either before or after."

After some research, the issue lies in how mistletoe's Paragraph block read() function works. It specifically looks for CodeFence.start() to break out of its read() loop. To fix this, we would need to edit Paragraph's read() function to check BlockEquation.start() there too.
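
The idea can be sketched with a toy paragraph reader. Everything below is a hypothetical stand-in written for this comment (code_fence_start, block_equation_start, and read_paragraph are not mistletoe's real functions); it only shows how adding one more "interrupting" start check changes where the paragraph ends.

```python
import re

# Hypothetical stand-ins for CodeFence.start() and BlockEquation.start();
# these are NOT mistletoe's real implementations.
def code_fence_start(line):
    return re.match(r' {0,3}(`{3,}|~{3,})', line) is not None

def block_equation_start(line):
    return re.match(r' {0,3}\${2,}', line) is not None

def read_paragraph(lines, interrupters=(code_fence_start,)):
    """Collect lines until a blank line or a line that starts an interrupting block."""
    buffer = []
    for line in lines:
        if line.strip() == '':
            break
        # Only a non-empty paragraph can be interrupted.
        if buffer and any(start(line) for start in interrupters):
            break
        buffer.append(line)
    return buffer

doc = ['some text\n', '$$\n', 'x = 1\n', '$$\n']

# Current behavior: the equation is swallowed by the paragraph.
print(read_paragraph(doc))
# With the extra check: the paragraph stops before the `$$` fence.
print(read_paragraph(doc, (code_fence_start, block_equation_start)))
```

Passing the start checks in as a tuple keeps the sketch short; in mistletoe itself the change would presumably be made directly in the Paragraph read() loop condition.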

Cobertos commented 3 years ago

I've tagged the inline equation issue as upstream. I'm open to ideas for fixing the newline issue; I can't think of an easy way to integrate that.