chaodreaming / Simple-LaTeX-OCR

Large-scale training of a LaTeX formula recognition model; currently being organized and open-sourced
Apache License 2.0

LaTeX without style components #3

Open hovkaren opened 4 weeks ago

hovkaren commented 4 weeks ago

Hi @chaodreaming, thank you for this repository.

Is it possible to add an API option that predicts LaTeX from an image without the LaTeX style components?

For example, for this input image:

[image: math]

the current output LaTeX is:

(\mathfrak{a}+\mathfrak{b})^{2}=\mathfrak{a}^{2}+2\mathfrak{a}\mathfrak{b}+\mathfrak{b}^{2}

Without style components, it would be:

({a}+{b})^{2}={a}^{2}+2{a}{b}+{b}^{2}

chaodreaming commented 3 weeks ago

This is caused by the model's insufficient generalization capability and by a dataset that is not clean enough. The generalization problem is currently being addressed. Please contact me if you know of a tool that can express the dataset in a uniform way.

hovkaren commented 3 weeks ago

Thanks for the answer, @chaodreaming. I need only the math formulas without style components: just variables, math operations, and functions. As I understand it, we would need to create new ONNX models to get those results. Is there any documentation on how to do that?

chaodreaming commented 3 weeks ago

It is possible to remove all styles with regular expressions. However, the underlying issue is the model's insufficient generalization ability, which leads to a poor understanding of image features.

hovkaren commented 3 weeks ago

Hi @chaodreaming. Thanks for the answer. I think I will do that with regular expressions.
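
For anyone landing on this issue later, here is a minimal sketch of the regular-expression post-processing discussed above. It is not part of Simple-LaTeX-OCR: the function name `strip_styles` and the `STYLE_COMMANDS` list are assumptions made for illustration, and the list probably needs extending for your data. It only handles the braced form (`\mathfrak{a}`), not the unbraced one (`\mathfrak a`), and it keeps the braces so that grouping such as `x^\mathrm{ab}` still works as `x^{ab}`.

```python
import re

# Hypothetical list of style commands to strip; extend it as needed.
STYLE_COMMANDS = (
    "mathfrak", "mathbf", "mathrm", "mathit", "mathcal",
    "mathbb", "mathsf", "mathtt", "boldsymbol", "bm",
)

# Match a style command only when it is followed by an opening brace.
# The lookahead leaves the brace in place, so the argument stays grouped,
# and commands that merely share a prefix (e.g. \bmod) are not touched.
STYLE_RE = re.compile(r"\\(?:" + "|".join(STYLE_COMMANDS) + r")\s*(?=\{)")


def strip_styles(latex: str) -> str:
    """Remove style commands from a LaTeX string, keeping their arguments."""
    return STYLE_RE.sub("", latex)


if __name__ == "__main__":
    predicted = (
        r"(\mathfrak{a}+\mathfrak{b})^{2}"
        r"=\mathfrak{a}^{2}+2\mathfrak{a}\mathfrak{b}+\mathfrak{b}^{2}"
    )
    print(strip_styles(predicted))
```

Applied to the model output quoted in the first comment, this produces the style-free form requested there. Nested styles such as `\mathbf{\mathrm{x}}` are also stripped in one pass, leaving `{{x}}`.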