croqaz / clean-mark

Convert an article into clean text
MIT License
600 stars, 51 forks

If the code of the article has Chinese comments, weird encoding can appear #7

Open SilenceZhou opened 4 years ago

SilenceZhou commented 4 years ago

If the code in an article contains Chinese comments, the output comes out with garbled encoding.

Try this URL (a Chinese blog):

clean-mark "https://juejin.im/post/5e916011e51d4547153d15c7"

gaozhao7 commented 4 years ago

Same problem here. Example: {@link 包名.类名#方法名(参数类型)} --> {@link 包名.类名#方法名(参数类型)}
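For readers unfamiliar with this kind of garbling, here is a minimal Python sketch of one common way it arises: bytes written in one encoding get decoded with another. It illustrates the general mechanism only and is not a diagnosis of what clean-mark does. The function names `garble` and `ungarble`, and the choice of Latin-1 as the wrong codec, are assumptions made for the example.

```python
def garble(text: str, wrong_codec: str = "latin-1") -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_codec)


def ungarble(text: str, wrong_codec: str = "latin-1") -> str:
    """Reverse garble(): recover the original bytes and decode them as UTF-8."""
    return text.encode(wrong_codec).decode("utf-8")


sample = "{@link 包名.类名#方法名(参数类型)}"
broken = garble(sample)

# The ASCII parts survive, but every Chinese character turns into several
# unrelated Latin-1 characters.
print(repr(broken))
print(ungarble(broken) == sample)
```

Because Latin-1 maps every byte to a character, this particular mix-up is reversible; other wrong codecs may drop or replace bytes and lose the text for good.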

croqaz commented 4 years ago

Hi guys, thank you for raising this issue! I implemented an encoding feature some time ago (see https://github.com/croqaz/clean-mark/issues/2). In the case of the website you mention, the encoding cannot be detected from the meta charset.

I will implement a new command line flag so you can specify the encoding manually, e.g. --encoding gb2312. I will probably implement this in the next few days.
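The manual override described above could look roughly like the Python sketch below. It is not clean-mark's implementation (clean-mark is a Node.js tool); `decode_html`, `sniff_meta_charset`, and the fallback order are hypothetical and only illustrate preferring a user-supplied encoding over whatever the page declares.

```python
import re
from typing import Optional

# Matches charset declarations such as <meta charset="gbk"> or
# <meta http-equiv="Content-Type" content="text/html; charset=gb2312">.
META_CHARSET = re.compile(rb"""<meta[^>]+charset\s*=\s*["']?([\w-]+)""", re.IGNORECASE)


def sniff_meta_charset(raw: bytes) -> Optional[str]:
    """Return the charset declared in a <meta> tag, if one is found."""
    match = META_CHARSET.search(raw[:4096])  # declarations sit near the top
    return match.group(1).decode("ascii") if match else None


def decode_html(raw: bytes, encoding: Optional[str] = None) -> str:
    """Decode page bytes, preferring an explicit encoding over detection."""
    chosen = encoding or sniff_meta_charset(raw) or "utf-8"
    # errors="replace" substitutes undecodable bytes instead of raising.
    return raw.decode(chosen, errors="replace")


# A page with no <meta charset>, stored as GB2312 bytes.
page = "<html><body><pre>// 输出 test.js</pre></body></html>".encode("gb2312")
print(decode_html(page, encoding="gb2312"))  # explicit override
print(decode_html(page))                     # no <meta>, falls back to UTF-8
```

With the override the Chinese comment decodes intact; without it, the UTF-8 fallback cannot make sense of the GB2312 bytes, which is the situation a manual flag is meant to cover.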

kiyoakii commented 4 years ago

That would be great for Chinese users; I just ran into the same issue. Thank you very much.

croqaz commented 4 years ago

Guys, I haven't had time to look at this issue very deeply, sorry about that. But I did find something, and there's good news and bad news. The good news is that the encoding is handled correctly all the way through the HTML stage. The bad news is that the breakdance library, which converts the HTML into Markdown, breaks the encoding inside code blocks.

You can actually check this on your own like this:

clean-mark 'https://juejin.im/post/5e916011e51d4547153d15c7' -t html

You'll see that the HTML is correct. At least it looks correct to me, but I don't understand the language... So I'll look into this more and see if there's anything I can do.

In the worst case, I will have to look at alternative libraries to convert the HTML into Markdown, if there are any.

kiyoakii commented 4 years ago

I have just checked the HTML generated by the command above, and it is correct. Thank you for doing this for us; hopefully it will be solved one day.

croqaz commented 3 years ago

Hi guys, I believe I fixed the issue in the latest commit. I replaced "breakdance" with "turndown" to convert the HTML into Markdown, and it works much better. I haven't made a release yet because the tests are still broken, but it would be amazing if you could clone the repo and check a few websites. I'm also thinking of adding a few pages to the tests, just to make sure the app keeps working. Would you mind giving me 2-3 links to articles that you think are most important?
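One possible shape for such a regression check, sketched in Python with only the standard library: collect the non-ASCII runs that appear inside `<pre>` blocks of the HTML and report any that are missing from the Markdown output. The names `PreTextCollector` and `missing_code_text` are hypothetical, and the Markdown strings here are hand-written samples, not output produced by turndown or clean-mark.

```python
import re
from html.parser import HTMLParser
from typing import List


class PreTextCollector(HTMLParser):
    """Collect the text found inside <pre> elements."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.depth = 0  # nesting level of open <pre> tags
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def missing_code_text(html: str, markdown: str) -> List[str]:
    """Return non-ASCII runs from <pre> blocks that are absent in the Markdown."""
    collector = PreTextCollector()
    collector.feed(html)
    collector.close()
    runs = re.findall(r"[^\x00-\x7f]+", "".join(collector.chunks))
    return [run for run in runs if run not in markdown]


html = "<p>intro</p><pre><code>run(); //输出 test.js</code></pre>"
good_md = "intro\n\n    run(); //输出 test.js\n"  # comment preserved
bad_md = "intro\n\n    run();\n"                 # comment dropped

print(missing_code_text(html, good_md))
print(missing_code_text(html, bad_md))
```

An empty list means every non-ASCII run from the code blocks made it into the Markdown; anything else names the text that was garbled or dropped along the way.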

codeth99 commented 3 years ago

Thanks! Thanks! Thanks! I have cloned the repo and checked a few websites, and it generally works, for example: https://blog.csdn.net/weixin_33743248/article/details/88733044 😄

However, in this article (https://blog.csdn.net/NextStand/article/details/59535555), some code comments such as “//输出 test.js” are lost.