croqaz / clean-mark

Convert an article into clean text
MIT License
600 stars, 51 forks

Encode Error #2

Closed: echoface closed this issue 6 years ago

echoface commented 6 years ago

Thanks for your magic code; it helped me solve some of my hardest problems.

But I got a wrong result with garbled content. It seems to be an encoding issue for some URLs.

For example: http://start.iresearch.cn/content/2018/03/273475.shtml

output: ���й��г���ʼ�������������Ǹ���עƷ�ʵ�ʱ�򷲿�ȴʧȥ���Լ���Ʒ�ʼ��ء��ڵ���ƽ̨����ս�����������ʱ�򣬾��������׷�����ȫƷ��֮ʱ������ȴ�����յġ��Ի͡��У�����Ű����Σ���ʧ��չ��������Ʒ�ơ���������������ƽ̨�������Լ�Ҳ˵������� .....

Could you help me find the root cause and fix it?

croqaz commented 6 years ago

Hi! Thank you for taking the time to submit the issue!

You're right, the output looks horrible. I never tried to get content from sites with other types of encoding than Latin.

I will take a look these days and see what I can do. Probably the best fix would be for you to manually specify the encoding from command line - this option is not implemented yet.

echoface commented 6 years ago

Hi croqaz,

Thank you for your prompt reply. I manually inspected the source of those pages and found that they specify the charset in the HTML header. Could we detect the charset from the HTML source? If no charset is declared, we could fall back to UTF-8.

<meta name="description" content="">
<meta charset="gb2312"/>
<meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1"/>

echoface commented 6 years ago

For my use case, a command-line option may not help: until I review the content item by item, I don't know which pages come out with the wrong encoding. I'm sorry I can't help more; I'm not familiar with the Node environment and the JavaScript stack.

Thank you for handling my feedback so quickly.
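
The detection proposed above (read the charset from the `<meta>` tag, otherwise fall back to UTF-8) could be sketched roughly as follows. This is only an illustration under stated assumptions, not how clean-mark implements it: `sniffCharset` is a hypothetical helper name, and looking only at the first 1024 bytes is an arbitrary choice.

```javascript
// Hypothetical helper (not part of clean-mark): guess the charset an HTML page declares.
function sniffCharset(bytes, fallback = 'utf-8') {
  // Charset declarations are plain ASCII, so reading the head as latin1 is safe here.
  const head = Buffer.from(bytes).subarray(0, 1024).toString('latin1');
  // Covers <meta charset="gb2312"/> as well as
  // <meta http-equiv="Content-Type" content="text/html; charset=gb2312">.
  const match = head.match(/<meta[^>]*?charset\s*=\s*["']?\s*([\w-]+)/i);
  return match ? match[1].toLowerCase() : fallback;
}

const page = Buffer.from(
  '<meta name="description" content="">\n<meta charset="gb2312"/>\n',
  'latin1'
);
console.log(sniffCharset(page)); // gb2312
console.log(sniffCharset(Buffer.from('<html></html>'))); // utf-8
```

Because the check runs per page, it needs no command-line option and no prior knowledge of which pages are affected. An HTTP `Content-Type` header, when the server sends one, would be another place to look.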

croqaz commented 6 years ago

Hey, you're right, the encoding is in the header! :) In that case it should be simple to use that. It's not implemented yet, but I'll implement it. I'm kind of busy these days, but I'll fix this, hopefully soon.

croqaz commented 6 years ago

This might be harder than I thought 😅 I found a library that helps with converting to gb2312, but I need more time. What I found that might work: if you open the badly encoded text in the Atom editor and select the encoding Chinese GBK, the text is converted into something that might be the original ?? I don't know the language, and I tried to translate it with Google Translate, but it doesn't make much sense.

title: 锟斤拷锟酵达拷芫郑锟斤拷锟绞�锟斤拷锟竭筹拷锟斤拷一锟斤拷锟斤拷锟斤拷锟斤拷思锟斤拷锟斤拷锟斤拷_锟斤拷锟斤拷锟斤拷_锟斤拷锟斤拷锟斤拷
description: 锟届极一时锟侥凤拷锟斤拷锟藉、锟斤拷锟斤拷堑氐姆锟斤拷凸锟芥,锟斤拷锟阶撅拷锟斤拷锟矫的凤拷锟斤拷锟斤拷锟斤拷锟角碉拷锟教斤拷目锟斤拷锟斤拷樱锟饺达拷锟绞�锟斤拷锟竭筹拷一锟斤拷锟斤拷锟斤拷锟斤拷思锟斤拷锟斤拷锟竭★拷

Something like that. Does that make any sense ??
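
A likely explanation for the repeated 锟斤拷 above, offered as a sketch rather than a diagnosis of what clean-mark did: once GB2312 bytes have been decoded as UTF-8, the invalid sequences become U+FFFD replacement characters and the original bytes are gone. Saving that text as UTF-8 and reopening it as GBK then yields the same few unrelated characters over and over. The snippet assumes a Node build with full ICU data, so that `TextDecoder` accepts the `gbk` label.

```javascript
// Step 1: GB2312 bytes (here the two characters 中文) wrongly decoded as UTF-8.
const original = Uint8Array.from([0xd6, 0xd0, 0xce, 0xc4]);
const lossy = new TextDecoder('utf-8').decode(original);
console.log(lossy.includes('\uFFFD')); // true: the original bytes are already lost

// Step 2: each U+FFFD is written out as the UTF-8 bytes EF BF BD.
const saved = new TextEncoder().encode('\uFFFD\uFFFD');

// Step 3: reading those bytes as GBK pairs them up into unrelated characters.
const reopened = new TextDecoder('gbk').decode(saved);
console.log(reopened); // 锟斤拷
```

So re-interpreting the already damaged output cannot recover the article; the raw response bytes have to be decoded with the right charset in the first place.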

echoface commented 6 years ago

Sorry, it's still a mess. I tried "curl http://news.iresearch.cn/content/2018/05/274436.shtml | iconv -f gb2312 -t utf-8", and this command produces the correct content.
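
For comparison, a rough Node equivalent of the `iconv -f gb2312 -t utf-8` step might look like the sketch below. `decodeBytes` is a hypothetical name, and the example assumes a Node build with full ICU data so that the built-in `TextDecoder` recognizes the `gb2312` label; it does not claim to be the approach used in clean-mark.

```javascript
// Hypothetical helper: decode raw response bytes with an explicit charset.
function decodeBytes(bytes, charset = 'utf-8') {
  // Throws a RangeError if this Node build does not know the charset label.
  return new TextDecoder(charset).decode(bytes);
}

// The two characters 中文 encoded as GB2312.
const gbBytes = Uint8Array.from([0xd6, 0xd0, 0xce, 0xc4]);

console.log(decodeBytes(gbBytes, 'gb2312')); // 中文
console.log(decodeBytes(gbBytes)); // U+FFFD replacement characters, i.e. garbled output
```

The key point is that the bytes must be kept raw (a `Buffer` or `Uint8Array`) until the charset is known; decoding them as a UTF-8 string first is what produces the garbage.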

croqaz commented 6 years ago

@HuanGong I believe it is fixed now, in v0.7. Please give it a try and let me know, because I don't understand the language.

echoface commented 6 years ago

Thanks for your magic work. Yes, it works now!

Thanks.

croqaz commented 6 years ago

Wheee 🎉