Closed: echoface closed this issue 6 years ago
Hi! Thank you for taking the time to submit the issue!
You're right, the output looks horrible. I never tried to get content from sites that use an encoding other than Latin.
I'll take a look in the next few days and see what I can do. Probably the best fix would be for you to manually specify the encoding on the command line, but this option is not implemented yet.
Hi croqaz,
Thank you for your prompt reply. I manually inspected the source of those pages and found that they declare the charset in the HTML head. Could we detect the encoding from that meta tag in the HTML source? If no charset is declared, fall back to UTF-8.
<meta name="description" content="">
<meta charset="gb2312"/>
<meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1"/>
For my use case a command-line option may not help: until I review the content item by item, I don't know which pages come out with the wrong encoding. I'm sorry I can't help more; I'm not familiar with the Node environment or the JavaScript stack.
Thank you for handling my feedback so quickly.
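The detection suggested above (read the charset from the meta tag, fall back to UTF-8 when none is declared) could look roughly like the sketch below. It is written in Python for illustration only and is not the tool's actual implementation; `sniff_charset`, `decode_html`, and the 4096-byte scan window are hypothetical choices. The `gb2312` to `gbk` alias follows how browsers treat that label.

```python
import re

# Matches <meta charset="gb2312"> as well as the older
# <meta http-equiv="Content-Type" content="text/html; charset=gb2312"> form.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE
)

# Browsers decode pages labelled "gb2312" with the GBK superset (assumption:
# we want the same leniency here).
_ALIASES = {"gb2312": "gbk"}


def sniff_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset declared in the first 4096 bytes, or the default."""
    match = _META_CHARSET.search(raw[:4096])
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()


def decode_html(raw: bytes) -> str:
    """Decode raw page bytes using the declared charset, else UTF-8."""
    charset = sniff_charset(raw)
    codec = _ALIASES.get(charset, charset)
    try:
        return raw.decode(codec, errors="replace")
    except LookupError:
        # Unknown or misspelled charset label: fall back to UTF-8.
        return raw.decode("utf-8", errors="replace")
```

The important detail is that detection runs on the raw bytes, before any decoding, so nothing is lost if the first guess would have been wrong.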
Hey, you're right, the encoding is in the header! :) In that case it should be simple to use that. It's not implemented yet, but I'll implement it. I'm kind of busy at the moment, but I'll fix this, hopefully in the next few days.
This might be harder than I thought 😅
I found a library that helps with converting to gb2312, but I need more time.
What I found that might work: if you open the badly encoded text in the Atom editor and select the encoding Chinese (GBK), the text is converted into something that might be the original ?? I don't know the language, and I tried to translate it with Google Translate but it doesn't make much sense.
title: 锟斤拷锟酵达拷芫郑锟斤拷锟绞�锟斤拷锟竭筹拷锟斤拷一锟斤拷锟斤拷锟斤拷锟斤拷思锟斤拷锟斤拷锟斤拷_锟斤拷锟斤拷锟斤拷_锟斤拷锟斤拷锟斤拷
description: 锟届极一时锟侥凤拷锟斤拷锟藉、锟斤拷锟斤拷堑氐姆锟斤拷凸锟芥,锟斤拷锟阶撅拷锟斤拷锟矫的凤拷锟斤拷锟斤拷锟斤拷锟角碉拷锟教斤拷目锟斤拷锟斤拷樱锟饺达拷锟绞�锟斤拷锟竭筹拷一锟斤拷锟斤拷锟斤拷锟斤拷思锟斤拷锟斤拷锟竭★拷
Something like that. Does that make any sense ??
Sorry, it's still a mess. I tried "curl http://news.iresearch.cn/content/2018/05/274436.shtml | iconv -f gb2312 -t utf-8", and that command produces the correct content.
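A likely explanation for why reopening the text as GBK still gives a mess while the iconv command works: once GB2312 bytes have been decoded as UTF-8, each invalid byte is replaced by U+FFFD and the original bytes are gone. Reading the saved UTF-8 text back as GBK then turns pairs of U+FFFD into the recurring "锟斤拷" pattern. The Python sketch below illustrates both paths on a stand-in string (not text from the page in this thread); the function names are hypothetical.

```python
def misdecode_then_reopen(raw: bytes) -> str:
    """Lossy path: read bytes as UTF-8, then reopen the saved result as GBK."""
    as_utf8 = raw.decode("utf-8", errors="replace")  # invalid bytes become U+FFFD
    return as_utf8.encode("utf-8").decode("gbk", errors="replace")


def transcode(raw: bytes, source: str = "gb2312") -> bytes:
    """Rough equivalent of `iconv -f gb2312 -t utf-8` applied to raw bytes."""
    return raw.decode(source).encode("utf-8")


original = "中文编码"              # stand-in text for the demonstration
raw = original.encode("gb2312")    # bytes as a GB2312 page would deliver them

garbled = misdecode_then_reopen(raw)          # not recoverable
recovered = transcode(raw).decode("utf-8")    # equals the original text

# Two replacement characters, saved as UTF-8 and read back as GBK:
pattern = "\ufffd\ufffd".encode("utf-8").decode("gbk")  # → '锟斤拷'
```

So the fix has to happen where the raw bytes are still available, by decoding them with the declared charset, rather than by re-encoding text that has already been decoded incorrectly.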
@HuanGong I believe it is fixed now, in v0.7. Please give it a try and let me know, because I don't understand the language.
Thanks for your magic work. Yes, it works now!
Thanks.
Wheee 🎉
Thanks for your magic code, it helped me solve some of my hardest problems,
but I got a wrong result with garbled content. It seems to be an encoding issue for some URLs.
For example: http://start.iresearch.cn/content/2018/03/273475.shtml
output: ���й��г���ʼ�������������Ǹ���עƷ�ʵ�ʱ��ȴʧȥ���Լ���Ʒ�ʼ��ء��ڵ���ƽ̨����ս�����������ʱ��������������ȫƷ��֮ʱ������ȴ�����յġ��Ի͡��У�����Ű����Σ���ʧ��չ��������Ʒ�ơ���������������ƽ̨�������Լ�Ҳ˵������� .....
Could you help me find the root cause and fix it?