Open SidneyLann opened 6 months ago
public static void utf8ToGbk() throws Exception {
    String fileName = "c:/tokenizer.json";
    List<String> lines = Files.readAllLines(Paths.get(fileName), Charset.forName("utf-8"));
    for (String sentence : lines) {
        // Printing the line directly shows no Chinese either:
        //System.out.println(sentence);
        // Attempt: encode as GBK, then decode with the platform default charset.
        System.out.println(new String(sentence.getBytes("GBK")));
    }
}
Even with this I still can't see any Chinese. What do I need to do to see the Chinese tokens in the vocabulary?
That's not how you view it.
My text editor is already set to UTF-8 and I still can't see them. How can I get them to show up?
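I suggest reading up on how the llama3 tokenizer works. You probably can't read the Chinese out of the file directly, because the Chinese characters have all been broken up into smaller pieces.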
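To make the point about the characters being broken up concrete, here is a minimal sketch of turning one vocabulary entry back into readable text. It assumes the entries in tokenizer.json use the GPT-2 style byte-to-unicode mapping that byte-level BPE vocabularies commonly use: each UTF-8 byte of the original text is stored as one printable stand-in character. Under that assumption the two characters U+4F60 U+597D ("hello" in Chinese) would be stored as the six stand-in characters U+00E4 U+00BD U+0142 U+00E5 U+00A5 U+00BD, which is why no encoding setting in an editor makes them look Chinese. The class and method names (`ByteLevelTokenDecoder`, `decodeToken`) are hypothetical and not part of llama3 or any library.

```java
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of llama3: undoes the assumed GPT-2 style
// byte-to-unicode mapping for a single vocabulary entry.
class ByteLevelTokenDecoder {

    // Stand-in character -> original byte value (0..255).
    private static final Map<Character, Integer> CHAR_TO_BYTE = buildCharToByte();

    private static Map<Character, Integer> buildCharToByte() {
        Map<Character, Integer> map = new HashMap<>();
        int next = 0; // counts the bytes that need a stand-in above U+00FF
        for (int b = 0; b < 256; b++) {
            boolean printable = (b >= 0x21 && b <= 0x7E)
                    || (b >= 0xA1 && b <= 0xAC)
                    || (b >= 0xAE && b <= 0xFF);
            if (printable) {
                map.put((char) b, b);            // byte is shown as itself
            } else {
                map.put((char) (256 + next), b); // byte is shown as U+0100 + n
                next++;
            }
        }
        return map;
    }

    // Maps every character back to its byte and decodes the bytes as UTF-8.
    // Entries containing characters outside the table are returned unchanged.
    static String decodeToken(String token) {
        byte[] bytes = new byte[token.length()];
        for (int i = 0; i < token.length(); i++) {
            Integer b = CHAR_TO_BYTE.get(token.charAt(i));
            if (b == null) {
                return token;
            }
            bytes[i] = (byte) b.intValue();
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // Write UTF-8 to stdout regardless of the platform default charset.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        // Stand-in form of the two characters U+4F60 U+597D.
        out.println(decodeToken("\u00e4\u00bd\u0142\u00e5\u00a5\u00bd"));
        // U+0120 is the stand-in for a space byte.
        out.println("[" + decodeToken("\u0120the") + "]");
    }
}
```

To use this on the real file, the vocabulary entries would first have to be pulled out with a JSON parser and each key passed to `decodeToken`. Two caveats: an entry that holds only part of a multi-byte character will decode to replacement characters, since a single Chinese character can be spread over more than one token, and the console itself still has to be able to display UTF-8 for the output to be readable.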
The llama3 code is very short, and I can't tell from it how Chinese is read. And how is it trained?