BitFunnel / mg4j-workbench

Java tools for evaluating BitFunnel performance compared to an mg4j baseline.
GNU Lesser General Public License v3.0

UTF-8 to UTF-16 conversion in ChunkWordReader.next() is incorrect. #33

Open MikeHopcroft opened 7 years ago

MikeHopcroft commented 7 years ago

This code casts each byte to a char, so every multi-byte UTF-8 sequence is converted incorrectly. The only reason this works is that the gov2 corpus is mostly ASCII.
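To illustrate the problem, here is a minimal sketch that contrasts a per-byte cast with real UTF-8 decoding. It is not the actual ChunkWordReader code: the class `Utf8DecodeSketch` and the methods `castEachByte` and `decodeUtf8` are hypothetical names made up for this example, and it assumes the whole term is available as a byte array.

```java
import java.nio.charset.StandardCharsets;

public class Utf8DecodeSketch {
    // Hypothetical stand-in for the per-byte cast described in this issue.
    static String castEachByte(byte[] utf8) {
        StringBuilder sb = new StringBuilder();
        for (byte b : utf8) {
            // Java bytes are signed, so any byte >= 0x80 is sign-extended
            // before narrowing to char (0xC3 becomes U+FFC3, not U+00C3).
            sb.append((char) b);
        }
        return sb.toString();
    }

    // Decodes the bytes as UTF-8; Java strings are UTF-16 internally.
    static String decodeUtf8(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Pure ASCII: both approaches produce the same string.
        byte[] ascii = "gov2".getBytes(StandardCharsets.UTF_8);
        System.out.println(castEachByte(ascii).equals(decodeUtf8(ascii)));

        // "cafe" with an accented e (U+00E9) is 5 bytes in UTF-8.
        byte[] accented = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(castEachByte(accented).length()); // one char per byte
        System.out.println(decodeUtf8(accented).length());   // one char per code unit
    }
}
```

If the reader consumes bytes incrementally rather than from a complete array, a multi-byte sequence can straddle a buffer boundary, so something like `java.nio.charset.CharsetDecoder` or an `InputStreamReader` constructed with `StandardCharsets.UTF_8` may be a better fit than `new String(...)`.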