Open ni-nkozlowski opened 3 years ago
AFAIK these utf8 conversion problems are all fixed already. We added some extensive tests to ensure files with unicode characters can be loaded. The folder ni-media/audiostream/test/test_files/user_files
contains all sorts of files with unicode characters... and the unit tests loads these files as expected
'L1b_mp3-123456789äöü□'$'\204''□'$'\226''□'$'\234''!@#$%^&_(□'$'\203\225''□'$'\202''□k□'$'\202''□□'$'\201\214'')-ah7Hfds.mp3'
'L1m4a-123456789äöü□'$'\204''□'$'\226''□'$'\234''!@#$%^&_(□'$'\203\225''□'$'\202''□k□'$'\202''□□'$'\201\214'').m4a'
'unicode123456789äöü□'$'\204''□'$'\226''□'$'\234''□'$'\237''!@#$%^&_(□'$'\203\225''□'$'\202''□k□'$'\202''□□'$'\201\214'')-ah7Hfds.aac.m4a'
'unicode123456789äöü□'$'\204''□'$'\226''□'$'\234''□'$'\237''!@#$%^&_(□'$'\203\225''□'$'\202''□k□'$'\202''□□'$'\201\214'')-ah7Hfds.aiff'
'unicode123456789äöü□'$'\204''□'$'\226''□'$'\234''□'$'\237''!@#$%^&_(□'$'\203\225''□'$'\202''□k□'$'\202''□□'$'\201\214'')-ah7Hfds.alac.m4a'
'unicode123456789äöü□'$'\204''□'$'\226''□'$'\234''□'$'\237''!@#$%^&_(□'$'\203\225''□'$'\202''□k□'$'\202''□□'$'\201\214'')-ah7Hfds.flac'
'unicode123456789äöü□'$'\204''□'$'\226''□'$'\234''□'$'\237''!@#$%^&_(□'$'\203\225''□'$'\202''□k□'$'\202''□□'$'\201\214'')-ah7Hfds.mp3'
'unicode123456789äöü□'$'\204''□'$'\226''□'$'\234''□'$'\237''!@#$%^&_(□'$'\203\225''□'$'\202''□k□'$'\202''□□'$'\201\214'')-ah7Hfds.ogg'
'unicode123456789äöü□'$'\204''□'$'\226''□'$'\234''□'$'\237''!@#$%^&_(□'$'\203\225''□'$'\202''□k□'$'\202''□□'$'\201\214'')-ah7Hfds.wav'
Perhaps the call site is missing the proper locale initialization? See https://github.com/NativeInstruments/ni-media/blob/master/audiostream/test/ni/media/test_helper.cpp#L46. This is needed to properly interface with windows APIs (which use utf16 encoded wchar_t). Deep down in boost::iostreams, the sources / sinks will convert char <-> wchar_t and expect the proper conversion to take place.
I am not sure switching to std::filesystem::path
will change anything to that behavior.
Maybe we should document this as a requirement on Windows? Or maybe we should use our own facet inside of ni-media, but I don't know if this even works, and if it's a good idea. Tbh I lost a bit track about the current state of utf8 in cpp, and what's best practice. Actually, I just noticed that https://en.cppreference.com/w/cpp/locale/codecvt_utf8_utf16 is already deprecated in C++17... that didn't last long :)
CC @C4rsten
Perhaps the call site is missing the proper locale initialization?
Yes. This is how we fixed it on the user side. If we use our own facet, it's one less detail the user has to worry about, so I would be up for that. Though it seems like we would have to roll our own, as you pointed out std::codecvt
has already been deprecated. 🤷
There's a PR open right now related to this issue as well where a path
interface was added, but it also swaps out the backend from using boost::filesystem
to std::filesystem
.
Yes I agree, this is a hidden requirement and totally not obvious for clients. I would treat this issue separately from the C++ 17 std::filesystem support though.
I would suggest the following approach:
Heyhey, thanks for looking into this!
I would like to help execute this but I'm not yet 100% clear on how these steps would look like exactly.
For Step 1 you're envisioning something like this? (really not sure about the formatting here)
class aiff_file_source : public aiff_source<boost::iostreams::file_descriptor_source>
{
using base_type = aiff_source<boost::iostreams::file_descriptor_source>;
#if BOOST_OS_WINDOWS
using converter = std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>>;
#endif
public:
explicit aiff_file_source( const std::string& path )
#if BOOST_OS_WINDOWS
: base_type( converter{}.from_bytes(path), BOOST_IOS::binary | BOOST_IOS::in )
{}
#else
: base_type( path, BOOST_IOS::binary | BOOST_IOS::in )
{}
#endif
};
Step 2 is then using fs::path internally:
class aiff_file_source : public aiff_source<boost::iostreams::file_descriptor_source>
{
using base_type = aiff_source<boost::iostreams::file_descriptor_source>;
public:
explicit aiff_file_source( const std::string& path )
: base_type( std::filesystem::u8path(path), BOOST_IOS::binary | BOOST_IOS::in )
{}
};
and Step 3 is then just using fs::path as the input param type as done in the PR but without using u8path
.
Is that somewhat along the lines of what you had in mind or am I completely going in the wrong direction?
As stated initially I would avoid string conversions entirely. Instead on Windows use std::wstring
and deprecate the std::string
interface to not break compatibility. I agree with the following steps: adding std::filesystem
support eventually.
On Windows, conversions between
std::string
andpath
interfaces results in UTF-8 strings being converted to ANSI, resulting in non-ANSI characters being lost in the conversion process. This results in files not being loaded correctly because they'll fail our valid path checks.