adamhathcock / sharpcompress

SharpCompress is a fully managed C# library to deal with many compression types and formats.
MIT License

Determining Encoding of an Archive #742

Open R0315 opened 1 year ago

R0315 commented 1 year ago

I am hoping someone can either help me with this problem or maybe raise it as an issue. I'm working on a small tool to recursively extract nested archives until there are no more archives, but I'm running into issues with encoding. It may just be because I am a novice, but I see no way to dynamically determine the encoding with another library like Ude and then create a readable string out of it.

Right now, I'm seeing a number of zip file entries whose paths decode to symbols and gibberish. It isn't many at all, but I'd much prefer to find a way to dynamically determine what encoding should be used.

I've seen other posts about this, such as #277, and I also see the comment that it was resolved as of 0.18, but I can find no demonstration of how it was resolved.

I will share a snippet of my switch where I have a case to handle zip files so you can see how I am accessing the archive. I have tried specifying various encodings, but what I really need is a way to dynamically determine the correct encoding, or for someone to help me understand something I'm currently missing.

case ".zip":
    using (var archive = ZipArchive.Open(file.FullName))
    {
    foreach (var entry in archive.Entries)
    {
        // perform extraction of the entry
    }
}
break;
adamhathcock commented 1 year ago

Zip files have encoding information in them, but it might not be reliable, so it can be overridden. Something might need to change to follow the spec.
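
(For reference, the spec detail here: zip signals UTF-8 entry names via bit 11, 0x0800, of the general purpose bit flag. A minimal sketch of peeking at that flag on the first local file header, assuming a well-formed, non-empty zip:)

// check whether the first local file header claims UTF-8 names
// (bit 11 of the general purpose bit flag, per the zip APPNOTE)
static bool FirstEntryClaimsUtf8(string zipPath)
{
    using var fs = File.OpenRead(zipPath);
    using var br = new BinaryReader(fs);
    if (br.ReadUInt32() != 0x04034b50) return false; // local file header signature
    br.ReadUInt16();                // version needed to extract
    ushort flags = br.ReadUInt16(); // general purpose bit flag
    return (flags & 0x0800) != 0;   // EFS bit: names and comments are UTF-8
}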

R0315 commented 1 year ago

> Zip files have encoding information in them, but it might not be reliable, so it can be overridden. Something might need to change to follow the spec.

I've been looking into this problem a lot over the last several days and learned this. Apparently the spec says UTF-8, but it's not enforced, so in practice it can be anything else. After thinking about it for a while, I did cook up what could be a solution, and I would welcome and appreciate any feedback on this approach.

It seems to me the issue is that the system that made the zip could use a different encoding than mine. With that in mind, I wondered if a good approach might be to iterate the archive entries, open each entry stream, detect the file's character set, and log how many times each one is detected across the archive entries. At the end, drop ASCII from the list and take the most frequently occurring charset (it seems reasonable to me to conclude this is the author's system encoding), defaulting to UTF-8 if one can't be determined. I welcome any feedback, especially if I am reinventing the wheel, this is just a bad idea, or there is some better or established method. I am using Ude.

This is a snip of how I'm calling the method:

// prep reader options in case of non-standard characters
ReaderOptions opts = new();
var encoding = Encoding.GetEncoding(CheckArchiveEncoding(file));
opts.ArchiveEncoding = new ArchiveEncoding()
{
    // x is the offset and y the length of the name bytes;
    // pass them through so only the name portion is decoded
    CustomDecoder = (data, x, y) => encoding.GetString(data, x, y)
};
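
For completeness, a minimal sketch of handing those options back to SharpCompress (ZipArchive.Open has a ReaderOptions overload; outputDir is a hypothetical destination path):

// open the archive with the detected-encoding options and extract
using (var archive = ZipArchive.Open(file.FullName, opts))
{
    foreach (var entry in archive.Entries.Where(e => !e.IsDirectory))
    {
        entry.WriteToDirectory(outputDir, new ExtractionOptions
        {
            ExtractFullPath = true,
            Overwrite = true
        });
    }
}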

and this is the method itself:

// method for getting the archive encoding
private static int CheckArchiveEncoding(FileInfo file)
{
    // dictionary to store each detected encoding and how often it appears
    Dictionary<string, int> encodings = new();

    // sample at most 1 KB per entry
    int bufferSize = 1024;

    // open the archive
    using var archive = ArchiveFactory.Open(file.FullName);

    // loop over the entries
    foreach (var entry in archive.Entries)
    {
        // skip directories
        if (entry.IsDirectory) continue;

        // open the entry stream and read up to 1 KB of the entry
        using var entryStream = entry.OpenEntryStream();
        byte[] buffer = new byte[bufferSize];
        int bytesRead = entryStream.Read(buffer, 0, bufferSize);
        if (bytesRead == 0) continue;

        // wrap only the bytes actually read and detect the charset
        using var bufferStream = new MemoryStream(buffer, 0, bytesRead);
        CharsetDetector cdet = new();
        cdet.Feed(bufferStream);
        cdet.DataEnd();

        // if a charset was detected, bump its count or add it to the dict
        if (cdet.Charset != null)
        {
            if (encodings.ContainsKey(cdet.Charset))
            {
                encodings[cdet.Charset]++;
            }
            else
            {
                encodings.Add(cdet.Charset, 1);
            }
        }
    }

    // if ASCII appears, remove it (Remove is a no-op if it's absent);
    // UTF-8 will be the fallback
    encodings.Remove("ASCII");

    // check for a most frequent encoding in the archive
    string mostFrequentEncoding = encodings.OrderByDescending(e => e.Value).FirstOrDefault().Key;

    // if a most frequent encoding is determined, return its code page;
    // otherwise default to UTF-8
    if (!string.IsNullOrEmpty(mostFrequentEncoding))
    {
        return Encoding.GetEncoding(mostFrequentEncoding).CodePage;
    }
    else
    {
        return Encoding.UTF8.CodePage;
    }
}
DisIsAbhi commented 1 year ago

@R0315 how did you get CharsetDetector? I have a similar problem where I am trying to extract from Tar.tar in the TestArchives: if I don't provide the ArchiveEncoding as CP437, it extracts with ???.txt as the name. I will create an issue for that, but I just wanted to see if you have had success with the solution you mentioned above.

R0315 commented 1 year ago

> @R0315 how did you get CharsetDetector? I have a similar problem where I am trying to extract from Tar.tar in the TestArchives: if I don't provide the ArchiveEncoding as CP437, it extracts with ???.txt as the name. I will create an issue for that, but I just wanted to see if you have had success with the solution you mentioned above.

I used the Ude.NetStandard package.

I ended up taking a route with my project where, if the program could tell that files within have encoding issues, it left them in the archive and, at the end, generated a report showing what the issue was. I went and grabbed the archive you mentioned and, indeed, my code is unable to tell what the file encoding should be.

I remember having issues using SharpCompress to try to scrape a KB of the file and test the encoding when working with tar files. I think the problem was I couldn't actually access the entry streams within to take a sample. However, in my use case, what I had was adequate to account for a lot of the issues I was trying to solve, so leaving a few anomalous archives and just generating a report was more than satisfactory compared to all the time that was being spent on manual extraction.

However, the heuristic approach I mentioned works very well for any archive where you can access the entry streams and grab a chunk of bytes to test. If you can come up with a way to access the tar entry streams, then you might be able to try the same approach.
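
(Side note on the CP437 case: if you already know the names are CP437, you can skip detection and force it via ReaderOptions. A minimal sketch, assuming the System.Text.Encoding.CodePages package is referenced, since CP437 isn't available by default on .NET Core/.NET 5+:)

// register code pages so CP437 resolves on .NET Core/.NET 5+
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

var opts = new ReaderOptions
{
    ArchiveEncoding = new ArchiveEncoding
    {
        Default = Encoding.GetEncoding(437) // decode entry names as CP437
    }
};
using var archive = ArchiveFactory.Open("Tar.tar", opts);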

DisIsAbhi commented 1 year ago

I have tried your approach and it seems to be working to an extent. While it is not able to identify the encoding as CP437, it is no longer giving out ???.txt for the file name. I am also thinking maybe I will read the filenames as well to see if that can be factored in to decide the encoding, but right now reading 1 KB of the binary is a good start, I guess. Thanks.

R0315 commented 1 year ago

> I have tried your approach and it seems to be working to an extent. While it is not able to identify the encoding as CP437, it is no longer giving out ???.txt for the file name. I am also thinking maybe I will read the filenames as well to see if that can be factored in to decide the encoding, but right now reading 1 KB of the binary is a good start, I guess. Thanks.

Awesome, glad it's helping! Yes, I had a hard time with the encodings. I'm not sure we'll really be able to iron out a 100% solution as long as the archiving software itself doesn't enforce a standard. As I recall, I learned that quite a few archivers just use whatever the user's system encoding is, and that is the reason for this headache.

That said, I'm sure what I did can be improved upon or even replaced with a better approach. If you get something going that works better, please let me know! I'd be happy to apply it to my project too.
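
(One way to fold the filenames in, sketched under the assumption that SharpCompress passes the raw, undecoded name bytes to CustomDecoder: do a throwaway first pass that only collects those bytes, run Ude over them, then reopen with the winner. DetectNameEncoding is a hypothetical helper name:)

// hypothetical helper: detect an encoding from the raw entry-name bytes
static Encoding DetectNameEncoding(string path)
{
    var nameBytes = new List<byte>();
    var probeOpts = new ReaderOptions
    {
        ArchiveEncoding = new ArchiveEncoding
        {
            // record the raw bytes; the decoded string is a throwaway here
            CustomDecoder = (data, offset, length) =>
            {
                nameBytes.AddRange(data.Skip(offset).Take(length));
                return string.Empty;
            }
        }
    };
    using (var probe = ArchiveFactory.Open(path, probeOpts))
    {
        foreach (var _ in probe.Entries) { } // enumerate to force name decoding
    }

    var cdet = new CharsetDetector();
    cdet.Feed(nameBytes.ToArray(), 0, nameBytes.Count);
    cdet.DataEnd();
    return cdet.Charset != null && cdet.Charset != "ASCII"
        ? Encoding.GetEncoding(cdet.Charset)
        : Encoding.UTF8;
}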

adamhathcock commented 1 year ago

Looks like entries can have unique charsets, and the code is assuming the charset is the same for everything?

I've been out of it for a while

R0315 commented 1 year ago

> Looks like entries can have unique charsets, and the code is assuming the charset is the same for everything?
>
> I've been out of it for a while

No worries! Yes, the code I shared scrapes the entry streams of the files and samples them to see what their encoding is. It counts how many times each encoding appears in the archive and, once done, assumes the most frequent one is the one to use, on the reasoning that this is probably the author's system encoding. If it can't determine one, it defaults to UTF-8. If that would still produce file names with problem characters, it leaves them packed up and logs the archive in a report produced at the end. (This project was for batch unpacking archives with nested archives.)

It is working pretty well for my needs so far.