PigSaint / GameDataBase

Detailed videogame database for sorting purposes in emulation

Feature request: xxHash checksums #1

Open danmons opened 4 months ago

danmons commented 4 months ago

xxHash is a very fast, open-source, portable (and already widely ported), stable hash/checksum library and family of algorithms.

It is used in large-scale production tools such as various file systems in the Linux kernel, data deduplication systems, etc., to verify the integrity of very large sets of data worldwide (I rely on it daily in my profession for multi-petabyte workloads). Many large companies that now require data integrity checking as part of their workflow accept and recommend xxHash as the preferred algorithm.

The project started life in 2012 and continues to see an active developer community today, with various algorithms within the larger library stabilised and standardised over time to be long-term (decades/centuries) usable.

The xxHash3 128-bit "XXH128" algorithm provides collision resistance similar to (if not better than) both md5sum and sha1sum, but does so an order of magnitude faster (or better), even on low-end hardware (with vectorisation enhancements to accelerate it even further on newer CPUs). Benchmarks and comparisons are provided in the URLs above.

While this is "yet another checksum" to support and catalogue, and one that isn't certified as cryptographic (acknowledging that both md5sum and sha1sum are now "broken", so they are no longer cryptographically secure either), the speed benefits of using it for tasks such as large-scale ROM sorting, renaming and verifying result in quantifiable time and energy/power savings. This is particularly evident now that older checksums have become a bottleneck, as modern consumer disk performance exceeds their maximum throughput by orders of magnitude.

I would love to see any project that is attempting to put together comprehensive checksum collections begin to support this, particularly as digital collections grow in scale.

PigSaint commented 3 months ago

Hi, danmons!

Thank you for your request. I read it and I am interested because I love FOSS initiatives, but I have questions.

My knowledge of hashes is really limited (I'm a complete layperson, actually). I added these four columns at the end of each line because a few developers requested them when I started the project, weeks before it became public. I'm not a software developer but a graphic designer, and I don't even know why being cryptographic matters in this context, but I want this project to be as useful (and free) as possible. I know what hashes are for, but as far as I can remember, I've never used one to check a file in my life. All I do to obtain these codes is visit WASM File Hash Online Calculator and get all this alphanumeric stuff by copy/paste, with some cheap text-replacement magic tricks. It's a completely empirical process for me. I don't really know what I'm doing; I just know this is useful for developers.

As you can see, 6000 lines have already been made in GameDataBase. Now, question time. Do you know any way to add the xxHash hashes without entering them one by one at the end of each line? I'm talking about automation. Currently, I obtain the ROM pack and upload every file to that page (in bulk, of course), run the process, and get a list of hashes at the end. Then I add all the information game by game, in a specific order which isn't alphabetical but regional, so it differs from the order of the ROM pack. Some files even appear on several lines because they are identical files from different regions and packages. I can't imagine a method to automate adding xxHash or any other hash, unless the hash codes already included can assist. Is there any way to do that?

Thank you again for your request and sorry if my English is not correct.

wizzomafizzo commented 3 months ago

as long as you still have your reference game files, it will be possible to automate this process adding the extra hash. i think it's a good idea too. i don't think you need to stress about doing it immediately if you feel it's going to mess with your current workflow
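A minimal sketch of how that automation could look, using the MD5 values already in the database as the key. Everything here is an assumption for illustration: the database lines are treated as tab-separated, `md5_col` is a hypothetical column position, and `build_lookup` / `append_hash_column` are made-up helper names. The new hash is passed in as a factory so the same code works with XXH128 from the third-party `xxhash` Python package, if it is installed.

```python
import hashlib
from pathlib import Path

try:
    import xxhash  # third-party package (pip install xxhash); optional here
except ImportError:
    xxhash = None


def file_digests(path, new_hash_factory, chunk_size=1 << 20):
    """Read a file once, in chunks, and return (md5_hex, new_hash_hex)."""
    md5 = hashlib.md5()
    new = new_hash_factory()
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            md5.update(chunk)
            new.update(chunk)
    return md5.hexdigest(), new.hexdigest()


def build_lookup(rom_dir, new_hash_factory):
    """Map the MD5 of every reference file under rom_dir to its new hash."""
    lookup = {}
    for path in Path(rom_dir).rglob("*"):
        if path.is_file():
            md5, new = file_digests(path, new_hash_factory)
            lookup[md5] = new
    return lookup


def append_hash_column(lines, lookup, md5_col):
    """Append the new hash to each tab-separated line, matched by its MD5.

    Line order is left untouched. Lines whose MD5 is missing or unknown
    get an empty field, so they are easy to find and fix by hand later.
    """
    out = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        try:
            md5 = fields[md5_col].strip().lower()
        except IndexError:
            md5 = ""
        out.append("\t".join(fields + [lookup.get(md5, "")]))
    return out
```

With the `xxhash` package installed, `build_lookup("roms/", xxhash.xxh3_128)` would produce the MD5-to-XXH128 table (the `roms/` path is a placeholder). Because rows are matched by hash rather than by file name or position, the regional ordering is preserved, and identical files listed on several lines simply receive the same value.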

PigSaint commented 3 months ago

As I always say, my intention is to fill these lists with the most complete, accurate and useful metadata possible. There are a lot of things I haven't added (yet) because they would take a lot of extra time, but I want to work on them once most of the systems are complete. To be specific, I'm talking about serial numbers, age ratings, or even game staff. For me, these are secondary. In addition, we obviously need quality graphic resources (logos, covers, etc). I often think of the IMDb website. Movie buffs are years ahead of us. I want to build solid foundations that allow us to grow in many directions, with a lot of different projects, with or without me. When I started this project I thought that the work was already done and was merely decentralized. That is in fact the case for a few systems, which have some really decent wikis full of page-by-page info. But as I have progressed I have realized that this is not the case at all in general. We don't even have a solid structure to build something big. And we need it.

With hashes we have to solve one important problem: disc games. It is really easy to add a hash code for a single ROM. Our community has done a great job here for years, verifying every file and packaging them into increasingly reliable collections. You can download a ROM from ten different reliable sites and the files are exactly the same, so their hashes are exactly the same. No problem here. That is not the case for CD games, because there are several correct ways to make a disc image, and each gives a different hash. I think CHD files are the way, but you can find (or even create your own) CHD files that differ from other CHD files and are still correct. If hashes are not useful in this case, perhaps I should skip them for disc games until we collectively come to a conclusion. Yeah, this is the kind of thing I keep in mind when I'm working on it. Hahaha. Always looking to the future.
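A small illustration of the disc image problem described above, as a toy sketch rather than a real dumping tool. It builds the same four sectors of made-up data twice: once as plain 2048-byte sectors (as in an ISO-style image) and once wrapped in simplified 2352-byte raw sectors (as in a BIN-style image). The sector header and the zero-filled error-correction area are placeholders, and `raw_sector` / `user_data_from_raw` are hypothetical helper names.

```python
import hashlib

SECTOR_RAW = 2352    # bytes per raw CD sector
SECTOR_DATA = 2048   # bytes of user data in a Mode 1 sector
SYNC = b"\x00" + b"\xff" * 10 + b"\x00"  # 12-byte sync pattern


def raw_sector(user_data: bytes, index: int) -> bytes:
    """Wrap 2048 bytes of user data in a simplified 2352-byte sector."""
    assert len(user_data) == SECTOR_DATA
    header = index.to_bytes(3, "big") + b"\x01"  # placeholder address + mode byte
    ecc_area = bytes(288)                        # EDC/ECC left as zeros in this toy
    return SYNC + header + user_data + ecc_area


def user_data_from_raw(image: bytes) -> bytes:
    """Pull the 2048-byte user-data area out of every raw sector."""
    return b"".join(
        image[off + 16 : off + 16 + SECTOR_DATA]
        for off in range(0, len(image), SECTOR_RAW)
    )


def file_md5(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


# The same made-up "game data", stored in two different image layouts.
payload = [bytes([i]) * SECTOR_DATA for i in range(4)]
iso_image = b"".join(payload)
bin_image = b"".join(raw_sector(s, i) for i, s in enumerate(payload))

# Whole-file hashes differ even though the game data is identical...
print(file_md5(iso_image) == file_md5(bin_image))
# ...while hashing only the extracted user data gives matching values.
print(file_md5(user_data_from_raw(bin_image)) == file_md5(iso_image))
```

The point is only that a whole-file hash identifies a particular image file, not the disc contents, so any disc hash column would need an agreed image format (or an agreed way of hashing the contents) before the values become comparable between collections.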