[BUG] Possible whitespace Issue with OnlineHash calculation for stats

Is there an existing issue for this?

[X] I have searched the existing issues and none matched.

Operating System

Windows 10

CPU

Intel

GPU

gefore 3080 ti

Storage

2TB SSD

Game Version

4.19.1

Game Mode

dance

Theme

default

Preabmle

As far as I can tell, OnlineHash is calculated by Steps::GetHash(). This function gets the ASCII string representation of the SM note data, and then calculates its crc32 checksums. The result is then assigned to OnlineHash.

I assume that the purpose of OnlineHash is to identify songs and charts so that score submissions are assigned to the correct versions of charts. Therefore, it is a requirement that two identical charts return the same OnlineHash value

Issue

It seems that m_sNoteDataCompressed can be populated in other ways than NoteDataUtil::GetSMNoteDataString. The result is that some different arrangements of whitespace characters occur in otherwise equivalents step patterns. For example a particular line could have either 0000\r\n or 0000\n. I've seen trailing whitespace show itself on some random lines also.

Because the OnlineHash is just the crc32 of this note data string, we're going to end up getting different hashes for equivalent charts.

Assuming this were fixed, there are other problems. OnlineHash uses only the step data, not the timing data.

Proposed Solution

Change all \r\n and \r to \n and strip all whitespace and comments for the note data string that gets hashed.

If we care a lot about reducing as many collisions as possible, then OnlineHash should also be updated to digest the timing data. In my opinion, the one exception would be Beat0Offset, since it does not degrade the fundamental nature of a chart.

Another minor concern is that crc32 hashes are not secure. However, there will never truly be a way to stop fake scores from being submitted, so this may be a moot point. But a high-bit hash algorithm will further reduce the chances of collisions. And it doesn't hurt to use a secure hash algorithm. Git commits weren't originally designed to be securely hashed, but it proved to be very useful. An ounce of prevention now could be worth a pound of cure in the future.

TeamRizu / OutFox