TheDiscDb / data

Source data for thediscdb
MIT License

ImportBuddy Log Cleanup removes important information #27

Open ociaw opened 1 week ago

ociaw commented 1 week ago

I understand that the data for TheDiscDB isn't necessarily intended to be consumed by third parties. However, informational messages are being stripped from the MakeMKV logs that I believe are important to keep. Specifically:

MSG:1005, ex.

MSG:1005:MakeMKV v1.16.4 win(x86-release) started

This message contains the version number of MakeMKV. Helpful, but not essential.

MSG:3025 and MSG:3016 show whether any titles were skipped, which lets us determine whether an import is incomplete.

MSG:3040 and MSG:3028, ex.

MSG:3040,0,2,"Angle #4 was added for title #16","Angle #%1 was added for title #%2","4","16"

MSG:3028,0,3,"Title #17/1 was added (1 cell(s), 0:05:37)","Title #%1 was added (%2 cell(s), %3)","17/1","1","0:05:37"

For DVDs, these messages are the only way to determine whether there are angles for a given title. Since MakeMKV can produce multiple DVD titles with an identical title ID and angle, we need to disambiguate those. This information is only available in the first argument of the MSG:3040 log entry. Otherwise, we have no way of knowing which titles are supposed to be different angles of each other.

Similarly, MSG:3028 lets us know which titles are sub-titles (?) with compound IDs. I'm not really sure what these are, TBH, but they do exist.

Disc 2 of Toy Story / The Ultimate Toy Box is a good example of a disc with lots of angles and compound IDs.

Unfortunately, these have been removed in https://github.com/TheDiscDb/data/commit/5629c462fcb4e32b836b750d6df7b022a3585629, and ImportBuddy strips these messages as well. Is it possible to have these messages restored? Or perhaps this data can be parsed from the logs and stored in the JSON file instead?
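As a rough illustration of the "parse it from the logs" option, here is a minimal Python sketch that pulls angle and compound-ID information out of MSG:3040 and MSG:3028 lines. It assumes the comma-separated layout seen in the examples above (code, two numeric fields, message, format string, then parameters). The function names and the output shape are hypothetical, not anything ImportBuddy or the JSON schema currently defines.

```python
import csv
import json

def parse_msg(line):
    """Split one MSG line into its code, message text, and parameters.

    Assumes the layout shown in the examples above:
    MSG:code,n,n,"message","format","param1","param2",...
    Returns None for lines that do not match that layout.
    """
    if not line.startswith("MSG:"):
        return None
    fields = next(csv.reader([line[len("MSG:"):]]))
    if len(fields) < 5 or not fields[0].isdigit():
        return None
    return {"code": int(fields[0]), "message": fields[3], "params": fields[5:]}

def angles_and_compound_titles(lines):
    """Collect (title, angle) pairs from MSG:3040 and compound IDs from MSG:3028."""
    angles = []
    compound_ids = []
    for raw in lines:
        msg = parse_msg(raw.strip())
        if msg is None:
            continue
        params = msg["params"]
        if msg["code"] == 3040 and len(params) >= 2:
            # First parameter is the angle, second is the title.
            angles.append({"title": params[1], "angle": params[0]})
        elif msg["code"] == 3028 and params and "/" in params[0]:
            # Titles such as "17/1" are the compound IDs mentioned above.
            compound_ids.append(params[0])
    return {"angles": angles, "compound_title_ids": compound_ids}

if __name__ == "__main__":
    sample = [
        'MSG:3040,0,2,"Angle #4 was added for title #16","Angle #%1 was added for title #%2","4","16"',
        'MSG:3028,0,3,"Title #17/1 was added (1 cell(s), 0:05:37)","Title #%1 was added (%2 cell(s), %3)","17/1","1","0:05:37"',
    ]
    print(json.dumps(angles_and_compound_titles(sample), indent=2))
```

Reading the parameters rather than the formatted message text avoids depending on the exact wording of the message.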

lfoust commented 1 week ago

I am certainly open to keeping those useful log lines in the MakeMKV logs. They are currently being removed to avoid storing file paths specific to the user's machine. If we can redact or just remove the specific lines that contain local paths, that would work for me. Also, I do not have time to restore the lines removed in the commit you referenced above, but I would be willing to review/commit a PR to restore those. For the items imported since that commit, there is no way to recover the removed lines.

ociaw commented 4 days ago

Great. I hadn't thought of the need to redact personal info - which log entries do you think might need to be scrubbed or redacted? Obviously MSG:1004, since it may reveal the user's home directory name.

DRV entries may also be problematic. These give us the drive name, disc name, and drive letter. Drive name and disc name can be useful, but we'd probably want to scrub drive letters/paths. The drive name seems to sometimes include the serial number of the drive, which is probably not good? The disc name is probably ok, unless there are multiple drives - then there's potentially an issue where it'd expose the name of a different disc that isn't being ripped. I suppose that if the active drive can be determined then the other drives can be redacted.

MSG:2003 also may contain the name of the drive (and hence the serial number). MSG:3338 contains the user's home directory. MSG:3344 contains the location of the java runtime, but that probably isn't an issue?
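To make the idea concrete, a selective scrub could look something like the Python sketch below. It is only a sketch under assumptions: the set of codes to drop is just the ones named in this thread so far, the DRV field order (index, three numeric fields, drive name, disc name, then any trailing path fields) is assumed rather than confirmed here, and `redact_line` is a hypothetical helper, not existing ImportBuddy code.

```python
import csv

# Assumption: the MSG codes identified in this thread as carrying local paths.
PATH_CODES = {"1004", "3338"}

def redact_line(line):
    """Return the line with local details removed, or None to drop it entirely."""
    prefix, sep, rest = line.partition(":")
    if not sep:
        return line
    if prefix == "MSG":
        # The code is everything before the first comma.
        code = rest.split(",", 1)[0]
        return None if code in PATH_CODES else line
    if prefix == "DRV":
        fields = next(csv.reader([rest]))
        if len(fields) < 6:
            return line
        # Assumed layout: index,n,n,n,"drive name","disc name"[,"path"...]
        # Keep the numeric fields and the disc name; blank the drive name
        # (may include a serial number) and any trailing path fields.
        head = fields[:4]
        disc_name = '"%s"' % fields[5]
        blanks = ['""'] * (len(fields) - 6)
        return "DRV:" + ",".join(head + ['""', disc_name] + blanks)
    return line

def redact_log(lines):
    """Apply redact_line to every line, skipping the ones that were dropped."""
    return [out for out in map(redact_line, lines) if out is not None]
```

This keeps every other line intact, so MSG:1005, MSG:3028, and MSG:3040 survive. It does not yet handle MSG:2003 or the disc names of drives that aren't being ripped. Both would need a decision on how to identify the active drive.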

Any other known problematic log entries?