Borewit / listFix

listFix() - Playlist Repair Done Right
GNU General Public License v3.0

Find the closest matches does not find the closest matches #120

Open touwys opened 1 year ago

touwys commented 1 year ago

This is an issue observed throughout testing:

The matched tracks that listFix() finds often do not even remotely resemble the original track. This does not happen in every instance, but it happens far too often.

The main thrust of the issue, however, is that these total mismatches occur while there are numerous, almost identical, copies known to exist in the Media Directory.

Also refer to the discussion here.


touwys commented 1 year ago

@Borewit:

Should this issue rather be posted at the current PR-release on test?

Borewit commented 1 year ago

No, I suspect it is not introduced by that PR, so better to capture like this.

touwys commented 1 year ago

No, I suspect it is not introduced by that PR, so better to capture like this.

Yes, I'll leave it here, as I have noticed the issue all along while testing the other PR's as well.

touwys commented 1 year ago

@Borewit:

Operationally, the ultimate measure of success of listFix() is going to be how effective it is in finding the closest matches for mismatched playlist tracks. Speed and accuracy are the two essential ingredients. At the moment, speed is sufficient, but accuracy is lacking by a wide margin. Since accuracy is crucial to successful outcomes, I propose that this issue be put up next for a fix.

Borewit commented 1 year ago

You do the groundwork for this one, @touwys: good examples of where the matching algorithm currently fails (does not pick the best result) are very useful.

I have two ideas to improve the matching algorithm:

  1. Prefer the track from the same folder, if a reasonable match is found there
  2. As the names of the parent folder(s) could represent the title or the artist, I think we could take those into account
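
A minimal sketch of how these two ideas might be combined into one score. This is not the listFix() implementation: the class name, the weights, the threshold, and the reading of "same folder" as "the folder the missing playlist entry pointed at" are all assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class FolderAwareScore {
    static final int SAME_FOLDER_BONUS = 2; // hypothetical weight, would need tuning
    static final int REASONABLE_MATCH = 2;  // hypothetical threshold for "a reasonable match"
    static final int PARENT_FOLDERS = 2;    // hypothetical: how many folder levels to read

    /** Lower-cased words from the file name (extension removed) and `parents` folder names above it. */
    static Set<String> words(String path, int parents) {
        String[] segments = path.split("[/\\\\]+"); // accept both '/' and '\' separators
        int last = segments.length - 1;
        int dot = segments[last].lastIndexOf('.');
        if (dot > 0) segments[last] = segments[last].substring(0, dot);
        Set<String> result = new HashSet<>();
        for (int i = Math.max(0, last - parents); i <= last; i++) {
            for (String word : segments[i].toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (!word.isEmpty()) result.add(word);
            }
        }
        return result;
    }

    /** Everything before the last path separator. */
    static String folderOf(String path) {
        int cut = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
        return cut < 0 ? "" : path.substring(0, cut);
    }

    static int score(String playlistEntry, String candidate) {
        // Idea 2: the candidate's parent folder names count as words too.
        Set<String> shared = words(playlistEntry, 0);
        shared.retainAll(words(candidate, PARENT_FOLDERS));
        int score = shared.size();
        // Idea 1: a reasonable match in the same folder gets a bonus.
        if (score >= REASONABLE_MATCH
                && folderOf(playlistEntry).equalsIgnoreCase(folderOf(candidate))) {
            score += SAME_FOLDER_BONUS;
        }
        return score;
    }
}
```

With this sketch, a playlist entry named "Rodriguez - Rich Folks Hoax.flac" shares four words with a candidate stored as "Rodriguez\1970 - Cold Fact\Rich Folks Hoax.flac", instead of three when only the file name is read.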

touwys commented 1 year ago

You do the groundwork for this one, @touwys: good examples of where the matching algorithm currently fails (does not pick the best result) are very useful.

Yes, but whether I'm up to the task is quite another matter. It's like traversing the proverbial labyrinth. A useful start for me would be if you could tap into the existing algorithm and "translate" its current train of reasoning for me. Once I have that as a base, I can measure against it and build upon it. How simple, or complex, do we want this to be?



touwys commented 1 year ago

@Borewit, on second thoughts, we have to give this issue long and careful attention, because there is more to it than casually meets the eye. The standing of listFix() as an app rests entirely on the quality of its search for matching tracks, that is, on how closely the search results match the originals of a fractured playlist. Now, while it is probably a bigger challenge than is worth considering, don't you think that we should replace the current search model with one that uses the music file tags? Searching the tags provides many more options with which to improve the accuracy of the algorithm.

Borewit commented 1 year ago

don't you think that we should replace the current search model with one deploying the music file tags?

That idea crossed my mind a few times. Reading metadata, or alternatively using acoustic fingerprints, are far better ways to identify audio tracks than just the filename, which could be as meaningless as "track 01". However, this information is not available in the playlist, and if the track is missing there is nothing to read from the original track. The only thing we have is the filename.

Borewit commented 1 year ago

A useful start for me would be if you could tap into the existing algorithm, and "translate" its current train of reasoning for me.

Based on: https://github.com/Borewit/listFix/pull/106#issuecomment-1452280802

The "closest match" is purely based on the filename portion of the audio track, so excluding the parent folder.

https://github.com/Borewit/listFix/blob/441fa4e2b8a3c5b8510b95fbfa291897310cf21d/src/main/java/listfix/util/FileNameTokenizer.java#L26-L29

This score function chops the file name into words, e.g. "01 Madonna - Like a Prayer.mp3" becomes something like: ["01", "Madonna", "Like", "Prayer"].
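
As a rough illustration of that chopping step (not the actual FileNameTokenizer code), a tokenizer could look like the sketch below. The class name and the rule that drops one-character words such as "a" are assumptions, chosen only so that the Madonna example comes out as described.

```java
import java.util.ArrayList;
import java.util.List;

final class FileNameTokens {
    /** Splits a file name into word tokens, dropping the extension and one-character tokens. */
    static List<String> tokenize(String fileName) {
        // Strip the extension, if any: "x.mp3" -> "x"
        int dot = fileName.lastIndexOf('.');
        String base = dot > 0 ? fileName.substring(0, dot) : fileName;

        List<String> tokens = new ArrayList<>();
        // Split on anything that is not a letter or a digit (spaces, dashes, underscores, ...)
        for (String part : base.split("[^\\p{L}\\p{N}]+")) {
            if (part.length() > 1) { // assumption: one-character words carry no signal
                tokens.add(part);
            }
        }
        return tokens;
    }
}
```

Calling `FileNameTokens.tokenize("01 Madonna - Like a Prayer.mp3")` yields the four words from the example above.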

The scoreMatchingTokens function then compares these words with those of each track in your library, each of which is converted to a similar list of words. The score is basically calculated by comparing those two sets of words.

https://github.com/Borewit/listFix/blob/a452d74c0cab53ef7f3ea2a42b3185c3e84b59d4/src/main/java/listfix/util/FileNameTokenizer.java#L81

Based on that score the matches are sorted, and the highest scored matches are kept.
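
To make the score-then-sort flow concrete, here is a simplified sketch. It stands in for the real scoring, which is not reproduced here: the score is assumed to be nothing more than the number of shared words, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class TokenOverlapRanker {
    /** Lower-cased word tokens of a file name, without its extension. */
    static Set<String> tokens(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String base = dot > 0 ? fileName.substring(0, dot) : fileName;
        Set<String> result = new HashSet<>();
        for (String part : base.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) result.add(part);
        }
        return result;
    }

    /** Assumed score: the number of words the two file names have in common. */
    static int score(String wanted, String candidate) {
        Set<String> shared = tokens(wanted);
        shared.retainAll(tokens(candidate));
        return shared.size();
    }

    /** Returns up to `limit` candidates, highest score first; ties keep their input order. */
    static List<String> closest(String wanted, List<String> candidates, int limit) {
        List<String> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingInt((String c) -> score(wanted, c)).reversed());
        return sorted.subList(0, Math.min(limit, sorted.size()));
    }
}
```

A plain overlap count like this does not penalise extra or missing words, which is one way a weak candidate can end up ranked next to a good one.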

touwys commented 1 year ago

The "closest match" is purely based on the filename portion of the audio track, so excluding the parent folder.

Thank you. I was in the middle of a lengthy reply to your previous post when this one arrived. I can cut it down to the following:

  1. Would it not be natural to take the parent folder into account?

  2. Since the music metadata is not available, what, if any, other (hopefully useful) data is getting saved along with the filename? Surely, the file creation date, modification date, and, especially, the file size, and such, are also getting saved? (Apart from the actual file content, how else is file-synchronisation achieved?) The question is, if, and how, listFix() can also make use of these during its search to find the closest matches? If, for instance, the file size is also available to us, I think it can be a most useful parameter when comparing files to find a perfect, or close, match.

  3. Another point to consider for improved accuracy, especially in so far as media libraries may contain mixed music file formats (e.g. both FLAC and MP3), is to restrict the search to the original format. This can probably be achieved by an optional setting. More on this later.
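
Point 3 could be sketched roughly as follows. The `sameFormatOnly` flag is a hypothetical setting, not an existing listFix() option, and the format is assumed to be identified by the file extension alone.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

final class FormatFilter {
    /** Lower-cased extension without the dot, or "" when there is none. */
    static String extension(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Narrows the library to the wanted track's format when the hypothetical option is on. */
    static List<String> candidatesFor(String wanted, List<String> library, boolean sameFormatOnly) {
        if (!sameFormatOnly) return library;
        String ext = extension(wanted);
        return library.stream()
                .filter(file -> extension(file).equals(ext))
                .collect(Collectors.toList());
    }
}
```

The filter would run before any scoring, so that only files of the same format compete for the closest match.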


Borewit commented 1 year ago

The point is, there is no original file to compare with @touwys.

Let's assume you have the following M3U playlist:

#M3U
C:\Users\Borewit\Music\Rodriguez - Rich Folks Hoax.flac

Yet that file does not exist at that location, since I moved it to: C:\Users\Borewit\Music\Rodriguez\1970 - Cold Fact\Rich Folks Hoax.flac

So we only have the path: no size, no tags, nothing else. Nor is file size a reliable indicator, as it depends strongly on the encoding of the audio file. You may want to restore a playlist while replacing MP3 tracks with FLAC.
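
A small sketch of what that leaves us with. The class and method names are hypothetical, and the existence check is passed in as a parameter so the example does not depend on a real file system.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class MissingEntries {
    /** Returns the playlist entries for which `exists` reports no file. */
    static List<String> missing(String m3uText, Predicate<String> exists) {
        List<String> result = new ArrayList<>();
        for (String line : m3uText.split("\\R")) { // \R matches any line break
            String entry = line.trim();
            if (entry.isEmpty() || entry.startsWith("#")) continue; // header or comment line
            if (!exists.test(entry)) result.add(entry);
        }
        return result;
    }

    /** The only search key left for a missing entry: the last path segment. */
    static String fileName(String entry) {
        int cut = Math.max(entry.lastIndexOf('/'), entry.lastIndexOf('\\'));
        return entry.substring(cut + 1);
    }
}
```

In practice the predicate could be something like `p -> Files.exists(Paths.get(p))`. For the playlist above, the one missing entry reduces to the string "Rodriguez - Rich Folks Hoax.flac", and that string is all the search has to go on.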

touwys commented 1 year ago

Thanks, thus I stand corrected.



Borewit commented 1 year ago

Thanks, thus I stand corrected.

Then I probably don't understand point 2 of https://github.com/Borewit/listFix/issues/120#issuecomment-1528730764.

touwys commented 1 year ago

Thanks, thus I stand corrected.

Then I probably don't understand point 2 of #120 (comment).

Your reply was spot-on. It is I who am on very unfamiliar terrain when it comes to untangling the intricacies of Windows and other software operations, and how they interconnect.

I took note that there is literally nothing else to work with than the basic filename. If the horse is dead already, how many ways are left to saddle it? The salient question is whether it is still possible to improve the quality of the listFix() search results. The restrictions imposed by the file name ("what" to search for) don't apply to the method of search ("how" to search), and this could be the more fruitful avenue of investigation.
