all grumbling aside, yea, I thought we did an artist_id split as well ... guess that's on me.
first, scope of the problem: why not do a full 20k x 20k distance matrix, all splits aside? I'm curious to know what the full duplicate count is. same goes for artist IDs...
second, fixing it: what's the right corrective action here? I see three possibilities, in increasing correctness:
I think you're suggesting (1) @bmcfee? I don't know which I prefer. (3) sounds like a lot of work, and I think making a point of this at ISMIR is a fine lesson to share.
> first, scope of the problem: why not do a full 20k x 20k distance matrix, all splits aside? I'm curious to know what the full duplicate count is. same goes for artist IDs...
My laptop only has 8GB ram :cry:
The number of unique artist ids in the metadata csv is around 7K though, so interpret that however you like. Since train/test dupes occur at a rate of around 3 in 5000 (0.06%), I'm not too concerned about it as a general problem, but it'd be nice to have it clean from the get-go.
> I think you're suggesting (1) @bmcfee? I don't know which I prefer. (3) sounds like a lot of work, and I think making a point of this at ISMIR is a fine lesson to share.
Indeed, I'd opt for 1. It'll take a little finessing in the splitter script, but I can try to hack that out tomorrow afternoon/evening.
Opt 2 would also be okay, but aesthetically unpleasing.
As for opt 3, I'm not sure. The dupes I'm seeing tend to have the same label patterns (though the confidences differ slightly).
okay, well, turns out #6 / #22 can be helpful for this, which is as good as it is bad.
That branch introduces two CSV files of sample_key: md5 hash maps.
Which means, at the VGGish level, you have 6 pairs / 12 samples that are in some sense affected.
This, of course, only catches exact matches ... so the story could be worse for near-misses. I might be able to dig into this a little more later, but it's back to real work now with hackdays ending.
From that branch / when it comes back to master...
```python
import pandas as pd

# sample_key -> md5 checksum maps for the audio and the VGGish features
df1 = pd.read_csv('tests/data/checksums/openmic-2018-audio.csv', index_col=0)
df2 = pd.read_csv('tests/data/checksums/openmic-2018-vggish.csv', index_col=0)
print(len(df1.md5.unique()), len(df2.md5.unique()))
```
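As a quick follow-up on those checksum maps, a minimal sketch (assuming the same layout as above: sample key as the index, an `md5` column) of pulling out the keys that share a hash:

```python
# Any md5 shared by more than one sample key is an exact-duplicate cluster.
dupes = {md5: list(keys)
         for md5, keys in df2.groupby('md5').groups.items()
         if len(keys) > 1}
print(len(dupes), 'duplicated checksums')
```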
I'd vote for 1. And if it's not too difficult to do, adding artist conditioning to the stratified sampling also seems quite important to me, even at the expense of a somewhat less even annotation distribution.
I agree, artist conditional splits with label stratification would be ideal. (Duplicate tracks within artist are a nuisance, but they're extremely rare so I can live with it.) The problem is that the current tools don't support both simultaneously, and it's not (to me) obvious how to achieve that effect. I'll have some time to think about it this evening.
Alternatively, we could brute-force it by rejection-sampling stratified splits until we get one with no cross-contamination. This would be slow and painful, but in principle it should work as long as there's some feasible split out there.
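For concreteness, a rough sketch of the rejection-sampling idea (not the actual splitter script; `y` here is assumed to be one label per track, e.g. after the majority-class reduction, and `artists` a parallel array of artist ids):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

def rejection_split(y, artists, test_size=5000, max_tries=1000, seed=0):
    """Draw stratified splits until one has no artist on both sides."""
    rng = np.random.RandomState(seed)
    artists = np.asarray(artists)
    X = np.zeros((len(y), 1))  # features don't matter to the splitter
    for _ in range(max_tries):
        sss = StratifiedShuffleSplit(n_splits=1, test_size=test_size,
                                     random_state=rng.randint(2**16))
        train_idx, test_idx = next(sss.split(X, y))
        # Accept only if no artist appears on both sides of the split
        if not set(artists[train_idx]) & set(artists[test_idx]):
            return train_idx, test_idx
    raise RuntimeError('no contamination-free stratified split found')
```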
When I was faced with the same problem, I just treated the group of tracks from the same artist as a single entity weighted by its number of tracks, and used the same method. But that may not work in your label stratification framework. Also, the 15000/5000 split might not come out exactly with this method; it would be more like ~15000/~5000.
If brute force works, go for it. If only a handful of artists end up in both the train and test sets, we can swap them in post-processing, as it won't significantly change the overall distribution.
Brute force seems like it's not going to work. As usual, the problem appears to be this fellow:
```
In [12]: meta.groupby('artist_id')['track_id'].count().sort_values().tail(40)
Out[12]:
artist_id
10117 31
18561 32
11083 32
13061 32
8449 33
19219 33
10341 35
3931 35
10990 36
14774 37
2474 37
22014 38
1640 38
19030 39
16852 39
11099 39
6060 41
11687 42
7079 42
19282 42
3228 45
6443 45
4352 49
8568 49
22472 49
19472 50
77 50
11892 51
13262 53
11431 54
18686 59
22107 59
12351 62
2008 92
7168 112
10012 119
19461 152
15891 153
129 182
12750 227
```
Our man JSB up there (artist 12750) has 227 tracks. Following him are Ergo Phizmiz, Kosta T, and so on.
The odds of a randomized split getting any one of these artists entirely on one side are deep into Rosencrantz and Guildenstern territory. Incidentally, I'm worried that artist-conditioning here will significantly skew our instrument / genre distributions between train and test. Maybe that's unavoidable.
Some more thoughts:
I indeed also had a similar problem with this guy. I think he is linked to more than 30k songs on Spotify. Far above anyone else.
I am not entirely sure I understand the difference between 2. and 3. What I had in mind is: Let's say you want to assign train/test with this input:
(track_id:'001_001234', piano_yes=1, piano_no=0, violin_yes=0, ..., weight=1)
To obtain
(track_id:'001_001234', split:'train', ...) such that sum_{weight \in train} = 15000 and sum_{piano_yes \in train}/15000 ~= sum_{piano_yes \in test}/5000, ...
Then you do the same with this input:
(artist_id:'12750', piano_yes=A, ..., weight=B)
with A = sum_{track_id \in artist_id:'12750'}(piano_yes) and B = sum_{track_id \in artist_id:'12750'}(weight)
To obtain
(artist_id:'12750', split:'train', ...) such that sum_{weight \in train} ~= 15000 and sum_{piano_yes \in train}/15000 ~= sum_{piano_yes \in test}/5000, ...
I feel like this is not very clear, so do as you think is best ;)
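Maybe a rough pandas sketch makes it clearer (the DataFrame `labels`, the Series `artist_split`, and the column names here are just illustrative, not actual code):

```python
# One row per track: per-instrument counts plus weight=1, keyed by artist_id.
# Summing within each artist gives A = sum(piano_yes) and B = sum(weight).
artist_rows = labels.groupby('artist_id').sum()

# Run the same stratified assignment on artist_rows to get a mapping
# artist_id -> 'train'/'test' (artist_split), then push it back to tracks:
labels['split'] = labels['artist_id'].map(artist_split)
```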
> I am not entirely sure I understand the difference between 2. and 3.
Option 2 as I understood it would ignore the labels.
Option 3 would apply the same multi-label -> majority class reduction logic that we currently use at the track level, but at the artist level: each artist is reduced to the most frequent positive instrument association across its tracks (or `_negative` if all associations are negative). There are two main problems that I see here, but I think it's probably our best option.
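A minimal sketch of that reduction, assuming a DataFrame `Y` of per-track positive-association counts (one column per instrument) plus an `artist_id` column; the names are illustrative only:

```python
# Sum positive associations over each artist's tracks, then take the most
# frequent instrument; artists with no positives at all become '_negative'.
per_artist = Y.groupby('artist_id').sum()
artist_class = per_artist.idxmax(axis=1).where(per_artist.sum(axis=1) > 0,
                                               '_negative')
```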
Updates:
I implemented the method described above. I tried a few random splits, and compared the resulting annotation distributions to the overall population. Here's an example of resulting probability ratios:
instrument | test_neg | test_pos | train_neg | train_pos
---|---|---|---|---
accordion | 1.0 | 1.0 | 1.0 | 1.0 |
banjo | 1.0 | 0.9 | 1.0 | 1.0 |
bass | 1.0 | 1.1 | 1.0 | 1.0 |
cello | 0.9 | 1.1 | 1.0 | 1.0 |
clarinet | 1.0 | 1.0 | 1.0 | 1.0 |
cymbals | 1.0 | 1.0 | 1.0 | 1.0 |
drums | 1.0 | 1.0 | 1.0 | 1.0 |
flute | 1.0 | 1.0 | 1.0 | 1.0 |
guitar | 1.0 | 1.0 | 1.0 | 1.0 |
mallet_percussion | 1.0 | 1.0 | 1.0 | 1.0 |
mandolin | 1.0 | 1.1 | 1.0 | 1.0 |
organ | 1.1 | 0.8 | 1.0 | 1.0 |
piano | 0.9 | 1.1 | 1.0 | 1.0 |
saxophone | 0.9 | 1.1 | 1.0 | 1.0 |
synthesizer | 1.0 | 1.0 | 1.0 | 1.0 |
trombone | 1.0 | 1.0 | 1.0 | 1.0 |
trumpet | 1.0 | 1.0 | 1.0 | 1.0 |
ukulele | 1.0 | 1.1 | 1.0 | 1.0 |
violin | 1.0 | 1.0 | 1.0 | 1.0 |
voice | 1.0 | 1.0 | 1.0 | 1.0 |
_negative | 1.0 | 1.0 | 1.0 | 1.0 |
In this case, the biggest deviation was +organ (test), with a probability ratio of ~0.8 relative to the full population. I've seen numbers as low as 0.7, but this gives me an idea: we can over-sample splits and specify a maximum allowed deviation for accepted splits. If we set that number to 0.95, I'm pretty confident that we can get at least one split to come out. Of course, the sample counts don't pop out to exactly the target 15000/5000, but they're generally pretty close.
I'll do a bit more testing on this, and then push up a new script today.
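For reference, the acceptance test itself is just a ratio check; a sketch, assuming `pop`, `p_train`, and `p_test` are per-instrument label probability Series for the full set and the two subsamples (like the table above):

```python
import pandas as pd

def acceptable(pop, p_train, p_test, tol=0.95):
    """Accept a candidate split only if every probability ratio stays
    within [tol, 1/tol] of the full-population distribution."""
    ratios = pd.concat([p_train / pop, p_test / pop])
    return bool(ratios.between(tol, 1.0 / tol).all())
```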
After a bit of feature comparison testing, I'm now running into some genuine metadata errors that break artist-conditional filtering.
| | track_id | album_id | album_title | album_url | artist_id | artist_name |
|---|---|---|---|---|---|---|
9352 | 69465 | 12352 | Classwar Karaoke - 0019 Survey | http://freemusicarchive.org/music/07_Elements/... | 14230 | 07_Elements |
10771 | 82089 | 13967 | Elements 001-012 | http://freemusicarchive.org/music/Anthony_Dono... | 15896 | Anthony Donovan |
| | track_id | album_id | album_title | album_url | artist_id | artist_name |
|---|---|---|---|---|---|---|
10030 | 74328 | 13126 | slavebation | http://freemusicarchive.org/music/christian_cu... | 13491 | Artist Name |
10063 | 74658 | 13178 | Slavebation | http://freemusicarchive.org/music/Christian_Cu... | 15155 | Christian Cummings |
| | track_id | album_id | album_title | album_url | artist_id | artist_name |
|---|---|---|---|---|---|---|
13072 | 104838 | 16424 | Sectioned v4.0 | http://freemusicarchive.org/music/AL355I0/Sect... | 18306 | AL355I0 |
12972 | 103892 | 16322 | Sectioned v4.0 | http://freemusicarchive.org/music/Section_27_N... | 8563 | Section 27 Netlabel |
It never seems to be more than 3 collisions (out of ~5000 test points), so it's not the end of the world. Still, we could avoid this by building a full feature hash, and applying some light manual correction to the artist ids prior to the artist-conditional splits. It still breaks independence assumptions for training/evaluation (which were never really legit to begin with, but c'est la vie), but at least it would rule out obvious dupes.
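For the feature-hash idea, here's a sketch of what I have in mind (assuming the VGGish features for each sample are available as numpy arrays in `X`, aligned with a list `sample_keys`; not the actual loader code):

```python
import hashlib
from collections import defaultdict

# Hash the raw bytes of each sample's feature matrix: identical features
# land on the same digest, which flags exact duplicates.
by_hash = defaultdict(list)
for key, feats in zip(sample_keys, X):
    by_hash[hashlib.md5(feats.tobytes()).hexdigest()].append(key)

collisions = {k: v for k, v in by_hash.items() if len(v) > 1}
```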
Ok, here's a complete list of sample-key feature collisions in the database: (EDIT: corrections)
```
{'007954_184320': ['120195_184320'],
'069465_549120': ['082089_549120'],
'074328_372480': ['074658_372480'],
'074658_372480': ['074328_372480'],
'082089_549120': ['069465_549120'],
'103892_130560': ['104838_130560'],
'104838_130560': ['103892_130560'],
'116011_341760': ['116322_341760'],
'116322_341760': ['116011_341760'],
'116585_46080': ['116586_46080'],
'116586_46080': ['116585_46080'],
 '120195_184320': ['007954_184320']}
```
The good news is that A) the list is small, and B) the collisions happen only in pairs, so there are no complicated higher-order cliques of colliding tracks to deal with.
At this point, what I'll do is map each colliding sample to the artist id of the lexicographically lower `sample_key`, and use that proxy artist id when generating the train-test split. This should fix us up going forward.
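Roughly, the remapping looks like this (assuming the collision dict above is named `collisions` and `artist_ids` is a Series of artist id keyed by sample_key; names are illustrative):

```python
# Both members of a colliding pair adopt the artist id of the
# lexicographically lower sample_key, so the splitter treats them as one.
proxy_artist = artist_ids.copy()
for key, others in collisions.items():
    low = min([key] + others)
    proxy_artist.loc[key] = artist_ids.loc[low]
```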
Aaaaand the metadata-corrected, pseudo-artist-conditional split seems to do the trick!
(Note: no fliers down near the origin on the 1-nn distance distribution.)
Now to script this up, throw in the rejection threshold parameter, and we should be good to go.
And here's a link to the notebook used to generate the sample deduplication index. I don't think this one needs to be converted into a script (use-once process), but it should be archived along with other development notebooks.
@ejhumphrey @simondurand where would be a good place to store the derived dedupe index? Alongside metadata / sparse labels?
New script is implemented and humming along. I can reliably get a pseudo-artist-conditional split with a tolerance of 0.85 (meaning every subsample instrument distribution satisfies 0.85 * p(y) <= p(y | train/test) <= p(y) / 0.85), but getting above 0.90 seems to be challenging. Probably due to JSB.
Inspired by a recent tweet on deduping in CIFAR, I went and performed a similar analysis on our proposed train-test split. We do indeed have some dupes in our dataset, though they seem to be relatively few. Some of them already cross our (current) train-test split though, so we'll need to do some cleanup to prevent that from happening.
Procedure
Results
This plot shows the distribution of distances from test to 1st, 2nd, ..., 5th nearest neighbor from the training set. As you can see from the fliers on column 1, there are a few suspiciously low values. In fact, two of them are identically 0. The offending culprits are
Checking the metadata fields, these appear to be exact duplicates with dirty metadata.
There's one more soft near-match, which has distance 9057 (median is about 54000):
which again, by metadata, seems like a likely match. (The artist ids are identical, but the other metadata is garbage.)
Since an artist-id filter using the metadata would have caught all of these cases, I'm wondering if that's sufficient to fix it? I didn't include this originally because I was under the impression that no artist had more than one track in the collection. Apparently that assumption was incorrect.
I haven't done a similar analysis on the train-train data, but I suspect there are many more dupes than are caught here.
What do you recommend @ejhumphrey @simondurand ? I could try to resolve this in the splitter, but I'm not sure how artist conditioning will play with stratified sampling -- in sklearn splitters, we can do either stratification or grouping, but not both simultaneously. I'll take a look and see if we just got unlucky with this random split, but it'd be nice to have a proper solution in place.
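For completeness, the neighbor analysis above boils down to something like this (a sketch, not the exact notebook code; `X_train` and `X_test` are assumed to hold one VGGish feature vector per track):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Distances from each test track to its 5 nearest training tracks; an exact
# zero in the first column means identical feature vectors across the split.
nn = NearestNeighbors(n_neighbors=5).fit(X_train)
dist, _ = nn.kneighbors(X_test)
exact_dupes = np.where(dist[:, 0] == 0)[0]
```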