bmcfee / crema

convolutional and recurrent estimators for music analysis
BSD 2-Clause "Simplified" License

Custom implementation using encoder part #39

Closed: Jonohas closed this issue 1 year ago

Jonohas commented 1 year ago

I am working on developing a model that can analyze and compare songs to determine whether one has sampled the other. I discovered a research paper that aligns well with this goal, and after some exploration I arrived at this repo. I am curious how I can extract the encoded values from the pre-trained model provided here.

It seems to me that the only thing I have to do is pop the last two layers from the model, as described here.

Am I going in the right direction?

bmcfee commented 1 year ago

crema does provide an interface for deriving features / activations directly. The result is a dictionary mapping output names (tag, pitches, root, bass) to numpy arrays with the model output values (class likelihoods) for each time step.

Example:

In [1]: import crema

In [2]: import librosa

In [3]: model = crema.models.chord.ChordModel()

In [4]: features = model.outputs(filename=librosa.ex('brahms'))

In [5]: features
Out[5]: 
{'chord_tag': array([[4.3023601e-03, 4.9892231e-05, 3.5169516e-05, ..., 2.2896288e-03,
         7.4055344e-02, 2.5491721e-03],
        [1.3115157e-03, 1.7348004e-05, 1.8500274e-05, ..., 5.7797926e-03,
         7.8810960e-02, 4.2912671e-03],
        [2.5451521e-04, 4.1380504e-06, 6.1424471e-06, ..., 8.6798873e-03,
         7.6920292e-03, 2.6421892e-03],
        ...,
        [4.4535976e-05, 3.2534017e-06, 2.0313148e-06, ..., 3.6624442e-05,
         9.8973185e-01, 3.5569705e-05],
        [5.2874017e-05, 4.1927369e-06, 3.0838921e-06, ..., 3.8530146e-05,
         9.8953003e-01, 3.4048717e-05],
        [7.5042029e-05, 3.8655908e-06, 5.0735675e-06, ..., 5.0219223e-05,
         9.8538220e-01, 3.6144585e-05]], dtype=float32),
 'chord_pitch': array([[0.1635662 , 0.01280352, 0.24643663, ..., 0.01923487, 0.3156973 ,
         0.0091745 ],
        [0.1336295 , 0.00668535, 0.40660134, ..., 0.02177402, 0.26932114,
         0.00967577],
        [0.045084  , 0.00547308, 0.69742846, ..., 0.0060927 , 0.38758662,
         0.01631224],
        ...,
        [0.0390096 , 0.00611022, 0.13638929, ..., 0.02007931, 0.03636965,
         0.01174816],
        [0.04007465, 0.00648808, 0.14079157, ..., 0.01947397, 0.03879732,
         0.01251665],
        [0.05498511, 0.00731987, 0.13454139, ..., 0.02149376, 0.04942104,
         0.01243559]], dtype=float32),
 'chord_root': array([[2.7323613e-01, 1.4166640e-02, 1.8405104e-02, ..., 2.4535147e-02,
         1.9818617e-03, 5.0903581e-02],
        [1.3547663e-01, 5.4969341e-03, 1.6408626e-02, ..., 1.4272679e-02,
         9.1914646e-04, 4.1961312e-02],
        [4.1177709e-02, 1.2524052e-03, 5.2016061e-03, ..., 3.8624653e-03,
         2.9413245e-04, 5.7008057e-03],
        ...,
        [1.2547920e-03, 3.3207625e-04, 3.8576757e-03, ..., 5.3242833e-04,
         9.6493182e-05, 9.8741525e-01],
        [1.2485510e-03, 3.6570901e-04, 3.5472244e-03, ..., 5.5680086e-04,
         1.1194415e-04, 9.8750532e-01],
        [1.9181310e-03, 4.3262102e-04, 4.3804930e-03, ..., 6.9076067e-04,
         1.4670097e-04, 9.8343408e-01]], dtype=float32),
 'chord_bass': array([[8.3911844e-02, 1.6980510e-02, 2.1775467e-02, ..., 2.1875802e-02,
         2.5467703e-03, 4.4179551e-02],
        [3.1749886e-02, 4.8876167e-03, 1.8029789e-02, ..., 7.5807236e-03,
         7.5101142e-04, 2.6730081e-02],
        [5.5555804e-03, 8.9264929e-04, 5.2044787e-03, ..., 1.5763103e-03,
         2.1533313e-04, 2.6487890e-03],
        ...,
        [1.1341659e-03, 2.2324428e-04, 2.7466586e-03, ..., 4.4924239e-04,
         9.0148700e-05, 9.8818821e-01],
        [1.0704539e-03, 2.3396697e-04, 2.4984695e-03, ..., 4.4803572e-04,
         1.0931564e-04, 9.8869842e-01],
        [1.6342687e-03, 2.7579299e-04, 3.2657972e-03, ..., 5.1576976e-04,
         1.4697264e-04, 9.8482496e-01]], dtype=float32)}
Jonohas commented 1 year ago

Is this the output of the encoder?

bmcfee commented 1 year ago

It's the collection of all of the network's output heads: the green and orange blocks in fig 3 of https://brianmcfee.net/papers/ismir2017_chord.pdf.

I believe the work you referred to above mostly uses the "pitch" output head, and the rest can be ignored.
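
E.g., to pull out just the pitch head from that dictionary (a minimal sketch; the shape comment is an expectation based on treating the pitch head as a 12-bin pitch-class profile, and librosa.ex('brahms') just stands in for a real audio file):

import crema
import librosa

model = crema.models.chord.ChordModel()
features = model.outputs(filename=librosa.ex('brahms'))

# The pitch head is the 'chord_pitch' entry: one pitch-class
# activation vector per analysis frame.
pcp = features['chord_pitch']
print(pcp.shape)  # expected (n_frames, 12)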

Jonohas commented 1 year ago

I am mostly interested in the last part of fig 1. I believe that would be the output of the brown layer in fig 3, correct?

bmcfee commented 1 year ago

Oh, you mean the GRU hidden states before they are transformed into pitch/chord classes. I don't think that's what MOVE/Re-MOVE use, and crema does not expose that part through the API. If you want it, you'll need to hack the underlying Keras model.
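
Very roughly, the surgery would look something like the sketch below. Fair warning: this is untested, and model.model plus the layer lookup by type are implementation details rather than public API, so they may change between versions.

import crema
from tensorflow import keras  # adjust if your install uses standalone keras

model = crema.models.chord.ChordModel()
net = model.model  # the underlying keras model (implementation detail)

# Find the recurrent layer by type rather than by name, since layer
# names may differ across crema versions.
rnn = next(layer for layer in net.layers
           if isinstance(layer, (keras.layers.GRU, keras.layers.Bidirectional)))

# Build a sub-model that stops at the RNN output, i.e. the hidden
# state sequence before the pitch/chord classification heads.
encoder = keras.Model(inputs=net.inputs, outputs=rnn.output)

# Feeding it requires the same pump features crema computes internally;
# see how CremaModel.outputs() assembles its model inputs in crema/base.py.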

Jonohas commented 1 year ago

They refer to crema-PCP; are you familiar with that? Would that be the "pitch" output you mentioned? I'm sorry, it is hard to follow the connections between all the papers.

bmcfee commented 1 year ago

"crema-pcp" is their name for the pitch output head.

Jonohas commented 1 year ago

Oh, I see. I misunderstood. Thank you for your patience with me. I'll be closing this then.