bmcfee / pumpp

practically universal music pre-processor
ISC License

vector #102

Open jeffd75 opened 5 years ago

jeffd75 commented 5 years ago

Description

Strange behavior when trying to use JAMS vector annotations in pumpp.

Steps/Code to Reproduce

```python
from __future__ import division, print_function, unicode_literals

import numpy as np
import jams
import pumpp
from librosa import note_to_hz

audio_f = 'VSL440Rev0.aif'
jams_f = 'VSL440Rev0.jams'
sr, hop_length = 44100, 512

p_cqt = pumpp.feature.CQT(name='cqt', sr=sr, hop_length=hop_length,
                          n_octaves=8, over_sample=1,
                          fmin=note_to_hz('C2'), log=True)

p_vector = pumpp.task.VectorTransformer(name="classes", namespace="vector",
                                        dimension=6, dtype=np.int32)

pump = pumpp.Pump(p_cqt, p_vector)
data = pump(audio_f=audio_f, jam=jams_f)

print(data['cqt/mag'].shape)
print(data['cqt/phase'].shape)
print(data['classes/vector'].shape)
print(data['classes/vector'])
print(data['classes/_valid'].shape)
print(data['classes/_valid'])
```

Expected Results

My 2157-second audio has been annotated with a 6-dimensional vector using the JAMS format. I ask for a 96-bin CQT and I get two tensors with 185815 frames of magnitude and phase CQT; so far so good. But then when I look at the (vector) annotations, I have only one frame which matches the first annotation in my file, and that's it. I was obviously expecting a tensor spanning the whole duration, possibly with a smaller number of valid frames, shaped say 185652 x 6.

I used pumpp.task.VectorTransformer because it seemed the most obvious candidate for the job. But I saw that it inherits from a class called BaseTaskTransformer, whose init forces sr and hop_length to 1. I tried to change them with these two lines of code: p_vector.sr = sr and p_vector.hop_length = hop_length, but it did not change the result.

What am I missing? If this is not the right class for processing vector annotations, please tell me which one is. ;-)

Actual Results

```
(1, 185815, 96)
(1, 185815, 96)
(1, 6)
[[ 0 2 0 3 10 -1]]
(1, 2)
[[ 0 185652]]
```

Versions

Darwin-16.7.0-x86_64-i386-64bit
Python 3.6.4 (v3.6.4:d48ecebad5, Dec 18 2017, 21:07:28) [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
NumPy 1.13.1
SciPy 1.2.0
librosa 0.6.2

Here is my actual JAMS vector annotation file (I had to add the .txt extension to upload it here): VSL440Rev0.jams.txt. NB: its syntax was checked with jams.load and it is correct.

bmcfee commented 5 years ago

But then when I look at the (vector) annotations, I have only one frame which matches the first annotation in my file, and that's it.

That's the intended behavior, sorry if it's not documented well enough. (This is still very much a WIP!)

The intended use case here is for things like parametric embedding approximation, where you map an entire recording down to a fixed-dimensional vector. This comes up in things like latent factor approximation in collaborative filtering.

If you want to broadcast the vector out over time, you'd have to know the target extent (number of frames), which isn't generally known with independent transformers. Eg your vector transformer would have to know about the CQT transformer inside the pump, and they don't currently support that kind of behavior. Probably your best bet is to reshape it at run-time (eg sampling during training), or design your model to be time-invariant to cut down on redundant memory consumption.
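For illustration, a minimal run-time sketch of that broadcasting (this is not part of pumpp; it simply tiles the static vector from the pump output above across the CQT's time axis with numpy):

```python
import numpy as np

# Tile the single static vector over the feature frames at run-time.
n_frames = data['cqt/mag'].shape[1]       # 185815 frames in the example above
static = data['classes/vector']           # shape (1, 6)
dense = np.repeat(static[:, np.newaxis, :], n_frames, axis=1)
print(dense.shape)                        # (1, 185815, 6)
```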

jeffd75 commented 5 years ago

Let me explain my problem more generally; maybe you will be able to help me. I want to annotate audio with (contemporary) instrument playing techniques (PT). I have started with the cello. Possible cello PTs would be, for instance: pizzicato near the bridge, or artificial harmonics with the bow plus tremolo, and so forth. Cello PTs in my model are discrete and 4-dimensional. I have added 2 dimensions for less important context data, hence the 6-dim vector. In the end the goal is to train a deep CNN with a multi-task approach.

I want my work to be easy to benchmark, so I naively tried to fit my annotations into the JAMS standard, but I am having a hard time. The basic chord namespace did the trick in the original specification, but as I understand the changes I cannot use it anymore, because chords now follow a precise syntax (stuff like A7, G9...). Now I thought the vector would work, but it looks like I indeed misunderstood what it stood for. Any suggestions?

Regarding your last paragraph ("if you want...") I don't see the difference between the broadcasting of chords and what I am asking for (except the chord syntax issue...). Surely your ChordTransformer synchronises with your CQT transformer, right? Is there a way I could use that one?
Looking forward to hearing from you. Thanks for your help!

bmcfee commented 5 years ago

Ah, I see. That's an interesting setup, and not one that I've thought too carefully about, but yeah, it ought to be possible.

Now I thought the vector would work, but it looks like I indeed misunderstood what it stood for. Any suggestions?

One option might be to model them as tags, rather than dense vector data. If you have some scalar value associated with them (eg the amount of vibrato, or something like that), you could pack that into the confidence field. Then you could use DynamicTaskTransformer directly.
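As a hedged sketch of that suggestion (the tag values, times, and the scalar are invented for illustration; only jams.Annotation, the tag_open namespace, and the confidence field come from the discussion), each technique could become its own open-vocabulary tag with any associated scalar riding along in confidence:

```python
import jams

# Each technique gets its own tag; a scalar (e.g. vibrato depth) can ride
# along in the confidence field. Times and values below are made up.
ann = jams.Annotation(namespace='tag_open', duration=2157.0)
ann.append(time=0.0, duration=1.5, value='pizzicato')
ann.append(time=0.0, duration=1.5, value='sul ponticello')
ann.append(time=1.5, duration=2.0, value='vibrato', confidence=0.7)
```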

Surely your ChordTransformer synchronises with your CQT transformer, right? Is there a way I could use that one?

It doesn't actually synchronize to the feature transformer. Rather, it samples the annotations at a specified rate (given clumsily in terms of sampling rate and hop length, to make it easier to parametrize in terms of audio). The reason for this decision is that the typical use-case for pumpp has features going through a model, and then being compared to the task outputs. Models often have some change of resolution associated with them (eg pooling in time or downsampling), so this lets us generate output frames to match whatever the rate of the model is, rather than being tied to the rate given by the input features.
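A small sketch of the arithmetic implied here, using numbers from the example above: the task transformer emits roughly sr / hop_length target frames per second of audio, independently of the feature transformer, so choosing matching parameters makes the two frame counts line up.

```python
sr, hop_length = 44100, 512
duration = 2157.0                      # seconds, from the example above

# Target frames generated at this rate; roughly the ~185815 CQT frames
# reported above (the stated duration is rounded).
n_frames = int(duration * sr / hop_length)
print(n_frames)
```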

As I said above, the vector transformer wasn't really designed for this kind of use case because I hadn't considered time-varying vector data. We definitely could add a DynamicVectorTransformer class that does frame interpolation / broadcasting replication of static vector data (like how DynamicTagTransformer samples labeled tag intervals), but that's not currently implemented.

jeffd75 commented 5 years ago

I can use tag_open with a string made of my 6 integers separated by, say, spaces. I don't even need the confidence field. Then, down the line, I can process the tensors to separate the 6 dimensions. Far from ideal, but for the time being, yes, it could work. Thanks!
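A minimal sketch of that workaround, reusing the vector from the example above as the packed tag string (the unpack helper is purely illustrative):

```python
import numpy as np
import jams

# Pack the six integers into a single tag_open value...
ann = jams.Annotation(namespace='tag_open', duration=2157.0)
ann.append(time=0.0, duration=1.5, value='0 2 0 3 10 -1')

# ...and split them back apart downstream, after the pump / sampler.
def unpack(tag_string):
    return np.array(tag_string.split(), dtype=np.int32)

print(unpack('0 2 0 3 10 -1'))   # [ 0  2  0  3 10 -1]
```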

bmcfee commented 5 years ago

Oh, I was just thinking of each of your six integers getting its own tag (pizzicato => yes/no, etc.). Maybe that doesn't make sense for your data or model.

jeffd75 commented 5 years ago

In contemporary music, it's a bit more complex than that:

- The first integer is what kind of exciter/vibrator couple you use: for instance, pizzicato means you're not using the bow but a finger to excite the string, but you may also use the wood of the bow (col legno), or even hit the body of the instrument with your hand or fingers...
- The second integer is what you do with the left hand (the hand producing the pitch): vibrating the note or not, glissando, trill...
- The third integer is about the amplitude envelope of the sound you're producing: playing tremolo, or staccato, marcato, spiccato...
- The last integer is the position of the interaction: near the fingerboard, ordinary, near the bridge...

And believe it or not, this is actually a simplification!

This model could be used for all the strings, but we would need other models for wind instruments, brass and percussion.

jeffd75 commented 5 years ago

Sorry to bother you again. You said: "Then you could use DynamicTaskTransformer directly." I can only find BaseTaskTransformer or DynamicLabelTransformer. Guess you meant the latter?

bmcfee commented 5 years ago

Sorry, yes. I meant https://pumpp.readthedocs.io/en/latest/generated/pumpp.task.DynamicLabelTransformer.html
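For reference, a hedged sketch of wiring that transformer in at the same frame rate as the CQT from the first message (the labels list is a placeholder, and the keyword names are assumed from the rate discussion above rather than checked against the docs):

```python
# Build the label transformer at the CQT's frame rate so target frames
# and feature frames line up. `my_labels` stands in for the tag vocabulary.
my_labels = ['pizzicato', 'col legno', 'sul ponticello']   # placeholder
p_tags = pumpp.task.DynamicLabelTransformer(name='classes',
                                            namespace='tag_open',
                                            labels=my_labels,
                                            sr=sr, hop_length=hop_length)

pump = pumpp.Pump(p_cqt, p_tags)
data = pump(audio_f=audio_f, jam=jams_f)
```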

jeffd75 commented 5 years ago

Hi, I tweaked my model a little to use tag_open in JAMS (I collapsed my 4 dimensions into just one) together with pumpp.task.DynamicLabelTransformer, and it is working just fine, so I ought to thank you... except that in the resulting tensor the classes seem to be sorted alphabetically. Is there a way to avoid that? It obviously comes from the behavior of the sklearn.preprocessing.MultiLabelBinarizer class you're using. I need my classes to be exactly in the order I gave to the pumpp.task.DynamicLabelTransformer object.
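One possible post-processing sketch for the ordering issue (the label names and the output key are placeholders, and it assumes the encoded columns come out in sorted label order, as observed above):

```python
import numpy as np

# Hypothetical label list, in the order the columns are actually wanted
my_labels = ['pizz ord', 'legno pont', 'arco tasto']

# If the encoder emits columns in sorted label order, build a permutation
# that maps them back to the desired order.
sorted_labels = sorted(my_labels)
perm = [sorted_labels.index(lab) for lab in my_labels]

encoded = data['classes/tags']    # assumed output key, shape (1, n_frames, n_labels)
reordered = encoded[..., perm]
```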

More importantly: I also added "onset" and "pitch contour" information in the JAMS format. I would like to sample that information at the usual (sampling rate / hop length) frame rate. Any suggestion using your TaskTransformers? That is contextual information which I am going to need in order to analyse the behavior of my neural net.

Once this is done I have over 18 hours of cello to process...

bmcfee commented 5 years ago

- The first integer is what kind of exciter/vibrator couple you use: for instance, pizzicato means you're not using the bow but a finger to excite the string, but you may also use the wood of the bow (col legno), or even hit the body of the instrument with your hand or fingers...
- The second integer is what you do with the left hand (the hand producing the pitch): vibrating the note or not, glissando, trill...
- The third integer is about the amplitude envelope of the sound you're producing: playing tremolo, or staccato, marcato, spiccato...
- The last integer is the position of the interaction: near the fingerboard, ordinary, near the bridge...

And believe it or not, this is actually a simplification!

Following up on this: why use integers instead of independent tags for each of the values?

jeffd75 commented 5 years ago

Sorry, being originally a composer I am a bit new to all this. You can have tags, but some of them are mutually exclusive and some aren't, and it is extremely important for me to feed that information into the model. For example, "pizzicato" cannot occur together with "col legno" (both are on the first axis), but it can occur with "glissando" (first and second axes) and with "near the bridge" (first and last axes). In terms of machine learning, the four axes can be seen as 4 different tasks, except there will be a single NN for all 4 tasks.
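A hedged multi-task sketch along those lines: one label transformer per axis, so that labels within an axis stay mutually exclusive while axes combine freely. The axis names and vocabularies are invented, the keyword names are assumed rather than verified, and routing a separate annotation to each transformer (e.g. via distinct namespaces or separate JAMS files) is not shown.

```python
# One transformer per axis of the playing-technique model; all sampled at
# the same frame rate so the four targets stay aligned with the CQT frames.
axes = {'exciter':   ['arco', 'pizzicato', 'col legno'],
        'left_hand': ['ordinario', 'vibrato', 'glissando', 'trill'],
        'envelope':  ['ordinario', 'tremolo', 'staccato', 'marcato'],
        'position':  ['ordinario', 'sul ponticello', 'sul tasto']}

tasks = [pumpp.task.DynamicLabelTransformer(name=axis,
                                            namespace='tag_open',
                                            labels=labels,
                                            sr=sr, hop_length=hop_length)
         for axis, labels in axes.items()]

pump = pumpp.Pump(p_cqt, *tasks)
```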