abcyzj / ChoreoNet

Question about choreography #1

Open Jhc-china opened 3 years ago

Jhc-china commented 3 years ago

Hi @abcyzj, I have some questions about the choreography files:

  1. In some choreography annotations there are 'START' tags at the beginning of the 'movements' annotation, and also 'HOLD' tags. I cannot find definitions of these two special annotations in your paper. Could you please explain them a little more?
  2. There are 158 CAUs used in the 62 music pieces (including [START] and [HOLD]). What are the remaining 8 CAUs used for (you mentioned there are 164 CAUs)?
  3. How should I use the 'start_pos' and 'end_pos' in the choreography annotation? My understanding is that they give the beat interval, e.g. interval = (end_pos - start_pos) / len(movements), and if the CAU sequence is ['A', 'B', ...], the ground truth is ['SOD'] + ['NIL'] * (start_pos / interval) + ['A', 'B', ...] + ['EOD']. Is that right? (A small sketch of my interpretation follows below.)
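
For concreteness, a minimal sketch of the conversion I have in mind (my own illustration; rounding start_pos / interval to an integer number of [NIL] beats is my assumption):

```python
def build_ground_truth(start_pos, end_pos, movements):
    """Expand a choreography annotation into a beat-aligned CAU sequence
    with SOD / NIL / EOD tokens (sketch of my interpretation only)."""
    # average beat interval implied by the annotated span
    interval = (end_pos - start_pos) / len(movements)
    # number of empty beats before the dance starts
    n_nil = int(round(start_pos / interval))
    return ['SOD'] + ['NIL'] * n_nil + list(movements) + ['EOD']

# example: a clip annotated from 4.0 s to 12.0 s containing two CAUs
print(build_ground_truth(4.0, 12.0, ['A', 'B']))
# -> ['SOD', 'NIL', 'A', 'B', 'EOD']
```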
abcyzj commented 3 years ago

Hello @Jhc-china, thanks for your questions. Here are my answers.

  1. We didn't mention these two annotations in our paper due to space limitations. The 'START' annotation is given by the Tango dancer we invited; it means preparing for the dance to start. We implement it as an interpolation between the rest pose and the first frame of the next action. As for the 'HOLD' annotation, it means 'do not move and wait for the next action'. We implement it as an interpolation between the last frame of the previous action and the first frame of the next action (see the sketch after this list).
  2. You can find the definitions of all 164 CAUs in the movement_inverval.csv file. We defined and collected 164, but the dancers didn't use all of them.
  3. Your understanding is correct.
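
A minimal sketch of the interpolation described in answer 1 (a simplified illustration, not the exact code we use; it assumes poses are stored as (num_joints, 3) numpy arrays):

```python
import numpy as np

def transition_frames(from_pose, to_pose, num_frames):
    """Linear interpolation used for 'HOLD' (last frame of the previous
    action -> first frame of the next action) and, analogously, for
    'START' (rest pose -> first frame of the next action)."""
    weights = np.linspace(0.0, 1.0, num_frames)[:, None, None]
    return (1.0 - weights) * from_pose + weights * to_pose

# e.g. a 2-second hold at 25 fps between two (num_joints, 3) poses:
# frames = transition_frames(last_pose_of_A, first_pose_of_B, num_frames=50)
```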
Jhc-china commented 3 years ago

@abcyzj Thank you very much for your fast reply! I have some further questions.

  1. Like the [NIL] annotation, should I split the 'HOLD' annotation by beats? I.e., if a 'HOLD' annotation lasts for 2 beats between the 'A' and 'B' CAUs (the movements annotation may look like [..., 'A', 'HOLD', 'B', ...]), should the ground truth it is converted into be something like [..., 'A', [HOLD], [HOLD], 'B', ...]? ([HOLD] here means holding for one beat.)
  2. For the 'START' annotation, should I do the same as above, or just use a single [START] token? I found that every 'START' annotation in Tango lasts 16 beats. Should I treat 'START' as a whole (a single CAU) or as many one-beat pieces put together?
  3. I couldn't find some CAUs in the music-piece annotations (C\R\T\W.json), e.g. the C-0-7 CAU, which I could find in the ll_qq_02_2.c3d file. Are these CAUs, which do not appear in the music-piece annotations, only used for the motion generation model?
  4. Could you explain the 'fake_start/end_frame' and 'start/end_frame' columns in the movement_inverval.csv file a bit more?
abcyzj commented 3 years ago
  1. The length of each 'HOLD' is given in the durations part of the choreography file. Whether to split it or not depends on your implementation. In our paper, we split the 'HOLD' annotation by beats.
  2. The same as 'HOLD'. Since all 'START' annotations last for 16 beats, we treat 'START' as a single CAU.
  3. Yes, the choreographers didn't use these CAUs in the music-piece annotations, so they are only used for motion generation.
  4. The 'fake_start/end_frame' columns are just intermediate processing results due to different FPS; you can ignore them. If you want to load one CAU, first load the corresponding c3d file into a numpy array x, then the CAU clip can be obtained as x[start_frame:end_frame] (see the sketch below).
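
For reference, a minimal sketch of that loading step (a simplified illustration; it uses the ezc3d library, which may differ from our own loading code, and assumes start_frame/end_frame already match the c3d file's frame indexing):

```python
import numpy as np
import ezc3d

def load_cau_clip(c3d_path, start_frame, end_frame):
    """Load the marker trajectories of one CAU from a c3d file."""
    c3d_file = ezc3d.c3d(c3d_path)
    # ezc3d stores marker data as a (4, num_markers, num_frames) array
    points = np.asarray(c3d_file['data']['points'])
    x = points[:3].transpose(2, 1, 0)  # -> (num_frames, num_markers, xyz)
    return x[start_frame:end_frame]

# clip = load_cau_clip('ll_qq_02_2.c3d', start_frame=120, end_frame=480)
```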
Jhc-china commented 3 years ago

@abcyzj Thanks again for your reply! I'm just curious about some implementation details of the CAU Prediction Model:

  1. The chroma feature is at 10 fps while the onset and beat features are at 100 fps. I concatenate the chroma feature with the beat and onset features by repeating each chroma frame 10 times, upsampling it from 10 fps to 100 fps, so the input acoustic features are at 100 fps. Is that right?
  2. In your paper, "the local music feature is set to 10 seconds" and this is implemented by stacking five 1D-Conv layers. Is the kernel_size (local music receptive field) of each 1D-Conv layer 1000 frames, or just 200 frames? If it is the latter, I think the receptive field of the last 1D-Conv layer would be equivalent to 1000 frames of the input music feature.
  3. According to my understanding of Algorithm 1 in your paper, the time axis of the training procedure is in beats, which means I should shift the window by N beats according to the beat duration of the predicted CAU. The time axis of the input music feature, however, is in milliseconds (there is a 10 ms gap between extracted feature frames). To obtain the next encoded $m_{t}$, the window over the music feature should shift by N * beat_interval (in ms) / 10. Is that right? There may be an integer conversion in this step, but I think it doesn't matter? (A small worked example follows after this list.)
  4. You didn't give the hidden size of the MLP in your paper; maybe the CAU Prediction Model is not sensitive to these hyper-parameters? Could you give some advice on setting this kind of hyper-parameter for the Conv and MLP layers?
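
To make question 3 concrete, a small worked example of the shift I have in mind (the numbers are purely illustrative):

```python
# Suppose the beat interval of the music is 500 ms (120 BPM) and the
# predicted CAU lasts N = 8 beats. With acoustic features extracted
# every 10 ms, the music-feature window would be shifted by:
beat_interval_ms = 500
N = 8
shift_frames = int(N * beat_interval_ms / 10)
print(shift_frames)  # 400 feature frames
```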
abcyzj commented 3 years ago
  1. We first convolve these features separately and then concatenate the convolved features, so we didn't upsample the chroma feature.
  2. The original expression is "The sliding window size of this local musical feature encoder is set to 10 seconds", which does not mean the receptive field of each conv layer is set to 10 seconds.
  3. Yes. Actually we first detect the beats in the music using librosa and then slide the window according to the detected beats (see the sketch after this list).
  4. Actually I think we mentioned that the MLP layer outputs a 64-dimensional local musical feature. The sizes of the Conv and MLP layers were chosen rather empirically.
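
As a reference for answer 3, a minimal sketch of beat detection with librosa and of mapping beat times onto 100 fps feature indices (a simplified illustration; the exact librosa settings may differ from what we used):

```python
import librosa

# load the music piece and detect beats with librosa's default tracker
y, sr = librosa.load('music_piece.wav')
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
beat_times = librosa.frames_to_time(beat_frames, sr=sr)

# with 100 fps (10 ms) acoustic features, a beat at t seconds lands at
# feature index round(t * 100); the sliding window is moved beat by beat
beat_feature_indices = [int(round(t * 100)) for t in beat_times]
```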
Jhc-china commented 3 years ago

Thanks @abcyzj.

  1. Due to the different time scales, are there two different encoders, one for the chroma feature and one for the beat & onset features?
  2. So you first clip the raw features, convolve them through the five 1D-Conv layers, flatten and concatenate them, feed them to the MLP, then shift the time window and repeat? Why not convolve all the raw features in a single forward pass, so that $m_{t}$ can be obtained directly from the convolved feature at time t and the encoder forward step in Algorithm 1 can be removed from the for loop?
  3. I think you are describing the inference stage. In the training stage, the beats can be obtained from the duration annotation, and the window should shift according to that annotation?
  4. The number of beats in a CAU is fixed, but the BPM varies across music, so the duration of the same CAU may differ depending on the BPM. Is any interpolation used within a CAU to adapt to the BPM?
abcyzj commented 3 years ago
  1. Yes.
  2. Yes, I think during inference you can get rid of the loops.
  3. Yes.
  4. As we described in Section 5.2, we align the kinematic beats of the CAUs to the musical beats. We achieve this by applying an affine transformation to the CAU (a simple sketch follows).
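
A minimal sketch of one way to realize this alignment (a simplified illustration; it assumes the affine transformation is a linear rescaling of the CAU along the time axis, which may differ from our exact implementation):

```python
import numpy as np

def retime_cau(cau_frames, target_num_frames):
    """Linearly rescale a CAU clip of shape (num_frames, num_joints, 3)
    along the time axis so its duration matches the musical beats it is
    assigned to."""
    src = np.linspace(0.0, 1.0, cau_frames.shape[0])
    dst = np.linspace(0.0, 1.0, target_num_frames)
    flat = cau_frames.reshape(cau_frames.shape[0], -1)
    warped = np.stack(
        [np.interp(dst, src, flat[:, k]) for k in range(flat.shape[1])],
        axis=1,
    )
    return warped.reshape(target_num_frames, *cau_frames.shape[1:])
```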
Jhc-china commented 3 years ago

Thanks @abcyzj, I have a question about the training loss, equation (3) in your paper. After Algorithm 1 I get the predicted CAU sequence Y_gen. I think the t in equation (3) is on the beat scale; is the NLL loss computed on every beat, or only on the first beat of each predicted CAU? Also, I found that the model only repeatedly predicts [NIL] (maybe because too many [NIL] tokens are padded at the front of the CAU sequences). How do you handle this case? BTW, I have no idea how to perform inverse kinematics to calculate the rotations of the human body pose. Could you suggest some libraries or methods to obtain the joint rotations from the Euclidean joint coordinates in your c3d data? Thanks!

xpveryrich commented 2 years ago

@Jhc-china Are you reproducing the paper now? I have some questions too, can we talk about it?