Shunsuke-1994 / rfamgen

Code repository of "Deep generative design of RNA family sequences"

Training produces weird results #5

Closed: 3f6a closed this issue 2 months ago

3f6a commented 2 months ago

I'm trying to follow the instructions in the readme to train a CMVAE model on a different family. I wrote the following script based on those instructions:

https://gist.github.com/3f6a/7c8e844293bf5be4ce51971905ce0cd5

However, after running it I evaluated the sampled sequences with Infernal, and they look very bad: scoring them with cmalign gives very poor bit scores (worse than random).
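
For reference, the scoring step I mean is roughly the following (a sketch; the sequence and output file names are placeholders, and I'm assuming Infernal 1.1):

# align the sampled sequences to the family CM and dump per-sequence bit scores
cmalign --sfile sampled_scores.txt -o sampled_aln.sto RF00162.cm sampled_seqs.fasta
# alternatively, bit scores per hit via cmsearch
cmsearch --tblout sampled_hits.tbl RF00162.cm sampled_seqs.fasta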

Am I doing something wrong with the linked script?

Shunsuke-1994 commented 2 months ago

Could you send me all the files? I will try to reproduce the issue.

3f6a commented 2 months ago

As far as I can see, the only files you need are the CM (which you can download from https://rfam.org/family/RF00162/cm) and the hits FASTA file (https://rfam.org/family/RF00162#tabview=tab1). Please let me know if something else is missing and whether you can reproduce the issue.
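
For completeness, the CM can be fetched directly from that URL, for example like this (a sketch; the file may be served gzipped depending on the endpoint, and I downloaded the hits FASTA manually from the sequences tab of that page):

# download the RF00162 covariance model from Rfam
curl -L -o RF00162.cm https://rfam.org/family/RF00162/cm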

Shunsuke-1994 commented 2 months ago

I tried to apply RfamGen to RF00162, but I couldn't reproduce the problem. On the contrary, I got high bit scores for the generated sequences, as shown in the attached figure. I suspect the reason is the epoch number you used: epoch 3 may be too early. In any case, I added my example (notebooks/trial_RF00162.ipynb). In that notebook I stopped training at epoch 60 (before early stopping kicked in). I hope this helps.

[attached figure: bitscores_hist]

Best, Shunsuke

3f6a commented 2 months ago

If you look at the script I used (https://gist.github.com/3f6a/7c8e844293bf5be4ce51971905ce0cd5), I set epoch=200, not 3. Am I missing something?

I'm taking a look at the notebook you shared, thanks.

Shunsuke-1994 commented 2 months ago

You are sampling from a model at epoch 3, at line 43 of your script:

--ckpt ./outputs/$RFAM_FAMILY/model_epoch3.pt \

If you want to save intermediate models, you can add --save_ckpt --ckpt_iter to the training command at lines 32-39, as sketched after the command below.

pixi run python scripts/train.py \
--data_dir datasets/$RFAM_FAMILY \
--X_train "$RFAM_FAMILY"_unique_seed_removed_notrunc_traceback_onehot_cm_train.h5 \
--w_train "$RFAM_FAMILY"_unique_seed_removed_notrunc_traceback_onehot_cm_train_weight_threshold0p1.h5 \
--X_valid "$RFAM_FAMILY"_unique_seed_removed_notrunc_traceback_onehot_cm_valid.h5 \
--w_valid "$RFAM_FAMILY"_unique_seed_removed_notrunc_traceback_onehot_cm_valid_weight_threshold0p1.h5 \
--epoch 200 \
--beta 1e-3 --use_anneal --use_early_stopping \
--log --log_dir ./outputs/$RFAM_FAMILY
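
For example, a sketch of the same command with intermediate checkpointing enabled (I'm copying the flags as written above; if --ckpt_iter expects a value such as a save interval, please supply it according to the help of scripts/train.py):

pixi run python scripts/train.py \
--data_dir datasets/$RFAM_FAMILY \
--X_train "$RFAM_FAMILY"_unique_seed_removed_notrunc_traceback_onehot_cm_train.h5 \
--w_train "$RFAM_FAMILY"_unique_seed_removed_notrunc_traceback_onehot_cm_train_weight_threshold0p1.h5 \
--X_valid "$RFAM_FAMILY"_unique_seed_removed_notrunc_traceback_onehot_cm_valid.h5 \
--w_valid "$RFAM_FAMILY"_unique_seed_removed_notrunc_traceback_onehot_cm_valid_weight_threshold0p1.h5 \
--epoch 200 \
--beta 1e-3 --use_anneal --use_early_stopping \
--save_ckpt --ckpt_iter \
--log --log_dir ./outputs/$RFAM_FAMILY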

3f6a commented 2 months ago

"You are sampling from a model at epoch 3, at line 43 of your script."

Oh I missed that. Thanks for noticing! I'll try again.

Shunsuke-1994 commented 2 months ago

I guess the issue is now solved. Let me close this.

3f6a commented 2 months ago

There is one more thing that worries me.

Here is an excerpt from the log.csv file that I get after training (first few lines only):

loss,kl,elbo,alpha
0.0,3.149926,0.0,0.0
0.0,3.1982505,0.0001342745,4.198373130411965e-05
0.0,4.1096373,0.0003450758,8.39674626082393e-05
0.0,1.6028905,0.00020188597,0.00012595119391235895
0.0,0.9293184,0.00015606501,0.0001679349252164786
0.0,0.42704636,8.964499e-05,0.00020991865652059824

The loss (always zero) seems very different to what you get in your own runs. Any idea why?

As far as I can see, the script that I'm using (https://gist.github.com/3f6a/7c8e844293bf5be4ce51971905ce0cd5) is identical to your notebook (https://github.com/Shunsuke-1994/rfamgen/blob/main/notebooks/trial_RF00162.ipynb)

Shunsuke-1994 commented 2 months ago

Could you share your dataset and output files (config, weights, h5 files, sequences, log, etc.), for example as a Google Drive link sent to my email (oolongtea1980[at]gmail.com)? Let me check.

3f6a commented 2 months ago

@Shunsuke-1994 I have sent you an invite on Google Drive.

Shunsuke-1994 commented 2 months ago

Thank you for sharing the files, but I'm afraid I can't reproduce or diagnose your problem because the shared files are only the outputs. Can I confirm again whether the sequences you used are the ones available here (https://rfam.org/family/RF00162#tabview=tab1)?

Assuming the scripts are unchanged and we are running the same code, my guess is that your ELBO is so small because the sequence weights applied to the reconstruction loss are too small. I am not sure the problem really stems from your weights, but very small weights tend to occur when the sequences are highly conserved and close to each other. In that case you need to reduce the weighting threshold (e.g., 0.1 -> 0.01). However, RF00162 in the Rfam database is highly diverse, and the weights should not be close to zero with a threshold of 0.1, as you can see in the notebook I shared. This is why I wanted to check your training data and weights. In any case, I suggest checking your weights; see the readme or the paper for more details.

Just in case, I will ask a lab member on our side to try to reproduce your issue, and I'll contact you if we are able to replicate your problem.
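
If you want to quickly inspect the weights yourself, a minimal sketch (assuming the HDF5 command-line tools are installed; WEIGHT_DATASET is a placeholder for whatever dataset name h5ls reports inside the file):

# list the contents of the training-weight file
h5ls -r datasets/RF00162/RF00162_unique_seed_removed_notrunc_traceback_onehot_cm_train_weight_threshold0p1.h5
# then print the values of the weight dataset found above
h5ls -d datasets/RF00162/RF00162_unique_seed_removed_notrunc_traceback_onehot_cm_train_weight_threshold0p1.h5/WEIGHT_DATASET

If most of the printed weights are close to zero, lowering the weighting threshold (0.1 -> 0.01) when rebuilding the dataset is the first thing to try.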

Fingers crossed.

3f6a commented 2 months ago

Hello, yes, I'm using the RF00162 sequences. If it helps, I could share more files, perhaps the dataset folder?

3f6a commented 2 months ago

Hi, I just tried training again, this time using Biopython 1.77 (which still has the Bio.Alphabet module), and now I'm getting a non-zero loss. So it's possible that newer Biopython releases (which removed alphabets) lead to some parsing issues ...
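
In case it helps anyone else, a sketch of what I did (Bio.Alphabet was removed in Biopython 1.78, so 1.77 is the last release that still has it; the exact pinning command depends on how your environment is managed):

# check which Biopython version the project environment resolves to
pixi run python -c "import Bio; print(Bio.__version__)"
# pin the last release that still ships Bio.Alphabet
# (shown for a pip-managed environment; with pixi/conda, pin it in the project configuration instead)
pip install "biopython==1.77"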

3f6a commented 2 months ago

Hi, just to let you know that now I'm getting high bitscores too. Thanks for your assistance!

Shunsuke-1994 commented 2 months ago

I see. I am glad to hear it worked.