Closed 3f6a closed 2 months ago
Could you give me all the files? I will try to reproduce.
As far as I can see, the only files you need are the CM model (that you can download from https://rfam.org/family/RF00162/cm), and the hits FASTA file (https://rfam.org/family/RF00162#tabview=tab1). Please let me know if something else is missing and if you can reproduce the issue.
I tried to apply RfamGen to RF00162, but I couldn't reproduce the issue.
Rather, I got high bit scores for the generated sequences, as shown in the attached figure.
I suspect the reason might be the epoch number you used; epoch = 3 may be too early.
Anyway, I added my example (notebooks/trial_RF00162.ipynb). In this notebook, I stopped training at epoch = 60 (before early stopping).
I hope this helps.
Best, Shunsuke
If you look at the script I used (https://gist.github.com/3f6a/7c8e844293bf5be4ce51971905ce0cd5), I set epoch=200, not 3. Am I missing something?
I'm taking a look at the notebook you shared, thanks.
You are sampling from a model at epoch 3, at L43 of your script:
--ckpt ./outputs/$RFAM_FAMILY/model_epoch3.pt \
In case you want to save intermediate models, you can add --save_ckpt --ckpt_iter
to the training command at L32-39:
pixi run python scripts/train.py \
--data_dir datasets/$RFAM_FAMILY \
--X_train "$RFAM_FAMILY"_unique_seed_removed_notrunc_traceback_onehot_cm_train.h5 \
--w_train "$RFAM_FAMILY"_unique_seed_removed_notrunc_traceback_onehot_cm_train_weight_threshold0p1.h5 \
--X_valid "$RFAM_FAMILY"_unique_seed_removed_notrunc_traceback_onehot_cm_valid.h5 \
--w_valid "$RFAM_FAMILY"_unique_seed_removed_notrunc_traceback_onehot_cm_valid_weight_threshold0p1.h5 \
--epoch 200 \
--beta 1e-3 --use_anneal --use_early_stopping \
--log --log_dir ./outputs/$RFAM_FAMILY
Oh I missed that. Thanks for noticing! I'll try again.
I guess the issue should be solved. Let me close it.
There is one more thing that worries me.
Here is an excerpt from the log.csv file that I get after training (first few lines only):
loss,kl,elbo,alpha
0.0,3.149926,0.0,0.0
0.0,3.1982505,0.0001342745,4.198373130411965e-05
0.0,4.1096373,0.0003450758,8.39674626082393e-05
0.0,1.6028905,0.00020188597,0.00012595119391235895
0.0,0.9293184,0.00015606501,0.0001679349252164786
0.0,0.42704636,8.964499e-05,0.00020991865652059824
The loss (always zero) seems very different from what you get in your own runs. Any idea why?
As far as I can see, the script that I'm using (https://gist.github.com/3f6a/7c8e844293bf5be4ce51971905ce0cd5) is identical to your notebook (https://github.com/Shunsuke-1994/rfamgen/blob/main/notebooks/trial_RF00162.ipynb).
Could you share your dataset and output files (config, weights, h5 files, sequences, log, etc.), or a Google Drive link, via email (oolongtea1980[at]gmail.com)?
Let me check.
@Shunsuke-1994 I have sent you an invite on Google drive
Thank you for sharing the files, but I'm afraid I can't reproduce or diagnose your problem because the shared files are only outputs. Can I confirm again that the sequences you used are the ones available here (https://rfam.org/family/RF00162#tabview=tab1)?

If I assume the script is unchanged and we are using the same scripts, the reason your ELBO is so small may be that the weights of the reconstruction loss are too small. Although I am not sure the problem stems from your weights, very small weights tend to occur when the sequences are highly conserved and close to each other. In that case, you need to reduce the weighting threshold (e.g., 0.1 -> 0.01). However, RF00162 in RfamDB is highly diverse, and the weights can't be close to zero with threshold = 0.1, as you can see in the notebook I shared. This is why I wanted to check your training data and weights. Anyway, I suggest checking your weights. For more details, see the readme or the paper.

Just in case, I will ask a lab member on our side to try to reproduce your issue, and I'll contact you if we are able to replicate your problem.
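To illustrate why conserved families produce near-zero weights, here is a minimal sketch of identity-based sequence reweighting. The exact scheme is an assumption, not RfamGen's actual code: each sequence's weight is 1 / (number of sequences within `threshold` normalized Hamming distance of it), so many near-duplicates share the mass and each weight shrinks.

```python
# Sketch of identity-based reweighting (assumed scheme, not RfamGen's code).
# Weight of sequence i = 1 / (number of sequences within `threshold`
# normalized Hamming distance of sequence i, including itself).
def sequence_weights(seqs, threshold=0.1):
    n = len(seqs)
    L = len(seqs[0])
    weights = []
    for i in range(n):
        neighbors = sum(
            1 for j in range(n)
            if sum(a != b for a, b in zip(seqs[i], seqs[j])) / L <= threshold
        )
        weights.append(1.0 / neighbors)
    return weights

# Nearly identical sequences -> every sequence has many neighbors -> small weights.
conserved = ["AAAAAAAAAA", "AAAAAAAAAA", "AAAAAAAAAC", "AAAAAAAAAG"]
print(sequence_weights(conserved))  # -> [0.25, 0.25, 0.25, 0.25]

# Diverse sequences -> each sequence is only its own neighbor -> weight 1.0.
diverse = ["AAAAAAAAAA", "CCCCCCCCCC", "GGGGGGGGGG", "UUUUUUUUUU"]
print(sequence_weights(diverse))    # -> [1.0, 1.0, 1.0, 1.0]
```

With the conserved toy family, lowering the threshold (e.g. 0.1 -> 0.01) shrinks each neighborhood and pushes the weights back up, which is the fix suggested above.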
Fingers crossed.
Hello, yes I'm using the RF00162 sequences. If it helps I could share more files? Perhaps the dataset folder?
Hi, I just tried training again, this time using Biopython 1.77 (which still has the alphabets), and now I'm getting a non-zero loss. So it's possible that the newer Biopython (which removed the alphabets) is leading to some parsing issues ...
Hi, just to let you know that now I'm getting high bitscores too. Thanks for your assistance!
I see. I am glad to hear it worked.
I'm trying to train a CMVAE model on a different family by following the instructions in the readme. I wrote the following script:
https://gist.github.com/3f6a/7c8e844293bf5be4ce51971905ce0cd5
However, after running it, I evaluated the sampled sequences with Infernal, and they seem very bad: computing the scores of the sampled sequences with cmalign gives very poor scores (worse than random). Am I doing something wrong in the linked script?