bowang-lab / scGPT

https://scgpt.readthedocs.io/en/latest/
MIT License

mre is increasing while mse is decreasing #171

Open WhenMelancholy opened 5 months ago

WhenMelancholy commented 5 months ago

Hi, thank you for such wonderful work!

I am trying to pretrain scGPT on a small dataset, using the pipeline from the dev-temp branch (which I merged with the main branch). After solving the issues related to library versions and flash-attn, I finally got pretrain.py to work. However, the training loss looks a little strange.

This is part of the training log:

scGPT - INFO - -----------------------------------------------------------------------------------------                                                                                 
scGPT - INFO - | end of epoch 186 | time:  8.76s | valid loss/mse 157.0726 | mre 1.4009                                                                                                  
scGPT - INFO - -----------------------------------------------------------------------------------------                                                                                 

scGPT - INFO - Saving the best model to ./save/eval-Mar25-14-06-2024                                                                                                                     
scGPT - INFO - -----------------------------------------------------------------------------------------                                                                                 
scGPT - INFO - | end of epoch 187 | time:  8.13s | valid loss/mse 158.1381 | mre 1.4314                                                                                                  
scGPT - INFO - -----------------------------------------------------------------------------------------                                                                                 

scGPT - INFO - -----------------------------------------------------------------------------------------                                                                                 
scGPT - INFO - | end of epoch 188 | time:  5.81s | valid loss/mse 157.6718 | mre 1.3989                                                                                                  
scGPT - INFO - -----------------------------------------------------------------------------------------                                                                                 

scGPT - INFO - -----------------------------------------------------------------------------------------                                                                                 
scGPT - INFO - | end of epoch 189 | time:  6.14s | valid loss/mse 158.9929 | mre 1.4236                                                                                                  
scGPT - INFO - -----------------------------------------------------------------------------------------                                                                                 

scGPT - INFO - -----------------------------------------------------------------------------------------                                                                                 
scGPT - INFO - | end of epoch 190 | time:  8.69s | valid loss/mse 158.0198 | mre 1.4282                                                                                                  
scGPT - INFO - -----------------------------------------------------------------------------------------                                                                                 

scGPT - INFO - -----------------------------------------------------------------------------------------                                                                                 
scGPT - INFO - | end of epoch 191 | time:  8.16s | valid loss/mse 158.5909 | mre 1.4189                                                                                                  
scGPT - INFO - -----------------------------------------------------------------------------------------                                                                                 

scGPT - INFO - -----------------------------------------------------------------------------------------                                                                                 
scGPT - INFO - | end of epoch 192 | time:  9.12s | valid loss/mse 158.4677 | mre 1.4159                                                                                                  
scGPT - INFO - -----------------------------------------------------------------------------------------                                                                                 

scGPT - INFO - -----------------------------------------------------------------------------------------                                                                                 
scGPT - INFO - | end of epoch 193 | time:  8.12s | valid loss/mse 158.7186 | mre 1.4422                                                                                                  
scGPT - INFO - -----------------------------------------------------------------------------------------                                                                                 

scGPT - INFO - -----------------------------------------------------------------------------------------

As you can see, the validation loss is quite large and the mre is increasing.
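To illustrate how the two metrics can move in opposite directions, here is a minimal sketch. It assumes mre is the mean of |prediction - target| / (target + eps), which may not match exactly what pretrain.py computes. The function names and toy values are made up for illustration and are not taken from my dataset.

```python
# Toy illustration: MSE is dominated by errors on large targets,
# while a relative error is dominated by errors on small targets.
# The mre definition below is an assumption, not necessarily scGPT's.

def mse(preds, targets):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


def mre(preds, targets, eps=1e-6):
    """Mean relative error: |pred - target| / (target + eps), averaged."""
    return sum(abs(p - t) / (t + eps) for p, t in zip(preds, targets)) / len(targets)


# Hypothetical expression values: one small target and one large target.
targets = [1.0, 100.0]

# Hypothetical predictions at an earlier and a later checkpoint.
early = [1.5, 80.0]   # good on the small value, poor on the large one
late = [3.0, 90.0]    # better on the large value, worse on the small one

print("early: mse =", mse(early, targets), "mre =", mre(early, targets))
print("late:  mse =", mse(late, targets), "mre =", mre(late, targets))
```

In this toy case the later checkpoint has the lower mse but the higher mre, because it trades accuracy on the small target for accuracy on the large one. If something similar happens here, the two curves diverging would not necessarily mean the training is broken.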

My training command is:

DATASET="path to dataset"
LOG_INTERVAL=100
VALID_SIZE_OR_RATIO=0.1
MAX_LENGTH=1200
per_proc_batch_size=64
LAYERS=4
MODEL_SCALE=1

python ./examples/pretrain.py \
    --data-source $DATASET \
    --save-dir ./save/eval-$(date +%b%d-%H-%M-%Y) \
    --max-seq-len $MAX_LENGTH \
    --batch-size $per_proc_batch_size \
    --eval-batch-size $(($per_proc_batch_size * 2)) \
    --epochs 10000 \
    --log-interval $LOG_INTERVAL --save-interval 10000 \
    --no-cls \
    --no-cce \
    --fp16 \
    --vocab-path "path to vocab.json" \
    --nlayers 2 --nheads 2 --embsize 32 --d-hid 32

I was wondering what a normal training run looks like. Any help is welcome!

WhenMelancholy commented 5 months ago

@subercui Hi, may I ask for some details about your training curve? I would like to know whether the training log above looks correct.
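In case it helps with comparing curves, below is a small sketch that parses the "end of epoch" lines from the log above and fits a least-squares slope to each metric. The regular expression and the helper names (`parse_log`, `slope`) are my own and are not part of scGPT. The log lines are copied from the excerpt above.

```python
import re

# Validation lines copied from the log excerpt in this issue.
LOG = """\
scGPT - INFO - | end of epoch 186 | time:  8.76s | valid loss/mse 157.0726 | mre 1.4009
scGPT - INFO - | end of epoch 187 | time:  8.13s | valid loss/mse 158.1381 | mre 1.4314
scGPT - INFO - | end of epoch 188 | time:  5.81s | valid loss/mse 157.6718 | mre 1.3989
scGPT - INFO - | end of epoch 189 | time:  6.14s | valid loss/mse 158.9929 | mre 1.4236
scGPT - INFO - | end of epoch 190 | time:  8.69s | valid loss/mse 158.0198 | mre 1.4282
scGPT - INFO - | end of epoch 191 | time:  8.16s | valid loss/mse 158.5909 | mre 1.4189
scGPT - INFO - | end of epoch 192 | time:  9.12s | valid loss/mse 158.4677 | mre 1.4159
scGPT - INFO - | end of epoch 193 | time:  8.12s | valid loss/mse 158.7186 | mre 1.4422
"""

# Captures the epoch number, validation mse, and mre from one log line.
PATTERN = re.compile(
    r"end of epoch\s+(\d+).*?valid loss/mse\s+([\d.]+)\s*\|\s*mre\s+([\d.]+)"
)


def parse_log(text):
    """Return a list of (epoch, mse, mre) tuples found in the log text."""
    return [(int(e), float(m), float(r)) for e, m, r in PATTERN.findall(text)]


def slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


records = parse_log(LOG)
epochs = [r[0] for r in records]
print("mse slope per epoch:", slope(epochs, [r[1] for r in records]))
print("mre slope per epoch:", slope(epochs, [r[2] for r in records]))
```

On the eight epochs quoted above, both fitted slopes come out slightly positive, so this short window on its own may just be noise around a plateau. Running the same check over the full log would give a better picture of the trend.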