dohlee / chromoformer

The official PyTorch implementation of Chromoformer (Lee et al., Nature Communications, 2022).
GNU General Public License v3.0

Regression training problem #7

Closed ytang0831 closed 1 year ago

ytang0831 commented 1 year ago

Hi, Dohoon! I'm training with the latest version, chromoformer-1.1.1, but I still face some problems.

I want to use my own data (5 histone modifications) for training. Since I only used 5 HMs, I changed n_feats to 5, and x_pcres.append(torch.zeros([7, n_dummies max_n_bins])) in data.py was changed to x_pcres.append(torch.zeros([5, n_dummies max_n_bins])). For a benchmark, I used the workflow in the Snakefile to download and process the E003 data for H3K36me3, H3K9me3, H3K4me3, H3K27me3, and H3K27ac. The training results were not good. For example:

My own data: E4 1.2609, lr=1.9755090000000003e-05, r2=2.8943, r=40.1414
train_pred tensor([1.1884, 1.1867, 1.1938, 1.1672, 1.1879, 1.2064, 1.1774, 1.1597, 1.1676, 1.2042, 1.1927, 1.1942, 1.1437, 1.2036, 1.1939, 1.1115, 1.1496, 1.1728, 1.1622, 1.1368, 1.1721, 1.1567, 1.1437, 1.1713, 1.1811, 1.1880, 1.1607, 1.1411, 1.1769, 1.1734, 1.2052, 1.1861, 1.1821, 1.1568, 1.1554, 1.1430, 1.1501, 1.1744, 1.1765, 1.1775, 1.1318, 1.1354, 1.1342, 1.1645, 1.1633, 1.1683, 1.1499, 1.1734, 1.1870, 1.1440, 1.1761, 1.1636, 1.1872, 1.1765, 1.1612, 1.1797, 1.1683, 1.1504, 1.1305, 1.1920, 1.1727, 1.1765, 1.1521, 1.1658, 1.1843, 1.1715, 1.1697, 1.1584, 1.1826, 1.1833, 1.1409, 1.1678,……
train_label tensor([2.9441e+00, 2.9447e+00, 2.2693e+00, 8.1698e-01, 2.2235e+00, 0.0000e+00, 5.4318e-01, 1.7248e+00, 0.0000e+00, 3.6210e-01, 2.4192e+00, 3.0718e+00, 2.3477e-02, 1.4763e+00, 2.1545e-01, 6.9318e-01, 3.3394e-02, 3.1614e-01, 9.9438e-01, 8.6140e-01, 1.0275e+00, 9.7922e-01, 0.0000e+00, 1.5047e+00, 1.1517e+00, 9.7657e-01, 2.0475e+00, 3.0364e-02, 2.0634e+00, 3.1769e+00, 1.4338e+00, 2.2431e+00, 1.1195e+00, 9.3238e-02, 4.9067e-01, 1.7837e-01, 0.0000e+00, 2.2651e+00, 1.0775e-01, 1.4504e-01, 1.2038e-01, 1.1054e-01, 7.6351e-02, 1.5619e+00, 9.2099e-01, 2.2537e+00, 2.2126e+00, 2.5213e+00, 3.2175e-01, 4.8718e-01, 2.5095e+00, 1.2302e+00, 2.8014e+00, 2.3331e+00,

E003 data: E1 4.2925, lr=3e-05, r2=-0.2621, r=-3.9956:
train_pred tensor([2.3335, 2.3279, 2.2993, 2.3252, 2.3244, 2.3241, 2.3258, 2.3251, 2.3311, 2.3254, 2.3271, 2.3269, 2.3255, 2.3279, 2.3236, 2.3397, 2.3315, 2.3259, 2.3249, 2.3284, 2.3293, 2.3254, 2.3302, 2.3289, 2.3247, 2.3250, 2.3257, 2.2522, 2.3402, 2.3247, 2.3402, 2.2169, 2.3257, 2.3252, 2.3245, 2.1985, 2.3345, 2.3250, 2.3254, 2.3244, 2.3251, 2.3402, 2.3483, 2.3253, 2.3249, 2.3283, 2.3460, 2.3269, 2.3402, 2.3249, 2.3246, 2.3253, 2.3241, 2.3246, 2.3251, 2.3346, 2.3487, 2.3255, 2.3256, 2.3396, 2.2425, 2.3279, 2.1751, 2.3256, 2.3386, 2.3371, 2.3364, 2.3378, 2.1587, 2.3399, 2.3518, 2.3520, ……
train_label tensor([2.9635e+00, 2.0058e-02, 2.4263e+00, 3.6058e+00, 2.1020e+00, 2.7885e+00, 4.9141e+00, 4.9631e-02, 2.9182e+00, 3.7861e+00, 2.9960e+00, 8.0571e-01, 1.5056e-01, 2.7381e-01, 3.2806e-02, 4.1201e+00, 1.0420e+01, 2.5794e+00, 1.0568e-01, 0.0000e+00, 7.7206e+00, 0.0000e+00, 5.7832e+00, 2.3403e+00, 2.7512e+00, 4.4142e+00, 3.6942e+00, 4.3189e-01, 0.0000e+00, 1.3271e+00, 3.4216e-02, 2.9808e+00, 3.7439e+00, 2.5738e-02, 4.7923e-01, 4.5443e-02, 3.7533e+00, 5.7262e+00, 3.7567e+00, 7.4505e-02, 0.0000e+00, 4.8326e+00, 4.0688e+00, 5.3506e-01, 2.6239e+00, 6.5076e-01, 2.1319e+00, 1.3940e+00,……

Do you know what's wrong? The trained predictions look very concentrated.
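A quick way to sanity-check collapsed predictions like the ones above is to compute Pearson r and R² directly: near-constant predictions give r near zero and R² at or below zero, no matter which constant the model picks. A minimal sketch with NumPy (the metric definitions are standard; the repo's own logging may additionally scale them by 100):

```python
import numpy as np

def pearson_r(pred, label):
    """Pearson correlation between predictions and labels."""
    pred, label = np.asarray(pred, float), np.asarray(label, float)
    pc, lc = pred - pred.mean(), label - label.mean()
    return float((pc * lc).sum() / np.sqrt((pc ** 2).sum() * (lc ** 2).sum()))

def r2_score(pred, label):
    """Coefficient of determination; <= 0 when predictions do no
    better than always predicting the label mean."""
    pred, label = np.asarray(pred, float), np.asarray(label, float)
    ss_res = ((label - pred) ** 2).sum()
    ss_tot = ((label - label.mean()) ** 2).sum()
    return float(1.0 - ss_res / ss_tot)
```

A model that collapses to one output value yields R² ≤ 0 regardless of that value, which matches the near-constant train_pred tensors in the logs.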

ytang0831 commented 1 year ago

Also, I used demo/demo_data and demo/demo_meta.csv as a benchmark, with all 7 HMs.

The training results: E9 7.2638, lr=9.846350146311365e-06, r2=-80.6655, r=-24.9655
train_pred tensor([0.6373, 0.5817, 0.2736, 0.8219, 0.6994, 0.4239, 0.6603, 0.8454, 0.6639, 0.6467, 0.6379, 0.6413, 0.7249, 0.7922, 0.6413, 0.5686, 0.6988, 0.7390, 0.7294, 0.6750, 0.6222, 0.5160, 0.6834, 0.2539, 0.6294, 0.7066, 0.6722, 0.5910, 0.5635, 0.7796, 0.6274, 0.7061, 0.8262, 0.2335, 0.5510, 0.7344, 0.6266, 0.5898, 0.3109, 0.7708, 0.6337, 0.6028, 0.7746, 0.6757, 0.6747, 0.6267, 0.8135, 0.8407, 0.6051, 0.6274, 0.5671, 0.5294, 0.6441, 0.7506, 0.7341, 0.7641, 0.6174, 0.7198, 0.7402, 0.6175, 0.6794, 0.5806, 0.6255, 0.7049])
train_label tensor([6.1508, 0.0909, 4.0579, 0.5440, 0.0000, 5.0841, 4.1253, 0.7849, 3.6741, 3.5622, 0.5079, 1.6955, 4.7020, 0.0000, 6.1968, 1.0279, 5.3579, 1.8057, 0.3391, 4.0711, 3.3025, 0.0881, 3.0202, 3.2845, 4.0114, 3.3885, 3.2947, 5.7746, 1.6126, 1.9445, 3.9327, 6.7010, 0.0000, 3.0559, 0.0272, 0.0000, 3.0124, 5.7083, 5.4998, 0.0000, 0.0158, 0.8327, 0.0115, 0.8082, 4.1360, 1.1725, 0.3254, 2.2348, 2.4735, 2.6973, 0.2016, 0.5993, 3.5890, 0.7058, 6.6258, 0.5965, 0.7999, 0.0000, 2.5756, 1.7600, 3.1917, 0.1814, 3.7738, 4.1881], dtype=torch.float64)

dohlee commented 1 year ago

Hi, thank you for the detailed report. First of all, I've parameterized data.py with an n_feats parameter. This will take effect from v1.1.2, which will be released after this issue is fixed.
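For anyone adapting the number of marks before v1.1.2 lands, the dummy-pCRE padding can be written in terms of n_feats instead of the hard-coded 7. A minimal sketch, assuming the padded tensor shape is (n_feats, n_dummies, max_n_bins) as in the snippet quoted above (pad_pcres is a hypothetical helper name, not the actual data.py function):

```python
import torch

def pad_pcres(x_pcres, n_feats, n_dummies, max_n_bins):
    """Append a zero tensor for missing pCREs so every gene contributes
    the same number of entries; the first dimension is the number of
    histone marks (n_feats) rather than a hard-coded 7."""
    x_pcres.append(torch.zeros([n_feats, n_dummies, max_n_bins]))
    return x_pcres
```

With n_feats exposed this way, switching between 5-mark and 7-mark inputs needs no edits inside data.py itself.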

Could you please share a meta.csv file used for training with your own data? It'll help a lot.

ytang0831 commented 1 year ago

Hi, thanks for your reply. Here is a demo of my own data with 5 HMs, including my sequencing data and the E003 data. https://drive.google.com/file/d/1-dr4h1xAAUxqHAeGTv_UsMbNjR1AWG7h/view?usp=sharing

dohlee commented 1 year ago

Good. I'll look into it and get back to you.

Just to make sure, are you training with the full dataset (~15000 genes for training and ~4000 for validation), not the small demo data you provided?

ytang0831 commented 1 year ago

@dohlee Yes, I train with the full dataset.

ytang0831 commented 1 year ago

Hi @dohlee, I tried the full E003 dataset with 7 HMs, and I still cannot reproduce the results.

dohlee commented 1 year ago

I finally figured out the reason. The loss for regression training was misconfigured, and it is fixed in v1.1.2. Now I'm testing with E003 and it seems to train properly! (r2 > ~50, r > ~74 at the first epoch).
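For readers hitting the same symptom: a regression loss commonly gets miswired by pairing the model output with a classification criterion, the wrong reduction, or mismatched dtypes (note the float64 train_label in the logs above). The sketch below only illustrates the expected regression setup; it is not the actual v1.1.2 change:

```python
import torch
import torch.nn as nn

pred = torch.randn(8)                            # model output, float32
label = torch.rand(8, dtype=torch.float64) * 5   # log2-scale targets, float64

# Regression targets need a regression criterion and matching dtypes:
criterion = nn.MSELoss()
loss = criterion(pred, label.float())  # cast labels to float32 before the loss
```

If the criterion or dtypes disagree, training can silently drive the model toward a near-constant output like the collapsed predictions reported earlier in this thread.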

Please try with v1.1.2 and let me know if it works on your data!

Thanks

ytang0831 commented 1 year ago

Thanks, I will try v1.1.2 soon

ytang0831 commented 1 year ago

Hi @dohlee, v1.1.2 works very well! r ≈ 81 on average per epoch with my own data. Many thanks!

dohlee commented 1 year ago

Good to hear that!