bioinfomaticsCSU / deepsignal

Detecting methylation using signal-level features from Nanopore sequencing reads
GNU General Public License v3.0

Questions on training function of DeepSignal #74

Closed PanZiwei closed 2 years ago

PanZiwei commented 2 years ago

Hi Peng, I am trying to use the training function of DeepSignal to re-implement the process in your paper and train a model on our own. Could you answer a few questions?

  1. Do you have a git repo or code to repeat the training process that produced the trained model model.CpG.R9.4_1D.human_hx1.bn17.sn360.v0.1.7+.tar.gz? I believe the fast5 files you used are from https://www.ncbi.nlm.nih.gov/sra?LinkName=bioproject_sra_all&from_uid=533926; are you using all the records named under "HX1 Nanopore sequence data: Sample HX1", or just specific runs?

  2. In the original paper you said that "We randomly select 50% of the H. sapiens and E. coli reads to train models, and test on the remaining 50% of data. To validate the robustness of DeepSignal, we randomly repeat train-test split ten times." However, it seems that the deepsignal train function only accepts a single training dataset and a single validation dataset. Does that mean the train/test splitting has to be done before the deepsignal train step? How can I do a repeated train-test split in this case?

  3. Have you ever tried DeepSignal on imbalanced datasets? If so, how was the performance? Or is undersampling/oversampling needed to balance the datasets before running the DeepSignal training step?

Thank you so much for your help!

PengNi commented 2 years ago

Hi @PanZiwei ,

Thanks for your interest in our tool.

  1. We used 10× reads of HX1 native DNA and 5× HX1 WGA reads (as control) to train the deepsignal model. We got all the HX1 reads directly from Nextomics. We also chose high-confidence CpGs from the BS-seq results: high-confidence methylated sites (methylation frequency = 1, coverage >= 5) and unmethylated sites (methylation frequency = 0, coverage >= 5). We randomly selected 10M positive and 10M negative samples for training. I checked the flowcells we used for training and found that they were not included in the DeepMod paper:

42G BNP18L0036-0504-A1
47G BNP18L0036-0504-A2
22G BNP18L0036-0504-B1
41G BNP18L0036-0504-B2
79G BNP18L0036-0518-C1
45G BNP18L0036-0518-C2
55G BNP18L0036-0521-C3
42G BNP18L0036-0521-C4
42G BNP18L0036-0521-C5
44G BNP18L0036-0521-C6

So to train a new model, I suggest randomly selecting 10× HX1 native reads. I listed demo commands to train a deepsignal model below:

# demo cmds for generating training samples
# 1. deepsignal extract (extract features from fast5s)
deepsignal extract --fast5_dir fast5s/ [--corrected_group --basecall_subgroup --reference_path] --methy_label 1 --motifs CG --mod_loc 0 --write_path samples_CG.hc_poses_positive.tsv [--nproc] --positions /path/to/file/containing/high_confidence/positive/sites.tsv
deepsignal extract --fast5_dir fast5s/ [--corrected_group --basecall_subgroup --reference_path] --methy_label 0 --motifs CG --mod_loc 0 --write_path samples_CG.hc_poses_negative.tsv [--nproc] --positions /path/to/file/containing/high_confidence/negative/sites.tsv

# 2. randomly select an equal number (e.g., 10M) of positive and negative samples
# the selected positive and negative samples can then be combined and used for training, see step 3.
python /path/to/scripts/randsel_file_rows.py --ori_filepath samples_CG.hc_poses_positive.tsv --write_filepath samples_CG.hc_poses_positive.r10m.tsv --num_lines 10000000 --header false &
python /path/to/scripts/randsel_file_rows.py --ori_filepath samples_CG.hc_poses_negative.tsv --write_filepath samples_CG.hc_poses_negative.r10m.tsv --num_lines 10000000 --header false &

# 3. combine positive and negative samples for training
# after combining, the combined file can be split into two files as the training/validation sets, see step 4.
python /path/to/scripts/concat_two_files.py --fp1 samples_CG.hc_poses_positive.r10m.tsv --fp2 samples_CG.hc_poses_negative.r10m.tsv --concated_fp samples_CG.hc_poses.r20m.tsv

# 4. split samples for training/validating
# suppose file "samples_CG.hc_poses.r20m.tsv" has 20000000 lines (samples), and we use 200k samples for validation
# the .train.tsv and .valid.tsv can be converted to .bin format to accelerate training (scripts/generate_binary_feature_file.py)
head -19800000 samples_CG.hc_poses.r20m.tsv > samples_CG.hc_poses.r20m.train.tsv
tail -200000 samples_CG.hc_poses.r20m.tsv > samples_CG.hc_poses.r20m.valid.tsv

# 5. train
CUDA_VISIBLE_DEVICES=0 deepsignal train --train_file samples_CG.hc_poses.r20m.train.tsv --valid_file samples_CG.hc_poses.r20m.valid.tsv --model_dir model.deepsignal.CG --display_step 2000
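One caveat on step 4 above: if the version of concat_two_files.py you use does not already shuffle the combined file, a plain head/tail split of a positives-then-negatives concatenation would leave the validation tail all-negative. A minimal shuffle sketch as an extra safeguard (not an official pipeline step; it assumes the file fits in memory, otherwise GNU shuf works too):

```python
# Hypothetical safeguard: shuffle combined samples before the head/tail
# split so both train and valid sets contain positives and negatives.
import random

with open("samples_CG.hc_poses.r20m.tsv") as f:
    lines = f.readlines()
random.seed(123)  # fixed seed for a reproducible split
random.shuffle(lines)
with open("samples_CG.hc_poses.r20m.shuf.tsv", "w") as f:
    f.writelines(lines)
```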

Note that the --positions file contains high-confidence methylated sites, or unmethylated sites, one site per line in the format chromosome position strand. The position is 0-based, with the strand (+ or -) following it, and Tab (\t) as the delimiter; it is like columns 1, 2, and 6 of a BED6 file.
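As an illustration of how such a positions file can be produced from BS-seq output, here is a minimal sketch applying the thresholds quoted above (frequency = 1 or 0, coverage >= 5). The input column layout (chrom, 0-based position, strand, methylation frequency, coverage) is an assumption for the example, not a format deepsignal prescribes:

```python
# Minimal sketch: filter a BS-seq frequency table (hypothetical column
# order: chrom, pos0, strand, methyl_freq, coverage) into the
# high-confidence positive/negative --positions files.
import csv

with open("bsseq_frequency.tsv") as fin, \
     open("hc_positive_sites.tsv", "w") as fpos, \
     open("hc_negative_sites.tsv", "w") as fneg:
    for chrom, pos, strand, freq, cov in csv.reader(fin, delimiter="\t"):
        if int(cov) < 5:
            continue  # skip low-coverage sites
        line = f"{chrom}\t{pos}\t{strand}\n"
        if float(freq) == 1.0:    # fully methylated -> positive sites
            fpos.write(line)
        elif float(freq) == 0.0:  # fully unmethylated -> negative sites
            fneg.write(line)
# Each output line looks like: chr1<TAB>10468<TAB>+
```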

  2. The E. coli and human data you quoted are ONT R9 data from the nanopolish paper. In that comparison with nanopolish, we first split the reads into two equal parts, then cross-trained and tested on the two parts.

  3. As the positive and negative samples of HX1 CpGs were sufficient, we didn't try imbalanced datasets. We just randomly sampled 10M positive and 10M negative samples for training.

Hope it helps. BTW, congratulations on your excellent Genome Biology paper, I learned a lot from it!

Best, Peng

PanZiwei commented 2 years ago

Hi Peng, Thank you so much for the generous and detailed sharing! Really appreciate it! For Q1, are there any extra steps that can accelerate the training process? How long did it take to finish the training for HX1?

For Q2, if I understand correctly, you did 3 rounds of train-test as shown in Table 2 of the original paper:

| No. | Training | Testing |
| --- | -------- | ------- |
| 1 | 50% E. coli | 50% E. coli |
| 2 | 50% human | 50% human |
| 3 | 50% E. coli | 50% human |

And for rounds 2 and 3, "To validate the robustness of DeepSignal, we randomly repeat train-test split ten times" means you 1) repeated steps 3/4 from the Q1 answer 10 times, then 2) calculated the average and standard deviation of the evaluation metrics over the 10 repeated train-test splits. Right?

For Q3, I noticed that pos_weight = 1 regardless of whether the data are balanced. So which would be more practical if the input is imbalanced: 1) keep |pos samples| : |neg samples| at the category ratio of the imbalanced input, or 2) balance the input first and then use pos_weight = 1?

https://github.com/bioinfomaticsCSU/deepsignal/blob/37741513c6842d411640ed0be948e51d7f2a7a01/deepsignal/train_model.py#L342-L346

Thanks again for your help! Looking forward to your new work on Nanopore and PacBio sequencing!

PanZiwei commented 2 years ago

Also, the best accuracy is 0 when I try to use the train function. Do you have any idea what the problem is? The warning messages:

/pod/2/li-lab/Ziwei/Anaconda3/envs/deepsignal_0.1.9/lib/python3.6/site-packages/sklearn/metrics/_classification.py:1248: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
/pod/2/li-lab/Ziwei/Anaconda3/envs/deepsignal_0.1.9/lib/python3.6/site-packages/sklearn/metrics/_classification.py:1248: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
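For context, that scikit-learn warning fires whenever a classifier predicts no positive samples at all, so precision's denominator (TP + FP) is zero; a minimal reproduction:

```python
# Reproduce the UndefinedMetricWarning: precision = TP / (TP + FP) is
# undefined when the model never predicts the positive class.
from sklearn.metrics import precision_score

y_true = [1, 0, 1, 1]
y_pred = [0, 0, 0, 0]  # no positive predictions at all

print(precision_score(y_true, y_pred))                   # 0.0, with warning
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0, warning silenced
```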

The parameter is shown below:

## parameters:
train_file:
        /fastscratch/c-panz/2021-11-21/test.train.tsv
valid_file:
        /fastscratch/c-panz/2021-11-21/test.valid.tsv
is_binary:
        no
model_dir:
        /fastscratch/c-panz/2021-11-21
log_dir:
        None
is_cnn:
        yes
is_base:
        yes
is_rnn:
        yes
kmer_len:
        17
cent_signals_len:
        360
batch_size:
        512
learning_rate:
        0.001
decay_rate:
        0.1
class_num:
        2
keep_prob:
        0.5
max_epoch_num:
        10
min_epoch_num:
        5
display_step:
        2000
pos_weight:
        1.0
# ===============================================
================ epoch 0 best accuracy: 0.000, best accuracy: 0.000
================ epoch 1 best accuracy: 0.000, best accuracy: 0.000
================ epoch 2 best accuracy: 0.000, best accuracy: 0.000
================ epoch 3 best accuracy: 0.000, best accuracy: 0.000
================ epoch 4 best accuracy: 0.000, best accuracy: 0.000
training finished, costs 1197.8 seconds..
Model training is done!
PengNi commented 2 years ago

Hi @PanZiwei , For Q1, converting the sample features from .tsv to .bin format (scripts/generate_binary_feature_file.py) can make the training faster. I don't remember the precise time cost, but it should finish within one or two days at most.
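To illustrate why binary features load faster (a generic sketch only, assuming purely numeric feature columns; this is not deepsignal's actual .bin layout, for which scripts/generate_binary_feature_file.py should be used): a .tsv costs one text-to-float parse per value, while fixed-width binary records come back in a single contiguous read.

```python
# Generic tsv-vs-binary loading illustration; NOT deepsignal's .bin format.
import numpy as np

N_FEATURES = 360  # hypothetical per-sample feature width

def load_tsv(path):
    # every value goes through str -> float parsing
    return np.loadtxt(path, delimiter="\t", dtype=np.float32)

def convert_to_bin(tsv_path, bin_path):
    load_tsv(tsv_path).astype(np.float32).tofile(bin_path)

def load_bin(path):
    # one contiguous read, no parsing; np.memmap(path, ...) would load lazily
    return np.fromfile(path, dtype=np.float32).reshape(-1, N_FEATURES)
```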

For Q2, repeating the train-test split ten times means that each time we randomly split the reads into two equal parts, trained the model on 50% of the reads, and evaluated it on the other 50%. We repeated the 50%-50% split 10 times for the human->human and ecoli->ecoli tests. In the ecoli->human test we didn't split the reads; we used the whole E. coli read set to train and the whole human read set to evaluate.
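A minimal sketch of that protocol, where read_ids is the list of read identifiers and train_and_evaluate() is a hypothetical stand-in for the whole extract/train/test pipeline:

```python
# Sketch: 10 repeated read-level 50/50 splits, reporting mean and std of a
# metric; train_and_evaluate() is a hypothetical placeholder function.
import random
import statistics

def repeated_split_eval(read_ids, train_and_evaluate, n_repeats=10, seed=0):
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        ids = read_ids[:]
        rng.shuffle(ids)
        half = len(ids) // 2
        scores.append(train_and_evaluate(ids[:half], ids[half:]))
    return statistics.mean(scores), statistics.stdev(scores)
```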

For Q3, yes, I did add a --pos_weight parameter in the train module; however, I didn't actually use it in practice. In my view, if you can collect millions of both positive and negative samples, just balance the numbers first; if, for example, there are fewer than 1 or 2 million positive samples but more than 10 million negative samples, maybe we should use the --pos_weight parameter.
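The idea behind --pos_weight is to scale the loss contribution of positive samples by roughly |negatives| / |positives| instead of resampling. A short sketch of that idea (in PyTorch purely for illustration; deepsignal's own training code is TensorFlow, see the train_model.py lines linked above):

```python
# Illustration of class weighting via pos_weight (hypothetical counts).
import torch
import torch.nn as nn

n_pos, n_neg = 1_000_000, 10_000_000        # imbalanced sample counts
pos_weight = torch.tensor([n_neg / n_pos])  # = 10.0
loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(8)                     # raw model outputs for a batch
labels = torch.randint(0, 2, (8,)).float()  # 0/1 targets
loss = loss_fn(logits, labels)              # positives weighted 10x
```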

For the accuracy-is-0 issue: the training process checks the model parameters after training every batch_size * display_step samples. So I guess the training data you used has fewer than 512 * 2000 = 1,024,000 samples, which means the training process never checked (and saved) the current best model parameters. Trying a smaller --display_step and/or a smaller --batch_size may make it work.
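A sketch of the checkpoint cadence described above (names hypothetical, not deepsignal's actual training loop): with ~1,000 samples and these settings, the validation check never triggers, so "best accuracy" stays 0.

```python
# Validation only runs every display_step batches, i.e. every
# batch_size * display_step training samples.
batch_size, display_step = 512, 2000
n_samples = 1000
steps_per_epoch = max(n_samples // batch_size, 1)  # = 1 batch per epoch here
for step in range(1, steps_per_epoch + 1):
    if step % display_step == 0:  # never true with these numbers
        print("validate and maybe save best model")
```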

Best, Peng

PanZiwei commented 2 years ago

Hi Peng, Thank you so much for the detailed answers!

For the accuracy issue, I believe the samples are the problem, since I am only using a very small dataset (~1,000 samples) from HX1 just to get an idea of the training process, and I didn't realize the effect of those parameters.

I will close the issue for now and reopen it if more questions arise.

Thank you again for all your help, and I wish you a nice week!

Best, Ziwei