PengNi / deepsignal2

GNU General Public License v3.0
train new model using my own data #7

Closed Flower9618 closed 3 years ago

Flower9618 commented 3 years ago

Hi, thank you very much for providing this useful tool. I want to train a new model using my own data, but I do not know what format the training and validation data should have for the 'deepsignal2 train ...' command, and I cannot find any information about it. Could you explain it, or tell me where I can find this information? Is fast5s.CG.features.tsv the file that is used as input for training the model? Many thanks.

PengNi commented 3 years ago

Hi, @Flower9618 , thanks for your interest.

To train a new deepsignal2 model, we first need fully methylated and fully unmethylated sites as gold-standard sites, from BS-seq or from M.SssI-treated/PCR-amplified data. Then we extract training samples from the Nanopore reads aligned to those gold-standard sites.
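As a rough illustration of the site selection described above, here is a minimal Python sketch that splits per-site BS-seq calls into high-confidence positive and negative sites. Everything in it is an assumption made for illustration: the input tuple layout (chrom, pos, strand, coverage, methylated fraction), the function name, the thresholds (coverage of at least 5, fraction of 1.0 or 0.0), and the tab-separated chrom/pos/strand output. Please check "deepsignal2 extract --help" for what --positions actually expects.

```python
def split_high_confidence_sites(rows, min_cov=5, pos_frac=1.0, neg_frac=0.0):
    """Split per-site methylation calls into positive and negative site lists.

    rows: iterable of (chrom, pos, strand, coverage, methylated_fraction).
    This layout is hypothetical; adapt it to your BS-seq caller's output.
    """
    positive, negative = [], []
    for chrom, pos, strand, cov, frac in rows:
        if cov < min_cov:
            continue  # too few reads to trust the site
        site = f"{chrom}\t{pos}\t{strand}"
        if frac >= pos_frac:
            positive.append(site)  # fully methylated
        elif frac <= neg_frac:
            negative.append(site)  # fully unmethylated
    return positive, negative


if __name__ == "__main__":
    # Made-up toy rows, only to show the call.
    toy_rows = [
        ("chr1", 100, "+", 12, 1.0),
        ("chr1", 250, "-", 3, 1.0),   # dropped: low coverage
        ("chr2", 40, "+", 20, 0.0),
        ("chr2", 90, "-", 15, 0.6),   # dropped: intermediate methylation
    ]
    pos_sites, neg_sites = split_high_confidence_sites(toy_rows)
    print("\n".join(pos_sites))
    print("\n".join(neg_sites))
```

The two lists can then be written to separate files and passed to the two extract commands in step 1, one with --methy_label 1 and one with --methy_label 0.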

Please check out the following cmds and use "deepsignal2 train --help" for more information. In my tests, I usually need to run at least steps 1, 2, 4, 5, and 6 to train a new model.

# demo cmds for generating training samples
# 1. deepsignal2 extract (extract features from fast5s)
deepsignal2 extract --fast5_dir fast5s/ [--corrected_group --basecall_subgroup --reference_path] --methy_label 1 --motifs CG --mod_loc 0 --write_path samples_CG.hc_poses_positive.tsv [--nproc] --positions /path/to/file/containing/high_confidence/positive/sites.tsv
deepsignal2 extract --fast5_dir fast5s/ [--corrected_group --basecall_subgroup --reference_path] --methy_label 0 --motifs CG --mod_loc 0 --write_path samples_CG.hc_poses_negative.tsv [--nproc] --positions /path/to/file/containing/high_confidence/negative/sites.tsv

# 2. randomly select an equal number (e.g., 10m) of positive and negative samples
# the selected positive and negative samples can then be combined and used for training, see step 4.
python /path/to/scripts/randsel_file_rows.py --ori_filepath samples_CG.hc_poses_positive.tsv --write_filepath samples_CG.hc_poses_positive.r10m.tsv --num_lines 10000000 --header false &
python /path/to/scripts/randsel_file_rows.py --ori_filepath samples_CG.hc_poses_negative.tsv --write_filepath samples_CG.hc_poses_negative.r10m.tsv --num_lines 10000000 --header false &

# 3. extract balanced negative (or positive) samples if needed
# for example, for each kmer, extract as many negative samples as there are positive samples of that kmer
python /path/to/scripts/get_kmer_dist_of_feafile.py --feafile samples_CG.hc_poses_positive.r10m.tsv &
python /path/to/scripts/select_neg_samples_by_kmer_distri.py --feafile samples_CG.hc_poses_negative.tsv --krfile samples_CG.hc_poses_positive.r10m.kmer_distri.tsv --totalline 10000000 --wfile samples_CG.hc_poses_negative.b10m.tsv &

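# 4. combine positive and negative samples for training
# if step 3 was skipped, pass the negative file from step 2 (samples_CG.hc_poses_negative.r10m.tsv) as --fp2
# after combining, the combined file can be split into two files as training/validating sets, see step 5.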
python /path/to/scripts/concat_two_files.py --fp1 samples_CG.hc_poses_positive.r10m.tsv --fp2 samples_CG.hc_poses_negative.b10m.tsv --concated_fp samples_CG.hc_poses.rb20m.tsv

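# 5. split samples for training/validating
# suppose file "samples_CG.hc_poses.rb20m.*_bilstm.denoise_*.tsv" has 16000000 lines (samples), and we use 160k samples for validation
# note: treat the "*" in these file names as placeholders and substitute your actual file name (e.g., the combined file from step 4)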
head -15840000 samples_CG.hc_poses.rb20m.*_bilstm.denoise_*.tsv > samples_CG.hc_poses.rb20m.*_bilstm.denoise_*.train.tsv
tail -160000 samples_CG.hc_poses.rb20m.*_bilstm.denoise_*.tsv > samples_CG.hc_poses.rb20m.*_bilstm.denoise_*.valid.tsv

# 6. train
CUDA_VISIBLE_DEVICES=0 deepsignal2 train --train_file samples_CG.hc_poses.rb20m.*_bilstm.denoise_*.train.tsv --valid_file samples_CG.hc_poses.rb20m.*_bilstm.denoise_*.valid.tsv --model_dir model.dplant.CG --step_interval 1000
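
For readers who want to see what steps 2, 4, and 5 amount to, here is a minimal Python sketch: sample an equal number of lines from each class, combine, shuffle, and split off a validation set. It is not the code of randsel_file_rows.py or concat_two_files.py; the function names and parameters are hypothetical, and reservoir sampling is just one way to pick random lines in a single pass. The shuffle is included because a head/tail split would put only one class in the validation set if the combined file happened to be ordered by label.

```python
import random


def reservoir_sample(lines, k, rng):
    """Pick k lines uniformly at random from an iterable in a single pass."""
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            sample.append(line)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                sample[j] = line  # replace with probability k / (i + 1)
    return sample


def build_train_valid(pos_lines, neg_lines, n_per_class, n_valid, seed=0):
    """Sample each class, shuffle the combined samples, split off a validation set."""
    rng = random.Random(seed)
    combined = reservoir_sample(pos_lines, n_per_class, rng)
    combined += reservoir_sample(neg_lines, n_per_class, rng)
    if n_valid >= len(combined):
        raise ValueError("n_valid must be smaller than the number of combined samples")
    rng.shuffle(combined)
    return combined[n_valid:], combined[:n_valid]


if __name__ == "__main__":
    # Toy in-memory lines standing in for rows of the feature files.
    pos = [f"pos_sample_{i}\t1\n" for i in range(1000)]
    neg = [f"neg_sample_{i}\t0\n" for i in range(1000)]
    train, valid = build_train_valid(pos, neg, n_per_class=100, n_valid=20)
    print(len(train), len(valid))
```

With real data, open file handles can be passed as pos_lines and neg_lines, since they iterate line by line. Note that this sketch keeps all selected samples in memory, which may be too much for 10m samples per class.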

Best, Peng

Flower9618 commented 3 years ago

Thank you very much for your detailed explanation. It is really helpful to me.

PanZiwei commented 2 years ago

Hi @PengNi, can you tell me where to find get_kmer_dist_of_feafile.py and select_neg_samples_by_kmer_distri.py? I am interested in the imbalance strategy mentioned in deepsignal-plant, but I didn't find the scripts in the repo. Thanks!

PengNi commented 2 years ago

Hi @PanZiwei, you can try this script: balance_samples_of_kmers.py.txt

Best, Peng
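
For anyone landing here without the attachment, the idea behind step 3 (matching the negative samples to the k-mer distribution of the positive samples) can be sketched in Python roughly as follows. This is not the code of balance_samples_of_kmers.py or of the scripts named in step 3; the function names are hypothetical, and the k-mer column index (kmer_col=6, 0-based) is an assumption about the feature file layout that should be checked against your own files. Unlike the --totalline option in step 3, the sketch does not top up the selection when some k-mers have too few negatives.

```python
import random
from collections import Counter, defaultdict


def kmer_counts(lines, kmer_col=6):
    """Count how often each k-mer occurs in tab-separated feature lines."""
    return Counter(line.rstrip("\n").split("\t")[kmer_col] for line in lines)


def select_negatives_by_kmer(neg_lines, pos_counts, kmer_col=6, seed=0):
    """For each k-mer, pick as many negatives as there are positives (if available)."""
    rng = random.Random(seed)
    by_kmer = defaultdict(list)
    for line in neg_lines:
        kmer = line.rstrip("\n").split("\t")[kmer_col]
        if kmer in pos_counts:  # ignore k-mers absent from the positive set
            by_kmer[kmer].append(line)
    selected = []
    for kmer, wanted in pos_counts.items():
        pool = by_kmer.get(kmer, [])
        # Take everything if the pool is smaller than the positive count.
        selected.extend(rng.sample(pool, min(wanted, len(pool))))
    rng.shuffle(selected)
    return selected


if __name__ == "__main__":
    # Toy lines with 7 tab-separated columns; column 6 holds the k-mer.
    def row(kmer, label):
        return "\t".join(["chr1", "0", "+", "0", "read", "t", kmer]) + f"\t{label}\n"

    positives = [row("ACGTA", 1)] * 3 + [row("TCGAT", 1)] * 2
    negatives = [row("ACGTA", 0)] * 10 + [row("GCGCC", 0)] * 10
    balanced = select_negatives_by_kmer(negatives, kmer_counts(positives))
    print(kmer_counts(balanced))
```

The sketch groups all candidate negatives in memory, so for very large feature files a streaming approach would be needed instead.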