bioinfomaticsCSU / deepsignal

Detecting methylation using signal-level features from Nanopore sequencing reads
GNU General Public License v3.0

Training motif-specific models #13

Closed: pterzian closed this issue 4 years ago

pterzian commented 5 years ago

Hi,

I would like to use deepsignal to detect 6mA modifications, and I need some clarification about the process because I am not sure I have understood it correctly.

From previous questions I understood that I should start by training a model.

I have runs covering two samples of a bacterium: one is native, so I consider the motif I am interested in to be fully methylated; the other is a mutant for which we assessed the absence of methylation at the same motif.

Here is the process I have in mind:

  1. Blend the fast5 files from both samples
  2. Split the blended fast5 files in two, so I have a --train_file dataset and a --valid_file dataset ready for extraction
  3. Run deepsignal extract on both groups of blended fast5 files with (e.g.) --motifs GATC and --mod_loc 1 (1 being the position of A in GATC). At this step I don't understand the use of --methy_label: how should I choose a label when I have just blended positive and negative samples?
  4. Give the extracted output tsv files to the deepsignal train command?

I feel I am missing something here!

Thanks in advance! Paul.

PengNi commented 5 years ago

Hi @pterzian ,

Thanks for your interest.

  1. So when using deepsignal, we don't blend the fast5 files first. The correct order is to extract features from both samples first, then blend the output tsv files for training.
  2. --methy_label is used to label positive and negative samples for training. Normally we set --methy_label to 1 when extracting features from the native sample, and to 0 when extracting features from the control sample.
  3. concat_two_files.py may be helpful for shuffling and concatenating the two files.
  4. Once we get the shuffle-concatenated file, we can split it into a training file and a validating file (see the sketch after this list). 10k samples should be enough for validation. Also, a ~1:1 ratio of positive to negative samples is suggested for both training and validation.
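
Here is a minimal sketch of that workflow, assuming the native and control reads sit in separate folders; the file names are placeholders, the --fast5_dir/--reference_path/--write_path options follow the repo README, and shuf stands in for what concat_two_files.py does:

    # extract features per sample: label native reads 1, control reads 0
    deepsignal extract --fast5_dir native_fast5s/ --reference_path ref.fa \
          --write_path native_features.tsv --motifs GATC --mod_loc 1 --methy_label 1
    deepsignal extract --fast5_dir control_fast5s/ --reference_path ref.fa \
          --write_path control_features.tsv --motifs GATC --mod_loc 1 --methy_label 0

    # shuffle-concatenate the two feature files (concat_two_files.py in this
    # repo serves the same purpose)
    cat native_features.tsv control_features.tsv | shuf > all_features.tsv

    # split off ~10k samples for validation, keep the rest for training
    head -n 10000 all_features.tsv > validating_file
    tail -n +10001 all_features.tsv > training_file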

Best, Peng

pterzian commented 5 years ago

Thank you Peng, it is much clearer to me now. I'll keep you posted on how it goes! (probably next week)

pterzian commented 5 years ago

The training has been running for a couple of days now, and I am following the process through the train.txt and valid.txt logs, but I am not sure how I should interpret them.

This is the command:

deepsignal train --train_file training_file \
      --valid_file validating_file \
      --model_dir models/ \
      --log_dir logs

This is a snippet of the train.txt log:

epoch:0, iterid:100, loss:3.228, accuracy:0.522, recall:0.292, precision:0.466
epoch:0, iterid:200, loss:0.768, accuracy:0.551, recall:0.288, precision:0.531
epoch:0, iterid:300, loss:0.689, accuracy:0.605, recall:0.385, precision:0.627

And this is for the valid.txt log:

epoch:0, iterid:100, loss:0.825, accuracy:0.548, recall:0.114, precision:0.541
epoch:0, iterid:200, loss:0.708, accuracy:0.612, recall:0.608, precision:0.574
epoch:0, iterid:300, loss:0.631, accuracy:0.679, recall:0.520, precision:0.706

So these are only the first 3 lines; the training is actually at its 7th logged step (iterid:700).

Thank you for your time! Paul

PengNi commented 5 years ago

Hi Paul,

  1. By default, deepsignal will train for at least 5 epochs and at most 10 epochs. The training may stop after any epoch from 5 to 10 finishes, because we use an early-stopping strategy. One epoch goes over the whole training dataset once.

  2. At each logged iteration, we use the current model to predict (1) the training data processed at this iteration (51200 samples by default) and (2) the validation dataset, and we log the prediction performances to train.txt and valid.txt separately. We log both just for comparison, because we use Dropout during training. You can also check the stdout of the training.

    Normally the performance on the validation dataset is closer to the performance on a test dataset.

  3. In my experience, we need to train for at least 5 epochs to get a stable model.

Also, can you tell me how many samples you have in the validation dataset? Based on the tests I did, 10k samples are enough for validation (to get a stable model). Too many samples in the validation dataset will make the training process much slower.
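
For scale, the 51200 figure above corresponds to the batch size times the logging interval; assuming the defaults are 512 and 100 (my reading of the defaults, so worth double-checking against deepsignal train --help):

    512 samples/batch × 100 batches/log = 51200 samples per logged iteration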

Best, Peng

pterzian commented 5 years ago

Hi Peng,

I understand that my validation dataset was too big. If by samples you mean the number of lines in my validation_file and training_file, then I had around 1M samples in each file.

I launched a new run following your last instructions. I reduced the validation_file to 10k lines with a ~1:1 ratio of negative to positive samples. I also reduced my training_file to 500k lines, with the same ratio.

Indeed it seems to run faster.

I have two questions:

  1. I found the path of what I suppose is a model in the checkpoint file of the models/ folder:

    model_checkpoint_path: "bn_17.sn_360.epoch_0.ckpt"
    all_model_checkpoint_paths: "bn_17.sn_360.epoch_0.ckpt"

    I used this path to call modifications on a dataset and it seems to work, yet I don't see a file with that exact name in the folder. All I see in the folder is:

    bn_17.sn_360.epoch_0.ckpt.data-00000-of-00001  
    bn_17.sn_360.epoch_0.ckpt.index  
    bn_17.sn_360.epoch_0.ckpt.meta  
    checkpoint

    I guess I am not sure of what I'm doing here.

  2. This question is more about understanding the algorithm. Following what you said about iterations and epochs, and given the number of lines in my training_file, I should expect around 10 iterations per epoch (with default parameters), right?

Paul.

PengNi commented 5 years ago

Hi Paul,

So theoretically we should use as many samples for training as possible. In my case I use 20M samples for training and 10k samples for validation.

(1) "bn_17.sn_360.epoch_0.ckpt" actually is prefix of those file names. It is a tensorflow feature. The prefix is used to indicate those files in the folder, all together as a model (a bunch of parameters).

(2) Yes. Say there are 500k samples for training; then there will be around 10 iterations per epoch. And normally we should expect a stable model after 5-10 epochs. But it is not limited to 5-10; these values can also be tuned to get a better model.
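
As a quick check with the numbers above:

    500000 training samples ÷ 51200 samples per logged iteration ≈ 9.8 ≈ 10 iterations per epoch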

Best, Peng

pterzian commented 5 years ago

Hi Peng,

Sorry for the late reply, I was not around. Following your indications we managed to obtain very satisfying results (5 epochs worked well, as you said). We will definitely continue trying deepsignal with other datasets, so I will be back for sure.

Thanks a bunch for the support!

pterzian commented 4 years ago

Hi Peng,

I am taking the liberty of reopening this topic because I am back to model training and have new questions. I am currently using 10M samples for training, and I was wondering how long it took your model to train on 20M samples? Did you do it on a GPU?

From what I see on my CPU machine, 10M samples is around 200 iterations per epoch, and it will definitely take at least 3-4 weeks to complete 5 epochs.

Best, Paul

PengNi commented 4 years ago

Hi Paul,

I used one GPU (TITAN X (Pascal)). It took about 44 hours to train on 20M samples. We don't suggest training on a CPU. However, converting the training and validation files from txt format to binary format will speed up the training process (see the --is_binary option of deepsignal train).
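
A hedged sketch of the binary-input run (the file names are placeholders, and the yes/no value syntax for --is_binary is an assumption, so check deepsignal train --help):

    deepsignal train --train_file training_file.bin \
          --valid_file validating_file.bin \
          --model_dir models/ \
          --log_dir logs \
          --is_binary yes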

Best, Peng

pterzian commented 4 years ago

Hi Peng,

Thanks for the answer. Actually we have a few GPU machines available, so I am up for trying to train this way. I will open a new issue about training models and calling modifications with a GPU; I tried the latter with no success a couple of weeks ago.