Closed pterzian closed 4 years ago
Hi @pterzian ,
Thanks for your interest.
In deepsignal, we don't blend the fast5 files first. The correct order is to extract features from both samples first, then blend the output tsv files for training.
--methy_label is used to classify positive and negative samples during training. Normally we can set --methy_label to 1 when extracting features from the native sample, and 0 when extracting features from the control sample. Best, Peng
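The label-then-blend order Peng describes can be sketched with standard shell tools. The extract commands below are only commented placeholders (the exact flags depend on the deepsignal version); the blending step is shown concretely on tiny stand-in TSV files:

```shell
#!/bin/sh
set -e
# Stand-ins for the per-sample feature TSVs that `deepsignal extract`
# would produce with --methy_label 1 (native) and --methy_label 0 (control);
# the real extract commands are omitted here.
printf 'read1\tfeatures\t1\nread2\tfeatures\t1\n' > native.tsv
printf 'read3\tfeatures\t0\nread4\tfeatures\t0\n' > control.tsv
# Blend after extraction: concatenate the labelled TSVs and shuffle the rows
# so positive and negative samples are interleaved in the training file.
cat native.tsv control.tsv | shuf > training_file.tsv
```

The point is that the fast5 files themselves are never merged; only the extracted, already-labelled feature rows are.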
Thank you Peng, it is much clearer to me now. I'll keep you posted on how it went ! (probably next week)
So the training has been running for a couple of days now, and I am following the process through the train.txt and valid.txt logs, but I am not sure how I should interpret them.
This is the command:
deepsignal train --train_file training_file \
--valid_file validating_file \
--model_dir models/ \
--log_dir logs
This is a snippet of the train.txt log:
epoch:0, iterid:100, loss:3.228, accuracy:0.522, recall:0.292, precision:0.466
epoch:0, iterid:200, loss:0.768, accuracy:0.551, recall:0.288, precision:0.531
epoch:0, iterid:300, loss:0.689, accuracy:0.605, recall:0.385, precision:0.627
And this is for the valid.txt log:
epoch:0, iterid:100, loss:0.825, accuracy:0.548, recall:0.114, precision:0.541
epoch:0, iterid:200, loss:0.708, accuracy:0.612, recall:0.608, precision:0.574
epoch:0, iterid:300, loss:0.631, accuracy:0.679, recall:0.520, precision:0.706
So these are only the first 3 lines, and the training is actually at iterid:700.
My first question would be: how many iterations should I expect until the model is ready?
Following that, I would love to understand why there are two log files sharing the same variables with different values. Is deepsignal swapping the roles of the training and validation datasets in order to find the best one to train on?
My last question would be: can I try the model already?
Thank you for your time! Paul
Hi Paul,
So by default, deepsignal trains for at least 5 epochs and at most 10 epochs. The training may stop after any epoch (5 to 10) finishes, because we use an early-stopping strategy. One epoch goes over the whole training dataset once.
At each iteration, we use the current model to predict: (1) the training data of this iteration (51200 samples by default); (2) the validation dataset. We log the prediction performance to train.txt and valid.txt separately. We log both just for comparison, because we use Dropout during training. You can also check the stdout of the training.
Normally the performance on the validation dataset is closer to the performance on a test dataset.
In my experience, we may need to train for at least 5 epochs to get a stable model.
Also, can you tell me how many samples you have in the validation dataset? Based on the tests I did, 10k samples are enough for validation (to get a stable model). Too many samples in the validation dataset will make the training process much slower.
Best, Peng
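If the validation file started out as a slice of the same extracted features, Peng's 10k suggestion can be applied by subsampling a balanced set. A toy sketch with standard shell tools, assuming the methylation label sits in the last TSV column (the --methy_label value written during extraction), and using 5 rows per class in place of 5000:

```shell
#!/bin/sh
set -e
# Toy stand-in for a large extracted feature file; the label is assumed
# to be the last tab-separated column.
for i in $(seq 1 100); do printf 'r%s\tfeatures\t1\n' "$i"; done  > big.tsv
for i in $(seq 1 100); do printf 'r%s\tfeatures\t0\n' "$i"; done >> big.tsv
# Draw an equal number of positive and negative rows (5 each here;
# 5000 each would give a 10k-line validation file), then shuffle.
awk -F'\t' '$NF == 1' big.tsv | shuf | head -n 5  > valid.tsv
awk -F'\t' '$NF == 0' big.tsv | shuf | head -n 5 >> valid.tsv
shuf valid.tsv -o valid.tsv
```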
Hi Peng,
I understand that my validation dataset was too big. If by samples you mean the number of lines in my validation_file and training_file, I had around 1M samples in each file.
I launched a new run following your last instructions. I reduced the validation_file to 10k lines with a ratio of ~1:1 negative to positive samples. I also reduced my training_file to 500k lines, with the same ratio.
Indeed it seems to run faster.
I would have two questions:
I found the path of a supposed model in the checkpoint file of the models/ folder:
model_checkpoint_path: "bn_17.sn_360.epoch_0.ckpt"
all_model_checkpoint_paths: "bn_17.sn_360.epoch_0.ckpt"
I used this path to call modifications on a dataset and it seems to be working, yet I don't see the actual file in the folder. All I see in the folder is:
bn_17.sn_360.epoch_0.ckpt.data-00000-of-00001
bn_17.sn_360.epoch_0.ckpt.index
bn_17.sn_360.epoch_0.ckpt.meta
checkpoint
I guess I am not sure of what I'm doing here.
My second question is more about understanding the algorithm. Following what you said about iterations and epochs, and given the number of lines in my training_file, I should expect around 10 iterations per epoch, right (with default parameters)?
Paul.
Hi Paul,
So theoretically we should use as many samples for training as possible. In my case I use 20M samples for training and 10k samples for validation.
(1) "bn_17.sn_360.epoch_0.ckpt" is actually the prefix of those file names. It is a TensorFlow feature: the prefix refers to those files in the folder, which together form a model (a bunch of parameters).
(2) Yes. Say there are 500k samples for training; then there will be around 10 iterations per epoch. And normally we should expect a stable model after 5-10 epochs. But it is not limited to 5-10; this can also be tuned to get a better model.
Best, Peng
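The iterations-per-epoch figures discussed in this thread can be sanity-checked with shell arithmetic, taking the 51200 samples per logged iteration mentioned earlier as the divisor:

```shell
#!/bin/sh
# Rounded iterations per epoch = training samples / 51200 samples per
# logged iteration (the default mentioned earlier in this thread).
per_iter=51200
echo $(( (500000   + per_iter / 2) / per_iter ))   # 500k samples -> 10
echo $(( (10000000 + per_iter / 2) / per_iter ))   # 10M samples  -> 195
```

The second figure matches Paul's later observation of roughly 200 iterations per epoch with 10M training samples.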
Hi Peng,
Sorry for the delayed answer, I was not around. Following your indications we managed to obtain very satisfying results (5 epochs worked well, as you said). We will definitely keep trying deepsignal with other datasets, so I will be back for sure.
Thanks a bunch for the support !
Hi Peng,
I am taking the liberty of reopening this topic because I am back to model training and have new questions. I am currently using 10M samples for training, and I was wondering how much time it took your model to train on 20M samples? Did you do it on GPU?
From what I see on my CPU machine, 10M samples is around 200 iterations per epoch, and it will definitely take at least 3-4 weeks to complete 5 epochs.
best, Paul
Hi Paul,
I used one GPU (TITAN X (Pascal)). It took about 44 hours to train on 20M samples. We don't suggest training on CPU. However, converting the training and validation files from txt format to binary format will speed up the training process (see the --is_binary option of deepsignal train).
Best, Peng
Hi Peng,
Thanks for the answer. Actually we have a few GPU machines available, so I am up for trying to train this way. I will open a new issue about training models and calling modifications with GPU; I tried the latter with no success a couple of weeks ago.
Hi,
I would like to use deepsignal to detect 6mA modifications, and I would need some clarification about the process because I am not sure of what I understood.
From previous questions I understood that I should start by training a model.
I actually have runs covering two samples of a bacterium: one is native, so I consider the motif I am interested in to be fully methylated; the other is a mutant for which we have assessed the absence of methylation at the same motif.
Here is the process I have in mind:
1. Blend the fast5 files of both samples to get a --train_file dataset and a --valid_file dataset ready for extraction.
2. Run deepsignal extract on both groups of blended fast5 with (e.g.) --motifs GATC & --mod_loc 1 (1 being the position of A in GATC).
At this step I don't understand the use of --methy_label. How should I make a choice when I have just blended positive and negative samples? I feel I am missing something here!
Thanks in advance! Paul.