Psy-Fer / deeplexicon

Signal based nanopore RNA demultiplexing with convolutional neural networks
https://psy-fer.github.io/deeplexicon/
MIT License

Recreating training model #24

Open bnpapas opened 1 year ago

bnpapas commented 1 year ago

I have been attempting to use the fast5 data provided with the manuscript to train a model to call the same 4 barcodes as "resnet20-final.h5". I've used mapping information to assign barcodes, and if I use the given model with deeplexicon the agreement with my truth table is excellent. I've tried taking 40k reads from each barcode as a training set, with 10k from each as test and validation sets. The training runs, seemingly without issue, however it shows some behavior I don't understand.

  1. Even when the reported accuracy on the training set crosses 0.9, the validation accuracy hovers around 0.5. I've even completed a test where I've had my validation set be a subset of the training reads, and this still occurs.
  2. Using the final model output by the training yields terrible results, even when used on the training set.

Note: I have been using the docker image provided by pulling lpryszcz/deeplexicon:1.2.0-gpu, with "deeplexicon_multi.py train" having default options. Do you have any suggestions how I can improve the model training results?
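
For reference, the per-barcode split described above can be sketched as follows. This is only an illustration and not part of deeplexicon: the `split_per_barcode` helper and the `(read_id, barcode)` truth-table layout are assumptions. It draws disjoint train, validation, and test sets of fixed size from each barcode, so that any overlap between sets is deliberate rather than accidental.

```python
import random
from collections import defaultdict


def split_per_barcode(truth, n_train, n_val, n_test, seed=0):
    """Split reads into disjoint train/val/test sets, per barcode.

    truth: iterable of (read_id, barcode) pairs (hypothetical layout).
    """
    by_barcode = defaultdict(list)
    for read_id, barcode in truth:
        by_barcode[barcode].append(read_id)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    need = n_train + n_val + n_test
    splits = {"train": [], "val": [], "test": []}
    for barcode in sorted(by_barcode):
        reads = sorted(by_barcode[barcode])
        if len(reads) < need:
            raise ValueError(f"barcode {barcode}: {len(reads)} reads, need {need}")
        rng.shuffle(reads)
        # Consecutive slices of one shuffled list cannot overlap.
        splits["train"] += [(r, barcode) for r in reads[:n_train]]
        splits["val"] += [(r, barcode) for r in reads[n_train:n_train + n_val]]
        splits["test"] += [(r, barcode) for r in reads[n_train + n_val:need]]
    return splits


# Toy truth table: 240 reads spread evenly over 4 barcodes.
truth = [(f"read{i}", i % 4) for i in range(240)]
splits = split_per_barcode(truth, n_train=40, n_val=10, n_test=10)
```

With the numbers from this issue the call would use `n_train=40000, n_val=10000, n_test=10000`.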

enovoa commented 1 year ago

Hi @bnpapas - Are you segmenting the fast5?

bnpapas commented 1 year ago

I am following the instructions posted here: https://psy-fer.github.io/deeplexicon/train/ I'm not sure which step would be segmentation?

noncodo commented 1 year ago

You may need to segment the data a priori, e.g. by running: python3 deeplexicon.py dmux. This will split the signal to separate the barcodes from the RNA. Then train on the segmented barcode output.

bnpapas commented 1 year ago

The goal here is to be able to train a new model with an eye towards possibly adding new barcodes - I won't be able to use dmux first in a real use case. The truth table files I've assembled are based on mapping information, as was done in the publication. The match between these truth tables and the dmux results from "resnet20-final.h5" is very good.

Edit: To make sure it is clear, I am using the python version of the training code, which uses the "dRNA_segmenter" function to segment reads prior to image generation and subsequent training.
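
A minimal sketch of how the agreement between a mapping-based truth table and the dmux calls could be measured. The `agreement` helper and the `read_id -> barcode` dictionaries are hypothetical; how the two tables are loaded from disk is left out because it depends on the file formats used.

```python
from collections import Counter


def agreement(truth, calls):
    """Compare two read_id -> barcode mappings on the reads they share.

    Returns (accuracy, confusion), where confusion counts
    (truth_barcode, called_barcode) pairs.
    """
    shared = truth.keys() & calls.keys()
    confusion = Counter((truth[r], calls[r]) for r in shared)
    matches = sum(n for (t, c), n in confusion.items() if t == c)
    accuracy = matches / len(shared) if shared else float("nan")
    return accuracy, confusion


# Toy example: three shared reads, two of which agree.
truth = {"a": 1, "b": 2, "c": 3, "d": 4}
calls = {"a": 1, "b": 2, "c": 4, "e": 1}
accuracy, confusion = agreement(truth, calls)
```

The off-diagonal entries of `confusion` show which barcodes are being mistaken for each other, which is usually more informative than the single accuracy figure.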

bnpapas commented 1 year ago

When dmux is assigning barcodes, it uses the "classify" function. This function does a transform of the data:

  x = image.astype('float32') + 1
  x = x / 2

The training subcommand, however, does not take this step and trains directly on the images. I've removed the transform from "classify" and now my freshly-trained models produce sensible results with dmux. I assume I can get similar behavior by adding the transform into the train subroutine. Is there a reason to think having this transformation is better than not?
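
To illustrate the mismatch: a model trained on untransformed images but queried with transformed ones sees inputs with a different offset and scale. One way to keep the two paths consistent is to route both through a single preprocessing function. The sketch below is an assumption about how that could look, not the deeplexicon code; the `preprocess` name and the example input values are made up for illustration.

```python
import numpy as np


def preprocess(image):
    """Apply the same transform as the snippet quoted from "classify"."""
    x = np.asarray(image).astype("float32") + 1
    return x / 2


# Example values chosen for illustration only; the real image range
# produced by the training pipeline is not stated in this thread.
raw = np.array([[-1.0, 0.0, 1.0]])
scaled = preprocess(raw)

# Largest per-pixel difference between what an untransformed training
# run would see (raw) and what a transformed inference run would see.
max_shift = float(np.abs(scaled - raw).max())
```

Calling `preprocess` both when building the training arrays and inside the classification step would mean the transform can no longer be applied on one side only.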

Psy-Fer commented 1 year ago

I think that was added (and was meant to be in both places) to avoid a divide-by-zero error by making the values 1-indexed. Sorry, it's been a while since I wrote that.

fulaibaowang commented 1 year ago

Quoting noncodo: "You may need to segment the data a priori, e.g. by running python3 deeplexicon.py dmux. This will split the signal to separate the barcodes from the RNA. Then train on the segmented barcode output."

Would you mind sharing the code? I see deeplexicon_multi.py squig for getting the segmentation, but how would you "split the signal to separate the barcodes from the RNA"?