conor-horgan / DeepeR

DeepeR: deep learning enabled Raman spectroscopy
MIT License
70 stars 22 forks

Questions for: High-throughput molecular imaging via deep learning enabled Raman spectroscopy #2

Open ever4244 opened 1 year ago

ever4244 commented 1 year ago

Dear Dr. Horgan:

Good morning!

I have read your paper High-throughput molecular imaging via deep learning enabled Raman spectroscopy and your code at GitHub - conor-horgan/DeepeR: DeepeR: deep learning enabled Raman spectroscopy. They are of great help for my current research, which concerns Raman spectroscopy and bacterial metabolism.

I wonder if you could help me with several questions:

  1. In your 159618x500 training dataset, how can I map the 500-length data array back to actual Raman shift? (What is the formula that projects the x-axis from roughly 0 to 1800 cm-1 Raman shift onto the 0-500 array indices in Training_Input?)

  2. In our research, a D2O/H2O peak between 1700-2700 cm-1 Raman shift is important for predicting the metabolism of the bacteria. However, your dataset has only 500 data points (I assume its Raman shift range is about 500-1800 cm-1). Do you have an untruncated dataset with a longer data length (for example, 500-4000 cm-1)? We would be very grateful if you could share such a longer dataset with us.

[image attached]
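For reference, the index-to-wavenumber mapping being asked about can be sketched as below, assuming (hypothetically) a linear axis spanning 500-1800 cm-1 over 500 points; the true endpoints should be taken from the x-axis file accompanying the released dataset, so the constants here are only placeholders:

```python
# Hypothetical linear mapping between array index (0..499) and Raman shift.
# The endpoints are assumptions; read the real ones from the dataset's x-axis file.
SHIFT_MIN = 500.0   # assumed lower bound (cm-1)
SHIFT_MAX = 1800.0  # assumed upper bound (cm-1)
N_POINTS = 500

def index_to_shift(i: int) -> float:
    """Map a data-array index to an approximate Raman shift (cm-1)."""
    return SHIFT_MIN + i * (SHIFT_MAX - SHIFT_MIN) / (N_POINTS - 1)

def shift_to_index(shift_cm1: float) -> int:
    """Map a Raman shift (cm-1) to the nearest data-array index."""
    frac = (shift_cm1 - SHIFT_MIN) / (SHIFT_MAX - SHIFT_MIN)
    return round(frac * (N_POINTS - 1))
```

With these assumed endpoints, index 0 maps to 500 cm-1 and index 499 to 1800 cm-1.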

  3. We have about 10K Raman spectra for E. coli (500-4000 cm-1 Raman shift). I am currently considering randomly concatenating our 10K longer-range spectra with your 160K spectra (500-1800 cm-1 Raman shift); do you think this is a viable approach to data augmentation? Randomly sampling the 10K longer-range dataset and concatenating the samples with your 160K shorter-range dataset would produce another 160K+ longer-range dataset.

  4. What about overfitting? I found that my current encoder-decoder DL model tends to memorize the average spectrum of the training set. If we encounter a totally new bacterium, the denoising produces a false image (which is understandable). Encoder-decoder models produce the clean output half from encoder-side information and half from decoder-side memorization.

In our use cases, we can often find new bacteria on the slides that are outside the training set. However, my current model will still produce the spectrum of E. coli (a bacterium in the training set) rather than the spectrum of Lactobacillus (a new bacterium) on real test samples.

Do you have any suggestions for preventing the DL model from producing output purely from memory when an incoming sample is clearly a new specimen? It might be better to give up some accuracy to stop the model from reconstructing every spectrum as the training-set specimen average. Maybe putting more weight on the encoder-side information and reducing the memorization capacity of the decoder side? I currently don't have a good solution to this problem and hope to hear your insight.
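The splicing idea in question 3 could be sketched with numpy as follows. This is a simplified illustration with made-up arrays, assuming the first 500 channels of the long-range spectra already align with the short axis (real data would first need resampling onto a common Raman-shift axis); all names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the two datasets (random values for illustration):
# short_set: 160K-style spectra covering ~500-1800 cm-1 (500 points)
# long_set:  10K-style spectra covering ~500-4000 cm-1 (say 1350 points)
short_set = rng.random((1000, 500))
long_set = rng.random((50, 1350))

def splice_datasets(short_set, long_set, boundary=500):
    """For each short spectrum, append the tail (channels beyond `boundary`)
    of a randomly sampled long spectrum, scaled so intensities match at the
    junction and no artificial step is introduced."""
    idx = rng.integers(0, len(long_set), size=len(short_set))
    tails = long_set[idx, boundary:]
    eps = 1e-8  # guard against division by zero at the junction
    scale = short_set[:, -1:] / (tails[:, :1] + eps)
    return np.concatenate([short_set, tails * scale], axis=1)

augmented = splice_datasets(short_set, long_set)  # shape (1000, 1350)
```

Note the obvious caveat: the spliced tails carry no real correlation with the short-range region they are attached to, which may itself bias a model that learns cross-region features.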

Regards! WEI LI

conor-horgan commented 1 year ago

Hi Wei,

I am glad to hear that you are finding this work useful. I've tried to provide some brief answers to your questions below.

  1. I've now added the x-axis information to the datasets hosted on Google Drive.
  2. I unfortunately do not have access to the dataset with an extended Raman shift.
  3. This could work, though potentially there is some deuterium signal in the 500-1800 Raman shift region that is not easily observable which might introduce unwanted bias. To check this you could perform a PCA on your dataset (cropped to 500-1800 Raman shift) and see if there is separation between spectra with/without deuterium.
  4. To prevent overfitting, make sure to separate your dataset into train, evaluation, and test splits and stop training when performance on the evaluation set decreases. In addition, increasing data augmentation can help with overfitting.
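The PCA check suggested in point 3 could look roughly like this: a sketch using plain numpy SVD, where `with_d2o` and `without_d2o` are hypothetical arrays of spectra cropped to the 500-1800 cm-1 region (synthetic data here, with an artificial deuterium signature injected for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical cropped spectra (rows = spectra, columns = Raman-shift channels).
without_d2o = rng.normal(0.0, 1.0, (200, 500))
with_d2o = rng.normal(0.0, 1.0, (200, 500))
with_d2o[:, 100:120] += 3.0  # pretend deuterium leaves a signature in-band

X = np.vstack([without_d2o, with_d2o])
labels = np.array([0] * 200 + [1] * 200)

# PCA via SVD of the mean-centred data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T  # project onto the first two principal components

# Crude separation check: class-mean gap along PC1 vs. overall spread of PC1.
gap = abs(scores[labels == 0, 0].mean() - scores[labels == 1, 0].mean())
spread = scores[:, 0].std()
separated = gap > spread  # True would suggest deuterium is visible in 500-1800
```

In practice one would scatter-plot `scores` coloured by deuterium status rather than rely on a single threshold; clear clustering would indicate the unwanted in-band bias Conor describes.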

Cheers, Conor

ever4244 commented 1 year ago

> 3. To check this you could perform a PCA on your dataset (cropped to 500-1800 Raman shift) and see if there is separation between spectra with/without deuterium.

Thank you very much!

I am very grateful for your help.

So far, I haven't found a good solution to the overfitting problem, although I have done all the things you suggested. The nature of my task requires the model to reconstruct/denoise cell signatures that are outside all of the training/validation/test sets (i.e. a completely new cell type). My current solution is to give up the full-length transformer model and limit the scope (context window size) of the CNN filters, or use local self-attention, so that the model can only learn short-distance features, preventing it from memorizing the entire cell spectrum. The full-length transformer has the best results so far but tends to learn long-distance features, which I think are quite correlated with known cell types.
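The local self-attention idea above can be sketched as a banded attention mask, where each spectral position may only attend to neighbours within a window `w`. A minimal numpy illustration (window size and sequence length are arbitrary choices, not values from the thread):

```python
import numpy as np

def local_attention_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask: True where position i may attend to position j.
    Restricting |i - j| <= window keeps attention short-range, so the
    model cannot relate distant spectral regions and has less capacity
    to memorize whole cell-type spectra."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = local_attention_mask(seq_len=500, window=16)

# In an attention layer, logits outside the band would be set to -inf
# before the softmax so their weights become zero:
logits = np.zeros((500, 500))
logits[~mask] = -np.inf
```

Each position then sees at most 2*window + 1 neighbours, making the effective receptive field an explicit hyperparameter to trade off against denoising quality.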

I will share with you our research progress in the future once we submit our draft.