ever4244 opened this issue 1 year ago
Hi Wei,
I am glad to hear that you are finding this work useful. I've tried to provide some brief answers to your questions below.
Cheers, Conor
3. …and see if there is separation between spectra with/without deuterium.
Thank you very much!
I am very grateful for your help.
So far, I haven't found a good solution to the overfitting problem, even though I have done everything you suggested. The nature of my task requires the model to reconstruct/denoise cell signatures that lie outside all of the training/testing/validation sets (i.e. a completely new cell type). My current approach is to give up the full-length transformer model and instead limit the scope (context window size) of the CNN filters or use local self-attention, so that the model can only learn short-distance features, preventing it from memorizing the entire cell spectrum. The full-length transformer gives the best results so far but tends to learn long-distance features, which I think are strongly correlated with the known cell types.
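For reference, this is the kind of restriction I mean: a minimal PyTorch sketch (not taken from DeepeR) of self-attention limited to a local window along the spectral axis; the embedding dimension, head count, and window size are placeholders.

```python
import torch
import torch.nn as nn

def local_attention_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask that forbids attending beyond +/- `window` positions."""
    idx = torch.arange(seq_len)
    dist = (idx[None, :] - idx[:, None]).abs()
    return dist > window  # True = position is masked out

class LocalSelfAttentionBlock(nn.Module):
    """Self-attention restricted to a local spectral neighbourhood."""
    def __init__(self, dim: int = 64, heads: int = 4, window: int = 16):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.window = window

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_channels, dim) -- one embedding per spectral channel
        mask = local_attention_mask(x.size(1), self.window).to(x.device)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return self.norm(x + out)

# Example: a 500-channel spectrum embedded to 64 features per channel
x = torch.randn(8, 500, 64)
y = LocalSelfAttentionBlock()(x)  # shape: (8, 500, 64)
```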
I will share our research progress with you once we submit our draft.
Dear Dr. Horgan:
Good morning!
I have read your paper "High-throughput molecular imaging via deep learning enabled Raman spectroscopy" and your code repository (conor-horgan/DeepeR on GitHub). They are of great help for my current research, which is about Raman spectroscopy and bacterial metabolism.
I wonder if you could help me with several questions:
In your training dataset (159618 × 500), how can I convert the 500-element data arrays back to actual Raman shift? (What is the formula that projects the x-axis, roughly 0 to 1800 cm-1 Raman shift, onto the 0-500 array indices in Training_Input?)
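For context, my current guess is a simple linear mapping between array index and Raman shift; the endpoints below (500 and 1800 cm-1) are only my assumption, which you would need to confirm or correct.

```python
import numpy as np

def index_to_raman_shift(n_channels: int = 500,
                         shift_min: float = 500.0,
                         shift_max: float = 1800.0) -> np.ndarray:
    """Raman shift (cm^-1) for each array index, assuming linear spacing."""
    return np.linspace(shift_min, shift_max, n_channels)

shifts = index_to_raman_shift()
print(shifts[0], shifts[-1])  # 500.0 1800.0, if the assumed range is correct
```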
In our research, a D2O/H2O peak between 1700 and 2700 cm-1 Raman shift is important for predicting the metabolism of the bacteria. However, your dataset has only 500 data points per spectrum (I assume the Raman shift range is about 500-1800 cm-1). Do you have an untruncated dataset with a longer spectral range (for example, 500-4000 cm-1)? We would be very grateful if you could share such a dataset with us.
We have about 10K Raman spectra for E. coli (500-4000 cm-1 Raman shift). I am currently considering randomly concatenating our 10K longer-range spectra with your 160K spectra (500-1800 cm-1 Raman shift); do you think this is a viable approach to data augmentation? Randomly sampling from the 10K longer-range dataset and concatenating with your 160K shorter-range dataset would produce another 160K+ spectra covering the longer range, roughly as sketched below.
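The sketch below is only illustrative: the array names, shapes, and channel counts are placeholders, and in practice the wavenumber axes of the two datasets would need to be aligned and resampled consistently before concatenation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder arrays (real sizes would be ~160000 x 500 and ~10000 longer spectra)
deeper_short = rng.random((1600, 500))   # stands in for the 500-1800 cm^-1 dataset
ecoli_long = rng.random((100, 1300))     # stands in for our 500-4000 cm^-1 E. coli spectra
ecoli_tail = ecoli_long[:, 500:]         # keep only the channels above ~1800 cm^-1

# Pair each short spectrum with the high-wavenumber tail of a random E. coli spectrum
idx = rng.integers(0, ecoli_tail.shape[0], size=deeper_short.shape[0])
augmented = np.concatenate([deeper_short, ecoli_tail[idx]], axis=1)
print(augmented.shape)  # (1600, 1300): each spectrum now spans the longer range
```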
What about overfitting? I found that my current encoder-decoder DL model tends to memorize the average spectrum of the training set. If we encounter a totally new bacterium, the denoising produces a false image (which is understandable). Encoder-decoder models produce the clean output partly from encoder-side information and partly from decoder-side memorization.
In our use cases, the slides often contain new bacteria that are outside the training set. However, on real test samples my current model will still produce the spectrum of E. coli (a bacterium in the training set) rather than the spectrum of Lactobacillus (the new bacterium).
Do you have any suggestions for preventing the DL model from producing output purely from memory when an incoming sample is clearly a new specimen? It might be better to give up some accuracy to keep the model from reconstructing every spectrum toward the training-set specimen average. Maybe putting more weight on the encoder-side information and reducing the memorization capacity on the decoder side? I currently don't have a good solution to this problem and am hoping to hear your insight.
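One direction I am considering, shown below only as a rough sketch (layer sizes are placeholders and none of this is taken from your code): predict a residual correction to the noisy input rather than generating the spectrum from the latent code alone, so that a memorized average spectrum cannot by itself explain the output.

```python
import torch
import torch.nn as nn

class ResidualDenoiser1D(nn.Module):
    """Denoiser that outputs input + correction, forcing reliance on the input."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.decoder = nn.Conv1d(channels, 1, kernel_size=5, padding=2)

    def forward(self, noisy: torch.Tensor) -> torch.Tensor:
        # noisy: (batch, 1, n_channels); output = input + learned correction
        return noisy + self.decoder(self.encoder(noisy))

x = torch.randn(4, 1, 500)
print(ResidualDenoiser1D()(x).shape)  # torch.Size([4, 1, 500])
```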
Regards! WEI LI