Convert raw raman dx data to npy and hdf dataframe

WentongZhou commented 2 years ago

Hi there,

Thanks for this fantastic work. Your package uses hdf and npy files to store Raman data. Could you provide some information on how you convert raw Raman data (for example, data in dx format) into those h5 and npy files?

Thanks, Wentong

ALebrun-108 commented 2 years ago

Hi,

First of all, I'm very glad to hear that you like my work; it motivates me to keep improving it! In response to your question, I usually convert and compile my raw spectra into a pandas dataframe, which can then be saved as a numpy array or an hdf5 database. This has proven to be an effective way to structure my raw spectra before saving them, and it lets me use the data visualization functions that come with pandas. So I think it should be possible to convert your data (dx format, csv, ...) to a pandas dataframe using the pd.read_csv function.
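
As a minimal sketch of that idea, assuming a comma-separated text source whose first column is the x-axis and whose remaining columns hold one spectrum each: the sample text, the `load_spectra` helper and the output file name are all made up for illustration, and a real dx file will likely need its header lines handled before pd.read_csv can parse it.

```python
import io
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical file contents: first column = Raman shift, other columns = spectra.
RAW_TEXT = """shift,s1,s2
400.0,1.0,4.0
401.0,2.0,5.0
402.0,3.0,6.0
"""


def load_spectra(source):
    """Read a comma-separated source and return (dataframe, x-axis array).

    The returned dataframe holds one spectrum per row.
    """
    raw = pd.read_csv(source, sep=",")
    wn = raw.iloc[:, 0].to_numpy(dtype="float64")  # x-axis values
    spectra = raw.iloc[:, 1:].T.reset_index(drop=True)  # one row per spectrum
    return spectra, wn


df, wn = load_spectra(io.StringIO(RAW_TEXT))

# Save the x-axis as a .npy file and read it back.
# The dataframe itself could be saved with df.to_hdf(...), which needs the
# optional PyTables dependency, so it is left out of this sketch.
out_dir = tempfile.mkdtemp()
npy_path = os.path.join(out_dir, "raman_shift.npy")
np.save(npy_path, wn)
wn_loaded = np.load(npy_path)
```

Passing a file path instead of the `io.StringIO` object would read from disk in the same way.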

I hope I have answered your question correctly, and I remain available to offer more details if necessary.

Have a nice day,

Alexis Lebrun

WentongZhou commented 2 years ago

Hi, thanks for your reply. Could you provide your data preparation script (the one that stores multiple spectra) so that we can try your package more easily? I agree that pandas is a nice way to handle this kind of data.

Thanks Wentong

ALebrun-108 commented 2 years ago

Yes, certainly!

Here is the script I use; I will add it to the package soon:

  import os

  import pandas as pd


  def database_creator(directory, class_names=None, nfiles_class=None, checkorder=True, skiprows=2, save_path=None):
      """
      Returns a database (pandas dataframe) containing the spectra and the associated labels, along
      with the array of x-axis values (Raman shift, wavelength, etc.) associated with the spectra.

      Notes:
          Intended to be used on the text files produced by the Raman spectrometer (D-boudreau Lab)

          Parameter nfiles_class is taken into account only if class_names contains more than one class.

          The alphabetical order of the .txt files in the directory must be followed for the lists
          or tuples class_names and nfiles_class.

          Files must follow the following format:
              -First column = wavelength or Raman shift
              -Other columns = one spectrum per column
              -First two rows at the top correspond to the hyperspectral coordinates in x and y

      Parameters:
          directory : string
              Directory (path) of the spectra files(.txt).

          class_names : string, integer, list or tuple, default=None
              Names associated to the classes present in the database.

          nfiles_class : list or tuple, default=None
              Number of (.txt) files in "directory" for each class contained in "class_names".

          checkorder : boolean, default=True
              If true, print the file names in the order used to build the database

          skiprows : integer, default=2
              Number of rows in .txt files to skip. With some exceptions (covid 19 crash), the
              text files exported from the spectroRamanX always include two rows for hyperspectral
              coordinates that need to be removed, which explains the default value of 2.

          save_path : string, default=None
              Path where the dataframe is saved. If None, saving does not occur.
              Recommended format : .h5

      Returns:
          (pandas dataframe) Database with the spectra and their associated labels.

          (array) X-axis (wavenumber, wavelength, Raman shift, etc.) used for the spectra.
                  Array shape = (n_pixels, ).
      """
      # labels list space allocation
      labels = []
      # retrieves all text files in the given directory
      filenames = [f for f in os.listdir(directory) if f.endswith(".txt")]
      filenames.sort()  # sort files in alphabetical order

      if checkorder is True:
          # to check the order of the files
          print(filenames)

      if class_names is None:
          # no class name is given, labels are set to "unknown" for all spectra
          labels = ['unknown'] * len(filenames)
      elif isinstance(class_names, (str, int)):
          # all spectra belong to the same class, the same label is used for all spectra.
          labels = [class_names] * len(filenames)
      elif isinstance(class_names, (list, tuple)):
          # different classes are used, different labels are given to spectra files
          if len(class_names) == len(nfiles_class):
              for i in range(len(class_names)):
                  x = [class_names[i]] * nfiles_class[i]
                  labels = labels + x
          else:
              raise ValueError('if class_names is a list or a tuple, its number of elements must correspond '
                               'to the number of elements in nfiles_class')
      # space allocation
      dataframe = pd.DataFrame()
      wn = []

      for (name, lab) in zip(filenames, labels):
          df = pd.read_csv(os.path.join(directory, name),  # works with or without a trailing separator
                           header=skiprows-1,  # row (skiprows - 1) gives the column names; rows above it are dropped
                           decimal='.',
                           sep=',',
                           comment='#')
          df = df.T
          wn_series = df.iloc[0, :]  # Pandas series to store in a hdf5 database
          wn = wn_series.to_numpy(dtype='float64')
          df.drop(df.index[0], inplace=True)  # Remove indexes and Raman shift
          df.insert(0, 'Classes', lab)  # Specifies the labels for each spectrum
          dataframe = pd.concat([dataframe, df], ignore_index=True)

      if save_path is not None:
          if save_path.endswith('.h5'):
              # save spectra and wavelengths in an hdf5 database
              dataframe.to_hdf(save_path, key='spectra', mode='w', format='table')
              wn_series.to_hdf(save_path, key='wn')
          else:
              raise ValueError('Invalid extension, it must be \'.h5\'')
      return dataframe, wn
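
The label-building rules used in the script can be sketched on their own as follows. The `build_labels` helper and the class names "A" and "B" are hypothetical, and the final check that the counts add up to the number of files is an extra assumption that is not part of the script above.

```python
def build_labels(n_files, class_names=None, nfiles_class=None):
    """Return one label per file, in the alphabetical order of the files."""
    if class_names is None:
        # No class name given: every file is labelled "unknown".
        return ["unknown"] * n_files
    if isinstance(class_names, (str, int)):
        # A single class: the same label is used for every file.
        return [class_names] * n_files
    if nfiles_class is None or len(class_names) != len(nfiles_class):
        raise ValueError("class_names and nfiles_class must have the same length")
    labels = []
    for name, count in zip(class_names, nfiles_class):
        labels.extend([name] * count)
    # Extra safety check (an assumption, not in the original script).
    if len(labels) != n_files:
        raise ValueError("sum(nfiles_class) must equal the number of files")
    return labels


# Five files sorted alphabetically: the first two belong to "A", the last three to "B".
labels = build_labels(5, class_names=["A", "B"], nfiles_class=[2, 3])
```

Each label is then applied to every spectrum contained in the corresponding file.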

Hopefully this will be useful to you,

Alexis Lebrun

WentongZhou commented 2 years ago

Hi, thanks for the help! We will cite your work when we publish our results in the near future.

Best, Wentong

renjith-vs commented 2 years ago

@ALebrun-108 Hi Alexis Lebrun, your contribution is a great help to those who work in spectrum analysis. As @WentongZhou said, it would be great if you could show how to convert raw spectral data together with its metadata (demographic details such as name, age, molecule 1, molecule 2, ...) into an .h5 dataframe.

Anticipating your positive response. With regards, Renjith

ALebrun-108 commented 2 years ago

Hi,

Glad to hear you like it too. To answer your question, it is difficult to provide a single piece of code that converts raw spectra from any format into the format used by this package. Please note that the hdf5 format is used here only so that you can use the spectra we measured in our lab; it is optional when using this package with your own spectra. In fact, to use this package, you just need to generate numpy arrays from your own code (one array for the spectra and another one for the labels).
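
For illustration, here is a minimal sketch of those two arrays built from made-up measurements. The sample names and intensity values are hypothetical, and it assumes every spectrum has the same number of points.

```python
import numpy as np

# Hypothetical measurements: each entry is (label, intensity values).
measurements = [
    ("sample_a", [0.1, 0.5, 0.2]),
    ("sample_a", [0.2, 0.6, 0.1]),
    ("sample_b", [0.9, 0.3, 0.4]),
]

# Spectra array: shape (n_spectra, n_pixels), one spectrum per row.
sp = np.array([values for _, values in measurements], dtype="float64")

# Label array: shape (n_spectra,), one label per spectrum, in the same order.
label = np.array([name for name, _ in measurements])
```

However the raw files are parsed, the only requirement in this sketch is that row i of `sp` and entry i of `label` describe the same spectrum.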

Above, I also provided a draft of the code we used to convert our raw spectra (a custom .txt file with metadata) into a pandas dataframe, which can be saved as an hdf5 (.h5) file and from which you can easily extract numpy arrays. I do note, however, that it would be useful to provide code for reading the most common spectral file formats (.spe, .Grams, .spc, etc.), and I plan to add this in the next update.

Hopefully this will be useful to you,

Alexis Lebrun

ALebrun-108 commented 2 years ago

To elaborate further, here is how some of the code examples on the ReadMe page work.

The first step is to import the hdf5 databases as pandas dataframes. This step is optional, since it is also possible to build a dataframe directly from your own data.
```
df = pd.read_hdf('Bile_acids_27_07_2020.h5', key='df')  # Load bile acids dataframe
wn = np.load('Raman_shift_27_07_2020.npy')  # Load Wavenumber (Raman shift)
```

Then we extract the spectra and labels as a numpy array.

```
# Features extraction: Exports dataframe spectra as a numpy array (value type = float64).
sp = df.iloc[:, 1:].to_numpy()
# Labels extraction: Export dataframe classes into a numpy array of string values.
label = df.loc[:, 'Classes'].values
```
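
As a small sketch with made-up values, and assuming the arrays come from a script like `database_creator` above (so that column i of the spectra array holds the intensity at the i-th Raman shift value), the two arrays can be paired by index. The dataframe contents here are hypothetical stand-ins for the loaded data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the loaded Raman shift array and dataframe.
wn = np.array([400.0, 401.0, 402.0, 403.0])
df = pd.DataFrame({
    "Classes": ["a", "b"],
    0: [1.0, 0.5],
    1: [3.0, 0.7],
    2: [2.0, 4.0],
    3: [0.5, 1.0],
})

sp = df.iloc[:, 1:].to_numpy()
label = df.loc[:, "Classes"].values

# The number of intensity columns must match the number of Raman shift values.
assert sp.shape[1] == wn.shape[0]

# Raman shift at which each spectrum reaches its maximum intensity.
peak_shift = wn[sp.argmax(axis=1)]
```

The same indexing would let you plot any row of `sp` against `wn`.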

renjith-vs commented 2 years ago

@ALebrun-108 Thanks for writing back. For a better understanding, could you show the format of the raw spectral time series data in a .txt file before it is converted into a dataframe? This may also help to see how the data is represented against the wavenumber. Also, how do you match the spectral intensities to the wavenumbers (Raman shift), given that the bile acids dataframe and the Raman shift are stored separately?

Anticipating your positive responses.

ALebrun-108 commented 2 years ago

Sorry for the delay,

To answer your questions, the spectra used in this software are not time series but vibrational spectra, measured with a spectrometer calibrated in wavelength (wavelength can be converted to Raman shift, and both are units used in spectroscopy). Thus, the spectra and the Raman shift are not measured separately, but at the same time.
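
For reference, the usual conversion from wavelength to Raman shift can be sketched as below. The function name and the 785 nm excitation wavelength are assumptions chosen for illustration and are not taken from this package.

```python
import numpy as np


def wavelength_to_raman_shift(wavelength_nm, excitation_nm):
    """Convert scattered wavelengths (nm) to Raman shift (cm^-1).

    Raman shift = 1e7 * (1 / excitation - 1 / wavelength), with both in nm.
    """
    wavelength_nm = np.asarray(wavelength_nm, dtype="float64")
    return 1e7 * (1.0 / excitation_nm - 1.0 / wavelength_nm)


# Hypothetical example with a 785 nm excitation laser.
shift = wavelength_to_raman_shift([785.0, 800.0], excitation_nm=785.0)
```

A wavelength equal to the excitation wavelength gives a shift of zero, and longer wavelengths give positive (Stokes) shifts.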

To simplify the use of the package, you can ignore the Raman shift values, as they have no real impact on the operation of the algorithms and are mainly used to associate the Raman spectral bands with specific vibrational groups.

To answer your request, I attach below an image of the data format, with annotations added afterwards:

[Image: github_answer, annotated view of the raw data file format]

Hopefully this will be useful to you,

Alexis Lebrun

ALebrun-108 / BoxSERS

Convert raw raman dx data to npy and hdf dataframe #3