LorenFrankLab / rec_to_nwb

Data Migration REC -> NWB 2.0 Service

Loss of precision when converting ephys data from rec to nwb #54

Open rly opened 1 year ago

rly commented 1 year ago

Currently, rec_to_nwb does the following:

  1. calls rec_to_binaries which converts the raw ephys voltage data from .rec to .mda format (dtype = int16; ADC units)
  2. parses the "raw_data_to_volts" key from the metadata YAML. According to a Jan 2021 Slack message from Loren, this value should always be set to 0.000000195 (or 1.95e-7)
  3. multiplies the above value by 1e6 to get the conversion factor from raw to uV (0.195). This matches the value stored in the .rec XML file headers (rawScalingToUv="0.19500000000000001")
  4. multiplies the raw int16 data (in ADC units) from the .mda file by the above value (0.195) and then sets the dtype to int16, which truncates any values after the decimal point (0.99 -> 0)
  5. writes this transformed raw data (now in uV) to an NWB ElectricalSeries object named "e-series" with a 1e-6 conversion factor, used to convert the data to volts

Because of the data transformation in Step 4 above, there is a loss of precision. Let's say the original .rec file data has values:

>>> np.arange(10, dtype="int16")
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int16)

then after multiplying by 0.195 to get the data in uV, the values are:

>>> np.arange(10, dtype="int16") * 0.195
array([0.   , 0.195, 0.39 , 0.585, 0.78 , 0.975, 1.17 , 1.365, 1.56 ,
       1.755])

then after setting the dtype to int16, the values are:

>>> (np.arange(10, dtype="int16") * 0.195).astype("int16")
array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1], dtype=int16)

Note the loss of precision in the resulting output. If the original data has unique values -100 to 100 (201 possible), then the converted NWB file will have unique values -19 to 19 (39 possible). This could have an impact on spike sorting and LFP filtering - probably a small impact, but still I think some impact?
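The counts above can be checked with a short sketch. It assumes only NumPy and the 0.195 factor from Step 3; the variable names are made up for illustration and are not taken from rec_to_nwb.

```python
import numpy as np

# Hypothetical raw data: every ADC value from -100 to 100 (201 unique values).
raw = np.arange(-100, 101, dtype="int16")

# Same transformation as Step 4: scale to uV, then cast back to int16.
# Casting float -> int16 truncates toward zero.
converted = (raw * 0.195).astype("int16")

n_raw = np.unique(raw).size
n_converted = np.unique(converted).size

print(n_raw, n_converted, converted.min(), converted.max())
```

Because the cast truncates toward zero, both -19.5 and 19.5 collapse to magnitude 19, which is where the -19 to 19 range comes from.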

For the above reason, it is more common to store the raw, untransformed int16 ephys data (ADC units) from an acquisition system as the ElectricalSeries data, and store the conversion factor (here: 0.000000195). However, NWB users (such as Spyglass) have to remember to multiply the data by the conversion factor to get the data in volts. (The NWB team is working on improving this messaging...). Note that this makes using the data just a little slower and converting the data just a little faster.

I suggest that the ephys data be stored in the original ADC units, because currently some precision is lost, and the cost of multiplying during use is small.
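Here is a minimal sketch of what "multiplying during use" could look like, in plain NumPy rather than the pynwb API. The helper name `to_microvolts` and the constant `RAW_TO_VOLTS` are hypothetical; the value 1.95e-7 is the "raw_data_to_volts" value mentioned above.

```python
import numpy as np

# Conversion factor stored alongside the data (ADC units -> volts).
RAW_TO_VOLTS = 1.95e-7


def to_microvolts(raw_adc, conversion=RAW_TO_VOLTS):
    """Convert raw int16 ADC samples to float microvolts at read time."""
    # Cast to float64 first so the product keeps its fractional part.
    return raw_adc.astype("float64") * conversion * 1e6


# The stored data stays untouched as int16 ADC units.
raw_adc = np.arange(10, dtype="int16")
microvolts = to_microvolts(raw_adc)
print(microvolts)
```

The data on disk stays int16, so file size is unchanged; the float array only exists in memory when a reader asks for physical units.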

lfrank commented 1 year ago

This loss of precision is actually not a problem; the noise level of the recordings is several microvolts. Also, it is really useful to know the units of the data and not to have to convert. As such, I'd like for us to keep this as is...

khl02007 commented 1 year ago

@rly @lfrank I support @rly's suggestion. It may be the case that the quantization noise from early conversion is not large, but there is no reason to add noise to our data. It also doesn't require more disk space since everything would remain int16. The only downside seems to be that it is unintuitive, but right now we still have to multiply the data by the value in the conversion field of the ElectricalSeries because the unit is volts, not microvolts; if we're going to do this, then we might as well multiply by the actual ADC-to-microvolt conversion factor. This can be done automatically in spyglass so the user only sees data in microvolts. Finally, the conversion factor carries useful information: e.g. if it is 0.000000195 then we can infer that the data was acquired with Intan amplifier chips. This can be useful for understanding the data acquisition process later.
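To put a rough number on the noise added by early conversion, the sketch below compares the truncated int16 values with the exact float values over a hypothetical range of ADC inputs. It assumes only NumPy and the 0.195 uV-per-count factor from the issue description; it is an illustration, not a measurement on real recordings.

```python
import numpy as np

ADC_TO_UV = 0.195  # uV per ADC count, from the issue description

# Hypothetical input: every ADC value from -100 to 100.
raw = np.arange(-100, 101, dtype="int16")

exact_uv = raw * ADC_TO_UV               # float64, no information lost
stored_uv = exact_uv.astype("int16")     # what the current pipeline writes

# Error introduced by truncating toward zero, in uV.
error_uv = exact_uv - stored_uv
max_abs_error = np.abs(error_uv).max()
rms_error = np.sqrt(np.mean(error_uv ** 2))

print(max_abs_error, rms_error)
```

The truncation error is always below 1 uV, but it can be several times the original ADC step of 0.195 uV, which is the precision that storing raw ADC units would preserve.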