EnzymeML / PyEnzyme

🧬 - Data management and modeling framework based on EnzymeML.
BSD 2-Clause "Simplified" License

Problems with data export from example file #58

Open danolson1 opened 1 year ago

danolson1 commented 1 year ago

When I import the EnzymeML_Template_Example.xlsm file into an EnzymeMLDocument using pyenzyme, and then try to look at the data, I get a dataframe with the "absorbance" and "concentration" data concatenated on top of each other. I have a couple of questions about this:

  1. Is this the expected behavior? I would have expected to get a dataframe that roughly corresponded to the input data from the Excel template file. In this case, one column for time and two columns for the pyruvate (species s0) data, one corresponding to concentration and one to absorbance.
  2. If you have a datatype for absorbance, is there any place to store information about the absorbance wavelength?
  3. From an EnzymeMLDocument object, how do I find the data type? I would have expected this information to be accessible from the measurement_dict object.

Regards, Dan

JR-1991 commented 1 year ago

Hi Dan! Thanks for submitting the issue and your questions. Happy to answer your questions:

Is this the expected behavior? I would have expected to get a dataframe that roughly corresponded to the input data from the excel template file. In this case, one column for time, and two columns for the pyruvate (species s0) data, one corresponding to concentration, and one to absorbance.

This is the expected behavior. It was implemented to support the modeling platforms we communicate with. I am happy to add a flag that disables it and places the species columns side by side instead.

If you have a datatype for absorbance, is there any place to store information about the absorbance wavelength?

At this point, EnzymeML has no place to store the wavelength of an absorbance measurement, but this is work in progress and will be implemented soon.

From an EnzymeMLDocument object, how do I find the data type? I would have expected this information to be accessible from the measurement_dict object.

The data_type information is tied to the Replicate object, which is a container for the measured values of a species. The Measurement object, on the other hand, represents a set of Replicates and initial concentrations. Hence, you can access the individual data types by getting the replicates. Here is an example that uses the EnzymeML_Template_Example.xlsm spreadsheet:

# Get the measurement with the id "m1"
measurement = enzmldoc.getMeasurement("m1")

# Get the reactant with the id "s0"
s0 = measurement.getReactant("s0")

# Finally, get all replicates and print their data types
for replicate in s0.replicates:
    print(replicate.data_type)

# Out:
#     DataTypes.ABSORPTION
#     DataTypes.CONCENTRATION

Would you prefer the DataFrame export to filter by data type? That way, different types would not be mixed up.
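To make the filtering idea concrete, here is a minimal sketch of selecting replicates by data type before anything is exported. It does not use PyEnzyme itself: the `DataTypes` enum and the `Replicate` class below are simplified stand-ins, the `time` and `data` attribute names are assumptions, `select_replicates` is a hypothetical helper, and the numbers are placeholder values, not data from the template.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List


class DataTypes(Enum):
    # Stand-in for PyEnzyme's enum; only the two members printed above.
    ABSORPTION = "abs"
    CONCENTRATION = "conc"


@dataclass
class Replicate:
    # Stand-in for PyEnzyme's Replicate; `time` and `data` are assumed names.
    id: str
    data_type: DataTypes
    time: List[float]
    data: List[float]


def select_replicates(
    replicates: Iterable[Replicate], wanted: DataTypes
) -> List[Replicate]:
    """Keep only the replicates whose data_type equals `wanted`."""
    return [rep for rep in replicates if rep.data_type is wanted]


# Placeholder values for illustration only.
replicates = [
    Replicate("abs_rep", DataTypes.ABSORPTION, [0.0, 1.0, 2.0], [0.9, 0.7, 0.5]),
    Replicate("conc_rep", DataTypes.CONCENTRATION, [0.0, 1.0, 2.0], [10.0, 8.0, 6.0]),
]

concentration_only = select_replicates(replicates, DataTypes.CONCENTRATION)
print([rep.id for rep in concentration_only])
```

With real PyEnzyme objects, the same comparison on `replicate.data_type` could be applied to the replicates of the reactant retrieved above, provided the attribute behaves as in the printed output.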

All the best, Jan

danolson1 commented 1 year ago

I think that only concentration data should be exported by default, since that is the data that is most likely to be used by people other than the creator of the EnzymeML document.

Without wavelength information, the absorbance data does not seem particularly useful. I agree that it should be saved for archival purposes, but this is one more reason to exclude it from the default export.

If the absorbance data is exported, it should go in a separate column. To me, a single column that contains both the expected concentration data and the unexpected (and differently scaled) absorbance data seems very confusing.
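The layout proposed here can be sketched with pandas. This is only an illustration under stated assumptions: the long-format frame with a `data_type` label column is hypothetical (the current export may not carry such a label), `to_wide` and its `include_absorbance` parameter are invented names, and the values are placeholders.

```python
import pandas as pd

# Hypothetical long-format export: one row per (time, data type) pair.
# Placeholder values for illustration only.
long_df = pd.DataFrame(
    {
        "time": [0.0, 1.0, 2.0, 0.0, 1.0, 2.0],
        "data_type": ["concentration"] * 3 + ["absorbance"] * 3,
        "s0": [10.0, 8.0, 6.0, 0.9, 0.7, 0.5],
    }
)


def to_wide(
    df: pd.DataFrame, species: str = "s0", include_absorbance: bool = False
) -> pd.DataFrame:
    """Return one row per time point and one column per exported data type."""
    if not include_absorbance:
        # Default: export concentration data only.
        df = df[df["data_type"] == "concentration"]
    wide = df.pivot(index="time", columns="data_type", values=species)
    # Name each column after the species and its data type.
    wide.columns = [f"{species}_{name}" for name in wide.columns]
    return wide.reset_index()


default_export = to_wide(long_df)
full_export = to_wide(long_df, include_absorbance=True)
print(default_export)
print(full_export)
```

The default call yields a `time` column plus a single concentration column, while the opt-in call adds the absorbance values as their own column, so the two scales never share a column.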