danielhrisca / asammdf

Fast Python reader and editor for ASAM MDF / MF4 (Measurement Data Format) files
GNU Lesser General Public License v3.0

Different result with iter_to_dataframe and to_dataframe #893

Closed alex-ruehe closed 9 months ago

alex-ruehe commented 10 months ago

Python version

Please run the following snippet and write the output here

('python=3.11.0 (main, Nov  7 2022, 20:37:44) [Clang 14.0.0 '
 '(clang-1400.0.29.202)]')
'os=macOS-13.4.1-x86_64-i386-64bit'
'numpy=1.25.2'
ldf is not supported
xls is not supported
xlsx is not supported
yaml is not supported
'asammdf=7.3.14'

Code

MDF version

4.10

Code snippet

from asammdf import MDF

# Load mdf
mdf = MDF("example_iter_to_dataframe_bug.scrambled.mf4")

# Configure mdf according to PO instructions
mdf.configure(float_interpolation=0, integer_interpolation=0)

iter_df_list = []

# we set raw=False, but still get raw values in the resulting dataframe here
for df_mdf_iter in mdf.iter_to_dataframe(
    raster=0.05,
    reduce_memory_usage=True,
    ignore_value2text_conversions=True,
    time_from_zero=False,
    raw=False,
):
    iter_df_list.append(df_mdf_iter)

# if we use to_dataframe on same mdf we get correct values
to_dataframe = mdf.to_dataframe(
    raster=0.05,
    reduce_memory_usage=True,
    ignore_value2text_conversions=True,
    time_from_zero=False,
    raw=False,
)

# we would expect the same data in both dataframes
print(to_dataframe['QUTUAIJHDBXVIFDXXULUXRA'].sample)
print(iter_df_list[0]['QUTUAIJHDBXVIFDXXULUXRA'].sample)

Traceback

None

Description

The fastest way to debug is to have the original file. For data protection you can use the static method scramble to scramble all text blocks, and send the scrambled file by e-mail.

Hi, if you run the above code snippet, we would expect the two data frames to be identical. But they differ: the result of iter_to_dataframe contains only the raw values, because the conversion is not applied. We had a quick look at the source code and think that the issue comes from https://github.com/danielhrisca/asammdf/blob/master/src/asammdf/mdf.py#L4231 - we guess that the comparison should be <= and not <. You can use the attached example file to reproduce this.

example_iter_to_dataframe_bug.scrambled.mf4.zip

jlyda commented 9 months ago

The conversion_type of the signal is 7 == CONVERSION_TYPE_TABX == "value to text/scale conversion tabular look-up". When ignore_value2text_conversions == True, the conversion is ignored. But the conversion actually contains a default conversion in conversion.referenced_blocks whose conversion_type is 1 == CONVERSION_TYPE_LIN, so the signal conversion should not be ignored. I prepared a PR that also evaluates the underlying default conversion type: #909

danielhrisca commented 9 months ago

Any of the referenced blocks in the TABX can be another conversion, not just the default one. You can actually have multiple nested layers of conversion, so how do you draw the line, if not by completely ignoring TABX conversions altogether?

jlyda commented 9 months ago

@danielhrisca I checked the standard: in MDF4, channel conversions of type 7 and 8 do not have to be exclusively text conversions, but can also be scale conversions with tabular look-up, because it says "... to text/scale conversion ...". So I still think that for 7 and 8 it is valid to check at least the first level, or better, to apply the check recursively when ignore_value2text_conversions=True and only skip the conversion if you find a "real" text conversion, i.e. one of the conversion types 9 - 11. Anyway, I will create a PR that refactors the code currently duplicated between iter_to_dataframe and to_dataframe for this part. Then I can override the check myself with my own logic, which solves my problem for the moment.
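
The recursive check proposed here could look roughly like the following. This is only a minimal sketch, not asammdf code: conversions are modelled as plain dicts with hypothetical "conversion_type" and "referenced_blocks" keys, plain strings stand in for text blocks, and the helper name yields_numbers is made up for illustration. The type numbers are the ones quoted in this thread.

```python
# Conversion type numbers as quoted in this thread (MDF4).
CONVERSION_TYPE_LIN = 1
CONVERSION_TYPE_TABX = 7   # value to text/scale conversion
CONVERSION_TYPE_RTABX = 8  # value range to text/scale conversion
TEXT_ONLY_TYPES = {9, 10, 11}  # the "real" text conversions


def yields_numbers(block):
    """Return True if at least one branch of the (nested) conversion is numeric.

    `block` is either a str (a text block) or a dict describing a conversion.
    This dict layout is an assumption made for this sketch only.
    """
    if isinstance(block, str):
        return False  # a plain text entry
    ctype = block["conversion_type"]
    if ctype in TEXT_ONLY_TYPES:
        return False
    if ctype in (CONVERSION_TYPE_TABX, CONVERSION_TYPE_RTABX):
        # Descend into every referenced block, not only the default one.
        return any(yields_numbers(ref) for ref in block["referenced_blocks"])
    return True  # linear, rational, tabular, ... are numeric


linear = {"conversion_type": CONVERSION_TYPE_LIN, "referenced_blocks": []}
tabx_with_scale = {
    "conversion_type": CONVERSION_TYPE_TABX,
    "referenced_blocks": ["TEXT0", linear],
}
tabx_text_only = {
    "conversion_type": CONVERSION_TYPE_TABX,
    "referenced_blocks": ["TEXT0", "TEXT1"],
}
```

With such a check, a TABX conversion like tabx_with_scale would still be applied when ignore_value2text_conversions=True, while tabx_text_only would be skipped.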

danielhrisca commented 9 months ago

Please try the latest development branch code. With this code, if the converted array contains only text values (all the raw values were converted to texts), or if a value is np.nan (one of the raw values was converted to text), then the converted array is ignored and the raw signal samples are returned. Another option would be to apply the numeric conversions and return the raw values for texts.

danielhrisca commented 9 months ago

The other option mentioned above (apply the numeric conversions and return the raw values for texts) is now the default implementation:

from asammdf.blocks.conversion_utils import from_dict
import numpy as np

# value-to-text table nested two levels deep, with a linear default (a=2, b=0)
c = from_dict(
    {
        'val_0': 0,
        'text_0': 'TEXT0',
        'default_addr': {
            'val_0': 1,
            'text_0': 'TEXT1',
            'default_addr': {
                'a': 2,
                'b': 0,
            },
        },
    }
)

print(c.convert(np.array([1, 2, 3, 0, 4, 0, 1]), ignore_value2text_conversions=True))

returns

[1. 4. 6. 0. 8. 0. 1.]
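
To make the output above easier to follow, the same rule can be mimicked in plain Python. This is a simplified sketch and not the asammdf implementation: the helper name convert_keep_raw is hypothetical, it handles one value at a time, and it assumes the linear part computes a * raw + b, which matches the values printed above.

```python
def convert_keep_raw(raw, conversion):
    """Walk a nested value-to-text/scale table (dict layout as in the thread).

    If the raw value matches a text entry, the raw value itself is returned;
    otherwise the lookup falls through to the default conversion.
    """
    while conversion is not None:
        if "a" in conversion:
            # Linear conversion (assumed form: a * raw + b).
            return float(conversion["a"] * raw + conversion["b"])
        matched = any(
            key.startswith("val_") and value == raw
            for key, value in conversion.items()
        )
        if matched:
            return float(raw)  # text result is skipped, raw value is kept
        conversion = conversion.get("default_addr")
    return float(raw)  # no default conversion left: keep the raw value


table = {
    "val_0": 0,
    "text_0": "TEXT0",
    "default_addr": {
        "val_0": 1,
        "text_0": "TEXT1",
        "default_addr": {"a": 2, "b": 0},
    },
}

converted = [convert_keep_raw(x, table) for x in [1, 2, 3, 0, 4, 0, 1]]
print(converted)
```

The raw values 0 and 1 map to texts and are therefore passed through unchanged, while 2, 3 and 4 reach the linear default and are doubled.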

jlyda commented 9 months ago

Great @danielhrisca! This is exactly the sustainable change I needed, and it fixes this issue! Do you know when you expect this to be released?

danielhrisca commented 9 months ago

There will be a release by the end of next week

jlyda commented 9 months ago

Thanks a lot for the quick reply! That would be amazing! Keep up the great work!