equinor / dlisio

Python library for working with the well log formats Digital Log Interchange Standard (DLIS V1) and Log Information Standard (LIS79)
https://dlisio.readthedocs.io/en/latest/

Working with multidimensional curves in DLIS files #436

Closed lucasblanes closed 3 months ago

lucasblanes commented 3 months ago

Hello! I apologize if this is a topic that has already been covered here (I looked through the closed issues and didn't find exactly what I want), but I would like to know the fastest and most efficient way to extract multidimensional (non-scalar) curves and save them in a numpy array or a dataframe.

I am working with an NMR DLIS file that has both 1D curves and multidimensional curves (a T2 distribution, for example) in the same frame. When I try to extract them to a dataframe, I get errors saying that the T2 distribution cannot be extracted because it is non-scalar. The dlisio documentation shows a quick way to turn the curves into a dataframe when there is only scalar data:

As long as the frame only contains channels with scalar samples, it can be trivially converted to a pandas DataFrame:

import pandas as pd
curves = pd.DataFrame(frame.curves())

Source: https://dlisio.readthedocs.io/en/latest/dlis/userguide.html

Is there a function that also handles multidimensional curves? If not, could someone post code here that does this quickly and saves the result in a dataframe or a numpy array? I couldn't find this anywhere on the internet.

Detail: I need all of the frame's curves to go into the same dataframe or numpy array. Creating two separate objects for the 1D and multidimensional curves doesn't help me.

Thanks!

achaikou commented 3 months ago

Hi!

No, I don't think this question has been asked before.

You are right, it seems impossible to easily read all multidimensional curves into the same pandas dataframe. From here:

Note that pandas (and CSV) only supports scalar sample values. I.e. frames containing one or more channels that have non-scalar sample values cannot be converted to pandas.DataFrame or CSV directly.

I am not a pandas specialist, so I don't know what the possible workarounds are.

However, you say that a numpy array is enough. frame.curves() already returns a numpy.ndarray. Is there something preventing you from using it directly, without any additional conversion?
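
For illustration, direct use could look something like this. This is only a sketch: the file name and the mnemonics TDEP and T2DIST are made up, and I'm assuming the structured array is indexed by channel mnemonic:

from dlisio import dlis

with dlis.load('nmr.dlis') as files:       # hypothetical file name
    f, *tail = files
    frame = f.frames[0]
    curves = frame.curves()                # one structured ndarray with every channel

# The data is already in memory, so the arrays can be used after the file is closed.
# Scalar channel: one value per frame row
depth = curves['TDEP']                     # shape (nrows,)

# Multidimensional channel, e.g. a T2 distribution: one array per frame row
t2 = curves['T2DIST']                      # shape (nrows, nbins)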

lucasblanes commented 3 months ago

Thanks for the answer, Alena. I am currently able to save the numpy array that frame.curves() returns in a pandas dataframe cell:

[Screenshot: DataFrame editor showing the numpy array stored in a DataFrame cell]

The problem is that I do this iteratively, and it is quite time consuming for DLIS files that contain many multidimensional arrays, such as image logs or wireline formation tests. My code is the following:

import numpy as np

def summary_curve_values(df_in, curve_index='Curvas', unit_index='Unidade',
                         nan_value=-999.25, verbose=True):
    values = []
    mins   = []
    maxs   = []
    means  = []
    median = []
    for i in range(len(df_in)):
        if verbose:
            print(f'Starting curve {i + 1} of {len(df_in)}.')
        # The cell holds the bound Channel.curves method; calling it reads the data
        curve = df_in.loc[i, curve_index]()
        if 'int' in str(curve.dtype):
            curve = curve.astype(np.float64)
        # Mask the null value before any unit conversion, otherwise the
        # scaled nulls no longer match nan_value
        curve[curve == nan_value] = np.nan
        if df_in.loc[i, unit_index] == 'meters':
            curve = curve * 0.00254
        values.append(curve)
        mins.append(  np.nanmin(   curve))
        maxs.append(  np.nanmax(   curve))
        means.append( np.nanmean(  curve))
        median.append(np.nanmedian(curve))
    df_in[curve_index] = values
    df_in['Mínimo']  = mins
    df_in['Máximo']  = maxs
    df_in['Média']   = means
    df_in['Mediana'] = median

A comment: I first store the frame.curves object in the dataframe cell and only later, if the user wants, call it through frame.curves(). I do this because I first present the DLIS information to the user, and only if they decide to load the curves does the function above run to extract them. That way the code spends no time loading curves the user doesn't want.
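
Roughly, the setup looks like this (simplified sketch; frame is the dlisio Frame object and the column names match the function above):

import pandas as pd

# Store only metadata and the bound Channel.curves methods; nothing is
# read from disk until one of the stored callables is invoked.
df = pd.DataFrame({
    'Curvas':  [ch.curves for ch in frame.channels],   # callables, not data
    'Unidade': [ch.units  for ch in frame.channels],
})

# Only if the user asks for the data:
# summary_curve_values(df)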

In short, my problem is now one of efficiency. The frame.curves() function is very slow to use in a for loop; I need faster code. Does anyone have an idea for using pd.DataFrame(frame.curves()) to extract the 1D curves and then iterating only over the N-D channels to get the multidimensional curves?
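
Something along these lines is what I have in mind (a sketch only; I'm assuming scalar channels have Channel.dimension equal to [1] and that the channel names in the frame are unique):

import pandas as pd

curves = frame.curves()                       # one read of the whole frame

scalar = [ch.name for ch in frame.channels if ch.dimension == [1]]
multi  = [ch.name for ch in frame.channels if ch.dimension != [1]]

# 1-D channels become ordinary scalar columns...
df = pd.DataFrame({name: curves[name] for name in scalar})

# ...while each N-D channel becomes an object column whose cells each hold
# one sample of that channel (an ndarray per frame row)
for name in multi:
    df[name] = list(curves[name])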

achaikou commented 3 months ago

The frame.curves() function is very slow to use in a for loop.

I think you shouldn't call frame.curves() inside the loop. Call it just once, after the user has indicated they want to load any curves, then cache the result (all_curves = frame.curves()) and use that full all_curves ndarray as the source to populate your 1-D pandas frame if you need to.

A single call to frame.curves() loads all the curves in the frame. Extracting curves separately (1-D, 1-D, 1-D, ...) channel-by-channel with frame.channels[i].curves() is very likely to be much slower:

Due to the memory-layout of dlis-files, reading a single channel from disk and reading the entire frame is almost equally fast. That means reading channels from the same frame one-by-one with this method is way slower than reading the entire frame with Frame.curves() and then indexing on the channels-of-interest.

I think we actually read all the curves together anyway, even when just one is requested; we simply return the values for that one channel. So a single call to frame.curves() seems to be the only option in your case.
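
To illustrate the difference (a sketch, assuming the channel mnemonics in the frame are unique so the structured-array fields match Channel.name):

# Slow: every call re-reads the whole frame from disk
per_channel = [ch.curves() for ch in frame.channels]

# Fast: read once, then slice the cached array in memory
all_curves = frame.curves()
per_channel = [all_curves[ch.name] for ch in frame.channels]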

lucasblanes commented 3 months ago

I tried it and it worked perfectly! I now recover all the curves from an image-log DLIS in 0.62 minutes, compared to 18 minutes previously. Thank you very much!

lucasblanes commented 3 months ago

Just for the record, for other colleagues: the code I use to extract these curves is as follows:

import numpy as np

def summary_curve_all(df_in, frame_in, nan_value=-999.25):
    # A single call loads every channel in the frame;
    # tolist() turns it into one tuple of values per sample (row)
    curves = frame_in.curves()
    curves = curves.tolist()

    all_curves = []
    all_mins   = []
    all_maxs   = []
    all_means  = []
    all_median = []

    # Field 0 is the frame index channel, so start at 1 and rebuild each
    # remaining column by collecting its value from every sample
    for i in range(1, len(curves[0])):
        curve = [curves[j][i] for j in range(len(curves))]
        curve_to_append = np.array(curve)
        if 'int' in str(curve_to_append.dtype):
            curve_to_append = curve_to_append.astype(np.float64)
        # Mask the null value before any unit conversion, otherwise the
        # scaled nulls no longer match nan_value
        curve_to_append[curve_to_append == nan_value] = np.nan
        if df_in.loc[i - 1, 'Unidade'] == 'meters':
            curve_to_append = curve_to_append * 0.00254
        all_curves.append(             curve_to_append)
        all_mins.append(  np.nanmin(   curve_to_append))
        all_maxs.append(  np.nanmax(   curve_to_append))
        all_means.append( np.nanmean(  curve_to_append))
        all_median.append(np.nanmedian(curve_to_append))

    df_in['Curvas']  = all_curves
    df_in['Mínimo']  = all_mins
    df_in['Máximo']  = all_maxs
    df_in['Média']   = all_means
    df_in['Mediana'] = all_median
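
A possible further simplification (untested on my files) would be to index the structured array by mnemonic instead of going through tolist(), assuming the channel names in the frame are unique and that the first field is the index channel, as in the loop above:

import numpy as np

def summary_curve_all_by_name(df_in, frame_in, nan_value=-999.25):
    curves = frame_in.curves()
    values, mins, maxs, means, medians = [], [], [], [], []
    for i, name in enumerate(curves.dtype.names[1:]):        # skip the index channel
        col = curves[name].astype(np.float64)                # astype copies, so edits don't touch curves
        col[col == nan_value] = np.nan                       # mask nulls before converting units
        if df_in.loc[i, 'Unidade'] == 'meters':
            col = col * 0.00254
        values.append(col)
        mins.append(np.nanmin(col))
        maxs.append(np.nanmax(col))
        means.append(np.nanmean(col))
        medians.append(np.nanmedian(col))
    df_in['Curvas']  = values
    df_in['Mínimo']  = mins
    df_in['Máximo']  = maxs
    df_in['Média']   = means
    df_in['Mediana'] = medians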