SIESTA-eu / wp15

work package 15, use case 2

Privacy assessment .tsv files. #38

Open EmiKib opened 1 month ago

EmiKib commented 1 month ago

An addition to the .tsv scrambler could be to look into how much leakage there is in a scrambled dataset.

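Simple metrics could be full row leakage (every compared cell in a row is unchanged) and partial row leakage (some, but not all, cells in a row are unchanged), with the option to set the value that represents NaN so that those cells are ignored.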

import numpy as np
import pandas as pd

def detect_row_leakage(df_original, df_scrambled, ignore_value=np.nan):
    # ignore_value marks missing data; NaN is an assumed default, pass e.g. "n/a" if needed

    if df_original.shape != df_scrambled.shape:
        print("DataFrames do not have the same shape.")
        return

    total_rows = df_original.shape[0]
    total_columns = df_original.shape[1]
    partial_leakage_count = 0
    full_leakage_count = 0
    matching_cells_per_row = []

    def is_valid(row):
        # NaN never equals itself, so `row != np.nan` would keep every cell
        if pd.isna(ignore_value):
            return row.notna()
        return row != ignore_value

    # Rows are compared by position; both frames must share the same column labels
    for (_, row_orig), (_, row_scram) in zip(df_original.iterrows(), df_scrambled.iterrows()):
        valid_mask = is_valid(row_orig) & is_valid(row_scram)
        valid_count = valid_mask.sum()

        # Count matches where both values are valid (not ignore_value)
        matches = (row_orig[valid_mask] == row_scram[valid_mask])
        match_count = matches.sum()
        matching_cells_per_row.append(match_count)

        # Check for full leakage (a row without any valid cells does not count)
        if valid_count > 0 and match_count == valid_count:
            full_leakage_count += 1
        # Check for partial leakage
        elif 0 < match_count < valid_count:
            partial_leakage_count += 1

    partial_leakage_percentage = (partial_leakage_count / total_rows) * 100 if total_rows > 0 else 0
    full_leakage_percentage = (full_leakage_count / total_rows) * 100 if total_rows > 0 else 0

    avg_matching_cells_per_row = np.mean(matching_cells_per_row) if total_rows > 0 else 0
    std_matching_cells_per_row = np.std(matching_cells_per_row) if total_rows > 0 else 0

    # Print the results
    print(f"Percentage of rows with partial leakage: {partial_leakage_percentage:.2f}%")
    print(f"Percentage of rows with full leakage: {full_leakage_percentage:.2f}%")
    print(f"Average number of matching cells per row: {avg_matching_cells_per_row:.2f} / {total_columns} fields per row")
    print(f"Standard deviation of matching cells per row: {std_matching_cells_per_row:.2f}")
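
If the check is to run inside a pipeline, it may be handier to get the numbers back instead of printing them. Below is a rough, vectorized sketch of the same row-leakage idea. The function name `row_leakage_summary`, the returned dictionary keys and the toy tables are my own assumptions for illustration, not part of the scrambler. In this sketch NaN cells are always left out of the comparison, and `ignore_value` optionally adds a second missing-value marker such as "n/a".

```python
import pandas as pd


def row_leakage_summary(df_original, df_scrambled, ignore_value=None):
    """Return row-leakage statistics for two equally shaped tables.

    Rows and columns are compared by position. Cells that are NaN, or equal
    to ignore_value, in either table are left out of the comparison.
    """
    if df_original.shape != df_scrambled.shape:
        raise ValueError("DataFrames do not have the same shape.")

    # Align both tables by position so that index labels do not matter.
    orig = df_original.reset_index(drop=True)
    scram = df_scrambled.reset_index(drop=True)
    scram.columns = orig.columns

    # A cell is valid when it holds a real value in both tables.
    valid = orig.notna() & scram.notna()
    if ignore_value is not None:
        valid = valid & orig.ne(ignore_value) & scram.ne(ignore_value)

    matches = orig.eq(scram) & valid
    match_count = matches.sum(axis=1)
    valid_count = valid.sum(axis=1)

    # Full leakage: every valid cell matches (rows without valid cells are skipped).
    full = (valid_count > 0) & (match_count == valid_count)
    # Partial leakage: at least one, but not all, valid cells match.
    partial = (match_count > 0) & (match_count < valid_count)

    n_rows = len(orig)
    return {
        "full_leakage_pct": 100.0 * float(full.mean()) if n_rows else 0.0,
        "partial_leakage_pct": 100.0 * float(partial.mean()) if n_rows else 0.0,
        "mean_matching_cells": float(match_count.mean()) if n_rows else 0.0,
        "std_matching_cells": float(match_count.std(ddof=0)) if n_rows else 0.0,
    }


# Toy tables (made up for illustration only).
original = pd.DataFrame({
    "age": [30, 41, 25, 58],
    "sex": ["F", "M", "n/a", "F"],
    "site": ["A", "B", "A", "B"],
})
scrambled = pd.DataFrame({
    "age": [30, 25, 41, 58],
    "sex": ["F", "n/a", "M", "M"],
    "site": ["A", "A", "B", "A"],
})

summary = row_leakage_summary(original, scrambled, ignore_value="n/a")
print(summary)
```

In the toy tables the first row is unchanged (full leakage) and the last row only keeps its age (partial leakage), so both percentages come out at 25%.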

In addition, comparing the Privacy Information Factor (implemented in the metaprivBIDS application), specifically the Field Information Gain, between the two datasets would show how the original and scrambled datasets assess the information gain for the individual columns. That way the user has an overview of how much information is retained in the scrambled dataset compared to the original. The values would most often be almost the same, as the permutation is row based, but it gives a quantifiable metric the user can refer back to.

Lastly, comparing the between-column correlation matrices of the scrambled and original datasets could be an option to check that the dataset has been permuted to a satisfactory level.
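
A minimal sketch of what such a correlation check could look like, assuming both tables are pandas DataFrames with the same column names. The function name `correlation_shift`, the returned keys and the toy data are hypothetical. Only numeric columns are used, and the diagonal is left out because it is always 1.

```python
import numpy as np
import pandas as pd


def correlation_shift(df_original, df_scrambled, method="pearson"):
    """Compare the between-column correlation matrices of two tables."""
    # Restrict to numeric columns; correlations are undefined for text columns.
    corr_orig = df_original.select_dtypes(include="number").corr(method=method)
    corr_scram = df_scrambled.select_dtypes(include="number").corr(method=method)

    # Element-wise absolute change in correlation (aligned on column names).
    abs_diff = (corr_orig - corr_scram).abs()

    # Summarise the off-diagonal entries only.
    off_diag = ~np.eye(len(abs_diff), dtype=bool)
    values = abs_diff.to_numpy()[off_diag]
    values = values[~np.isnan(values)]

    return {
        "abs_diff": abs_diff,
        "max_abs_diff": float(values.max()) if values.size else float("nan"),
        "mean_abs_diff": float(values.mean()) if values.size else float("nan"),
    }


# Toy tables (made up): b is perfectly correlated with a in the original,
# and the scrambled table holds the same b values in a different order.
original = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [2, 4, 6, 8, 10]})
scrambled = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [6, 2, 10, 4, 8]})

shift = correlation_shift(original, scrambled)
print(shift["abs_diff"])
print(f"max |delta r| = {shift['max_abs_diff']:.2f}")
```

A difference close to zero would suggest that the relationships between columns survived the scrambling, whereas a large difference indicates they were broken up. What counts as "satisfactory" is something the user would still have to decide.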

I hope this could be useful? If so, I could try to add it so it can run with the pipeline or as a separate pipeline.

schoffelen commented 3 weeks ago

for reference: https://github.com/CPernet/metaprivBIDS

schoffelen commented 3 weeks ago

OK, I wanted to try my luck today, but failed. I have trouble installing metaprivBIDS according to the instructions: I tried both the README of @CPernet's GitHub repository (which seems incomplete) and the getting_started rst file in the docs. There seems to be an issue with installing/building pygraphviz. Giving up for now.

EmiKib commented 3 weeks ago

@schoffelen does it fail at "conda install graphviz pygraphviz" or at "pip install -e . " ?

schoffelen commented 3 weeks ago

> @schoffelen does it fail at "conda install graphviz pygraphviz" or at "pip install -e . " ?

Hi @EmiKib thanks for getting back about this. I must confess that I haven't tried 'conda install', I did pip install for the graphviz etc.

I freshly caffeinated myself (and the compute cluster) and tried again just now. With conda install I made it through the installation process. Thanks! PS: you may want to add a line to the README.md telling users to cd into the metaprivBIDS repo (after git cloning it) before calling pip install -e .

EmiKib commented 3 weeks ago

@schoffelen great that it works now. I will add that cd line now. If you have any suggestions regarding anything in the app (design, buttons, functionality), please let me know; I am currently working on making it a bit more visually pleasing. I am currently on holiday, but I will try to reply quickly.

schoffelen commented 3 weeks ago

> @schoffelen great that it works now. I will add that cd line now. If you have any suggestions in regards to anything in the app (design, buttons, functionality), albeit I am currently working on making it a bit more visually pleasing, please let me know. I am currently on holiday, but I will try to be fast at replying.

Thanks, I will play around a bit. Don't bother replying during your holiday. Enjoy!