felixleopoldo / benchpress

A Snakemake workflow to run and benchmark structure learning (a.k.a. causal discovery) algorithms for probabilistic graphical models.
https://benchpressdocs.readthedocs.io
GNU General Public License v2.0

Ground truth adjacency matrix's column order might need to be the same as the dataset's. #92

Closed (yasu-sh closed 1 year ago)

yasu-sh commented 1 year ago

I think this is true: without bootstrapping I get a large SHD, even though the tetrad result with bootstrapping looks reasonable to my eye in the plot. If it is true, it is important for users.

Dataset: alarm (made from bnlearn by me). Left: ground truth / center: without bootstrapping / right: with bootstrapping = 5. image

Diffplot image

Graph structure image

felixleopoldo commented 1 year ago

Sorry, I don't see your point, but it's nice that you verify these modules. Did you find an error? As an example, I have the alarm example from here which seems to be the same as the one I use here.

The script I used for turning a bnlearn network to adjacency matrix is here.

felixleopoldo commented 1 year ago

I see, it is the order of the variables you are talking about? I'll have another look.

felixleopoldo commented 1 year ago

What do you mean by bootstrapping? Did you add another parameter to the tetrad_fges module?

yasu-sh commented 1 year ago

@felixleopoldo I am sorry. I should have prepared everything before reporting this. (I forgot to add the words "I will investigate later/tomorrow.")

I was wondering whether or not the evaluation metrics/plotting modules take a discrepancy in the order of variables into account. Give me some time to check this. I have spent several hours on this phenomenon.

As for the bootstrapping, I added the parameter to check its effect, following your tutorial at UAI 2023. It is not the point in this case. I will check with the asia dataset.

[Preparation]
Dataset: obtained in the R console with data(alarm).
Adjacency matrix: created following the bnlearn help (below).

> alarm_gt <- bnlearn::model2network(paste0(
+ "[HIST|LVF][CVP|LVV][PCWP|LVV][HYP][LVV|HYP:LVF][LVF]",
+ "[STKV|HYP:LVF][ERLO][HRBP|ERLO:HR][HREK|ERCA:HR][ERCA][HRSA|ERCA:HR][ANES]",
+ "[APL][TPR|APL][ECO2|ACO2:VLNG][KINK][MINV|INT:VLNG][FIO2][PVS|FIO2:VALV]",
+ "[SAO2|PVS:SHNT][PAP|PMB][PMB][SHNT|INT:PMB][INT][PRSS|INT:KINK:VTUB][DISC]",
+ "[MVS][VMCH|MVS][VTUB|DISC:VMCH][VLNG|INT:KINK:VTUB][VALV|INT:VLNG]",
+ "[ACO2|VALV][CCHL|ACO2:ANES:SAO2:TPR][HR|CCHL][CO|HR:STKV][BP|CO:TPR]"))
> bnlearn::amat(alarm_gt)
     ACO2 ANES APL BP CCHL CO CVP DISC ECO2 ERCA ERLO FIO2 HIST HR HRBP HREK HRSA HYP INT KINK LVF LVV MINV MVS PAP PCWP PMB PRSS PVS SAO2 SHNT STKV TPR VALV VLNG VMCH VTUB
ACO2    0    0   0  0    1  0   0    0    1    0    0    0    0  0    0    0    0   0   0    0   0   0    0   0   0    0   0    0   0    0    0    0   0    0    0    0    0
ANES    0    0   0  0    1  0   0    0    0    0    0    0    0  0    0    0    0   0   0    0   0   0    0   0   0    0   0    0   0    0    0    0   0    0    0    0    0

yasu-sh commented 1 year ago

Before checking my own dataset built from bnlearn's asia, I reproduced a discrepancy in the results with the sachs dataset from the benchpress repository itself. I would like to understand why the discrepancy happens. My concerns are the diffplots and the benchmark metrics.

The results differ when you use the dataset with its column order reversed.

The steps, in the resources/data/mydatasets directory:

  1. Reverse the column order of the sachs dataset in R

    > sachs.data <- data.table::fread("2005_sachs_2_cd3cd28icam2_log_std.csv")
    > head(sachs.data,2)
          Akt        Erk        Jnk         Mek         P38      PIP2       PIP3         PKA        PKC       Plcg        Raf
    1: -0.6343361 -0.1117883 -0.3707515 -0.58558428 -0.06458972 0.6818205 -0.3240229 -0.04326735 -0.6878319 -0.3955337 -0.5148379
    2: -3.0409103 -2.5379116  1.0548648 -0.08291055 -0.10231212 1.6658269  1.1813047 -4.07209170  0.2993658  0.6777917 -0.1101130
    > data.table::setcolorder(sachs.data, sort(colnames(sachs.data), decreasing = T))
    > head(sachs.data,2)
          Raf       Plcg        PKC         PKA       PIP3      PIP2         P38         Mek        Jnk        Erk        Akt
    1: -0.5148379 -0.3955337 -0.6878319 -0.04326735 -0.3240229 0.6818205 -0.06458972 -0.58558428 -0.3707515 -0.1117883 -0.6343361
    2: -0.1101130  0.6777917  0.2993658 -4.07209170  1.1813047 1.6658269 -0.10231212 -0.08291055  1.0548648 -2.5379116 -3.0409103
    > data.table::fwrite(sachs.data, "2005_sachs_2_cd3cd28icam2_log_std_colorder_dec.csv")
  2. Run paper_sachs.json as is

Left diffplot - config/paper_sachs.json

{
    "benchmark_setup": {
        "data": [
            {
                "graph_id": "sachs.csv",
                "parameters_id": null,
                "data_id": "2005_sachs_2_cd3cd28icam2_log_std.csv",
                "seed_range": null
            }
        ],

image

  3. Run paper_sachs.json with the dataset replaced by the one created above.

Right diffplot - config/paper_sachs.json

{
    "benchmark_setup": {
        "data": [
            {
                "graph_id": "sachs.csv",
                "parameters_id": null,
                "data_id": "2005_sachs_2_cd3cd28icam2_log_std_colorder_dec.csv",
                "seed_range": null
            }
        ],

image

felixleopoldo commented 1 year ago

Thanks. I think the order of the dataset columns and the columns of the adjacency matrix have to be the same. For the Sachs data I manually reordered and renamed the data columns (as far as I remember) to match the graph from bnlearn.

Did you find an example generated by bp where there is a mismatch of orders?
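A minimal base R sketch of why the two orders have to agree, assuming the estimated and true matrices are compared entry by entry without looking at the names (an assumption for illustration, not a description of how benchpress computes SHD). The node names A, B, C and both matrices are hypothetical toy data.

```r
# Toy ground truth: A -> B -> C, stored in alphabetical node order.
nodes <- c("A", "B", "C")
truth <- matrix(0L, 3, 3, dimnames = list(nodes, nodes))
truth["A", "B"] <- 1L
truth["B", "C"] <- 1L

# The very same graph, but with rows and columns in reversed order,
# as an algorithm would return it if fed column-reversed data.
rev_nodes <- rev(nodes)
estimate <- truth[rev_nodes, rev_nodes]

# Comparing by position ignores the names and reports spurious differences.
mismatch_by_position <- sum(truth != estimate)

# Reindexing the estimate by the ground truth's names removes them.
aligned <- estimate[rownames(truth), colnames(truth)]
mismatch_by_name <- sum(truth != aligned)

print(mismatch_by_position)  # 4
print(mismatch_by_name)      # 0
```

Both matrices describe the identical graph, so every one of the four positional mismatches comes from the ordering alone.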

yasu-sh commented 1 year ago

Thanks for explaining your data preparation.

Did you find an example generated by bp where there is a mismatch of orders?

Yes and No.

  - bnlearn's output from the amat function: columns are in alphabetical order
  - bnlearn's built-in datasets such as alarm: columns are not in alphabetical order

This code reflects the internal node order of the tetrad graph instance.

adjmat.csv_out_tetrad_without_bootstrap.txt

Graph Nodes:
CVP;PCWP;HIST;TPR;BP;CO;HRBP;HREK;HRSA;PAP;SAO2;FIO2;PRSS;ECO2;MINV;MVS;HYP;LVF;APL;ANES;PMB;INT;KINK;DISC;LVV;STKV;CCHL;ERLO;HR;ERCA;SHNT;PVS;ACO2;VALV;VLNG;VTUB;VMCH

adjmat.csv_out_tetrad_with_bootstrap.txt

Graph Nodes:
ACO2;ANES;APL;BP;CCHL;CO;CVP;DISC;ECO2;ERCA;ERLO;FIO2;HIST;HR;HRBP;HREK;HRSA;HYP;INT;KINK;LVF;LVV;MINV;MVS;PAP;PCWP;PMB;PRSS;PVS;SAO2;SHNT;STKV;TPR;VALV;VLNG;VMCH;VTUB

[Information]
image
Adjacency matrices: adjmat_tetrad_fges_estimated_withoutbootstrap.csv, alarm_gt_amat.csv, adjmat_tetrad_fges_estimated_withbootstrap.csv

yasu-sh commented 1 year ago

Even in tetrad's case without bootstrapping (the normal benchpress case), the order of the output nodes is the same as the dataset's column order. The conclusion is the same. [fix] The dataset needs to have the same column order as the ground truth adjacency matrix.
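The fix can be sketched in base R as follows, assuming both the adjacency matrix and the dataset carry the node names as column names. The objects adjmat and dataset below are hypothetical toy stand-ins for the real ground truth matrix and data file.

```r
# Hypothetical ground truth adjacency matrix with named columns.
adjmat <- matrix(0L, 3, 3,
                 dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# Hypothetical dataset whose columns are in a different order.
dataset <- data.frame(C = c(0.1, 0.2), A = c(1.1, 1.2), B = c(2.1, 2.2))

# The ground truth's column order is the target order.
target_order <- colnames(adjmat)

# Fail early if the two files do not describe the same set of variables.
stopifnot(setequal(target_order, colnames(dataset)))

# Reorder the dataset columns by name to match the adjacency matrix.
dataset_aligned <- dataset[, target_order, drop = FALSE]

print(colnames(dataset_aligned))
```

With real files, the same reindexing would come after read.csv(..., check.names = FALSE), and the result would be saved with write.csv(..., row.names = FALSE) before pointing data_id at it.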

felixleopoldo commented 1 year ago

Then I think it's OK. Looking at the data file and the graph file will also be less confusing when the columns are consistent. But there should be a note clarifying this for scenario II, somewhere here.

yasu-sh commented 1 year ago

I am glad to hear that. It's good.

Some users, especially beginners, may look at the 'file format' section first. It might be good to add a link from the page below to the scenario II page. https://benchpressdocs.readthedocs.io/en/latest/data_formats.html#observational-data

felixleopoldo commented 1 year ago

Yes, that might be good as well.

yasu-sh commented 1 year ago

Thanks for letting me know. Should I close this issue now or after the documents are updated?

felixleopoldo commented 1 year ago

Sure, maybe you can have a look here first. I added a note in just one place so as not to make it more confusing.

yasu-sh commented 1 year ago

Done. I am happy to have the consistency, and I no longer have to see garbage (a terrible SHD or diffplot).