cmu-phil / py-tetrad

Makes algorithms/code in Tetrad available in Python via JPype
MIT License

Null Pointer Exception Error attempting to run FGES-Mb through JPype #19

Closed: cdecker8 closed this issue 9 months ago

cdecker8 commented 9 months ago

I'm encountering a java.lang.NullPointerException when attempting to run the FGES-Mb algorithm on a target node using JPype. Despite confirming the presence of the target node within the score object data through various print statements, the NullPointerException persists during algorithm execution.

I suspect that the issue might stem from my limited experience with JPype (I'm admittedly a JPype novice and have limited Java experience beyond occasionally popping into the tetrad javadocs) or the method I'm using to add the target node to the target_list. Despite extensive testing, the error persists, suggesting that I may be overlooking a crucial aspect in this process.

Any guidance or suggestions on troubleshooting and resolving this NullPointerException would be greatly appreciated. Thank you!

Error message and stacktrace:

None
java.lang.NullPointerException
    at edu.cmu.tetrad.graph.EdgeListGraph.getAdjacentNodes(EdgeListGraph.java:555)
    at edu.cmu.tetrad.search.FgesMb.search(FgesMb.java:213)

My code

import jpype.imports
from jpype import JClass, JString, getDefaultJVMPath, startJVM, shutdownJVM, java, JPackage
import pandas as pd
import numpy as np
import graphviz
import sys
import pickle
import json
import dateutil
import datetime
BASE_DIR = "/workspace/notebooks/Causal/py-tetrad/pytetrad/"
sys.path.append(BASE_DIR)
date=datetime.datetime.now().strftime("%Y-%m-%d")
#convert to string
date=str(date)
if jpype.isJVMStarted():
    jpype.shutdownJVM()
else:
    try: #start a new jvm to clear memory heap and avoid memory error
        jpype.startJVM(classpath=[f"{BASE_DIR}/resources/tetrad-current.jar"])
        print(jpype.java.lang.System.getProperty("java.class.path"))

    except OSError:
        pass
#these packages need to be imported after starting jvm
import edu.cmu.tetrad.search as ts
import tools.translate as tr
import edu.cmu.tetrad.data as td
import edu.cmu.tetrad.graph as tg
import edu.cmu.tetrad.annotation as ta
from java.util import ArrayList

for fold in range(1,5):

    print('reading in data from ' + str(fold))
    df=pd.read_csv(f'/Data/processed/cleaner/imputedfolds/dftet_full_fold_{fold}.csv')

    if 'Unnamed: 0' in df.columns:
        df.drop('Unnamed: 0', axis=1, inplace=True)
    if 'age_years' in df.columns:
        df.drop('age_years', axis=1, inplace=True)

    Target='CSSS_womeans'
    #Target='report_kill_self_others_Value_Suicidal'

    if Target == 'CSSS_womeans':
        if 'report_kill_self_others_Value_Suicidal' in df.columns:
            df.dropna(subset=["report_kill_self_others_Value_Suicidal"], inplace=True)
    elif Target == 'report_kill_self_others_Value_Suicidal':
        if 'CSSS_womeans' in df.columns:
            df.dropna(subset=["CSSS_womeans"], inplace=True)


    dataSet=tr.pandas_data_to_tetrad(df)

    #create markov blanket

    score = ts.score.BdeuScore(dataSet)
    #test=ts.test.IndTestChiSquare(dataSet, 0.01)
    alg=JClass('edu.cmu.tetrad.search.FgesMb')(score)
    EdgeListGraph = JClass('edu.cmu.tetrad.graph.EdgeListGraph')() #try instantiating  blank edge list graph?
    print('running search')
    alg.setVerbose(True)
    search = None  # initialize so the later 'if search is not None' check won't raise NameError if the search is skipped

    if Target in df.columns:
        target_node=td.DiscreteVariable(Target) # instantiate target node to pass to the target list
        target_list = ArrayList()
        target_list.add(target_node)
        var_names = [var.getName() for var in score.getVariables()]  # names of the variables known to the score
        target_name=target_node.getName()
        print('target name is ' + str(target_name))
        if target_name in var_names:
            print('target in graph')
            if alg is not None:
                if target_list is not None:
                    try:
                        search=alg.search(target_list) #FIX ME: BREAKS HERE with java.lang.NullPointerException
                    except jpype.JException as e:
                        print(e.message())
                        print(e.stacktrace())
                else:
                    print('target list is None')
            else:
                print('alg is None')

        else:
            print('Target not in graph')
            print(var_names)
    else:
        print('Target not in df')

    # get edge list and matrix from search object

    if search is not None:
        print('getting edge list and matrix from search object')
        edge_list = search.getEdges()
        edge_list = edge_list.toString()

        #save edge list to file
        with open(f'/workspace/notebooks/Causal/results/mb_edge_list_{fold}_{date}.txt', 'w') as f:
            f.write(str(edge_list))
        print('writing graph to file')
        gdot = graphviz.Graph(format='pdf', 
                    engine='dot', 
                    graph_attr={'viewport': '3000', 
                                'outputorder': 'edgesfirst'})
        tr.write_gdot(search, gdot)
        gdot.render(f'/workspace/notebooks/Causal/results/mb_graph_{fold}_{date}', cleanup=True, quiet=True)

print('shutting down jvm')
shutdownJVM() #shutdown jvm to clear memory heap and avoid memory error
jdramsey commented 9 months ago

Oh, I think I know what the problem might be; it's a minor annoyance. I'll get back to you with a solution later tonight or tomorrow.

jdramsey commented 9 months ago

Here's some working code. While tracking down the main issue I deleted some of your code, so you'll obviously have to add some of it back, sorry :-D

Also, I wasn't quite sure what you meant by the folds: is this what you meant, or did you want to do sampling with replacement? (A sketch of the latter follows the code below.)

import datetime
import sys

import jpype.imports
import pandas as pd
from sklearn.model_selection import train_test_split

# BASE_DIR = "/workspace/notebooks/Causal/py-tetrad/pytetrad/"
BASE_DIR = ""
sys.path.append(BASE_DIR)
date = datetime.datetime.now().strftime("%Y-%m-%d")
# convert to string
date = str(date)
if jpype.isJVMStarted():
    jpype.shutdownJVM()
else:
    try:  # start a new jvm to clear memory heap and avoid memory error
        jpype.startJVM(classpath=[f"{BASE_DIR}resources/tetrad-current.jar"])
        print(jpype.java.lang.System.getProperty("java.class.path"))

    except OSError:
        pass

# these packages need to be imported after starting jvm
import edu.cmu.tetrad.search as ts
import tools.translate as tr
from java.util import ArrayList

df = pd.read_csv(f"{BASE_DIR}resources/bridges.data.version211_rev.txt", sep="\t")
target = 'SPAN'

for fold in range(1, 5):
    print('reading in data from ' + str(fold))

    train, test = train_test_split(df, test_size=.1)

    dataSet = tr.pandas_data_to_tetrad(train)
    score = ts.score.BdeuScore(dataSet)

    alg = ts.FgesMb(score)

    print('running search')
    alg.setVerbose(True)

    target_node = dataSet.getVariable(target)
    target_list = ArrayList()
    target_list.add(target_node)

    print(target_list)

    graph = alg.search(target_list)
    print(graph)

    # save edge list to file
    with open(f'example_out.{fold}.txt', 'w') as f:
        f.write(date + '\n\n')
        f.write(str(graph))
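If sampling with replacement was the intent, the train_test_split line above could be swapped for a bootstrap draw. Here is a minimal sketch under that assumption (the random_state=fold seed is only illustrative):

import pandas as pd

# Hypothetical bootstrap variant: resample the rows with replacement instead of
# taking a 90/10 train/test split; the file path is the same one used above.
df = pd.read_csv("resources/bridges.data.version211_rev.txt", sep="\t")
for fold in range(1, 5):
    train = df.sample(n=len(df), replace=True, random_state=fold)
    # ... then dataSet = tr.pandas_data_to_tetrad(train) and the FGES-MB search as above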
cdecker8 commented 9 months ago

Apologies for any confusion about the folds, which are specific to my analysis pipeline. The for loop reads in pre-saved datasets that represent essentially the same data but may vary slightly due to the imputation process. These are pre-saved imputed k-folds generated with a modified version of Multiple Imputation by Chained Equations (MICE), producing 0/1 imputations for missing data within a predictive machine learning model pipeline.

Each imputation (dataset) may differ slightly based on the selection of the test set it was imputed from, and it was important for me to maintain data consistency between the predictive and causal models for my dissertation.

In discussions with Erich K., we explored the idea of running all the folds and filtering out edges that appear in only 1 or 2 of the folds, since those could stem from imputation noise. I've written another script that reads the edge lists and counts how many of the searches each edge appears in. I'll test this approach and provide an update later today.
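For reference, a minimal sketch of that counting step, assuming each saved edge-list file is the toString() of the edge set written by the loop above (e.g. "[X1 --> X2, X3 --- X4]"); the glob pattern and the 3-of-4 threshold are only illustrative:

from collections import Counter
from glob import glob

# Tally, across folds, how many edge lists each edge string appears in, then
# keep only edges seen in at least 3 of the 4 folds (threshold is illustrative).
edge_counts = Counter()
for path in glob('/workspace/notebooks/Causal/results/mb_edge_list_*_*.txt'):
    with open(path) as f:
        text = f.read().strip().strip('[]')
    edges = {e.strip() for e in text.split(',') if e.strip()}
    edge_counts.update(edges)

stable_edges = sorted(e for e, n in edge_counts.items() if n >= 3)
print(stable_edges)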

I truly appreciate your assistance and your time in addressing this matter. Thank you.

jdramsey commented 9 months ago

Oh that's clever, I like it!

cdecker8 commented 9 months ago

Thanks. Occasionally, we manage to thaw out some good ideas here in Minnesota, despite the chilly weather!

Though results are pending, the appearance of additional edges suggests the new code is working as intended. I suspect the error stemmed from my oversight in failing to extract the target node from the dataset; instead, I constructed a new variable from just its name and passed that to the algorithm. Thanks again for your assistance with that; it likely saved me days of debugging.
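For anyone who hits the same NullPointerException: the working change is to look the target variable up on the Tetrad DataSet, so the search receives the exact Node object the score already knows about, rather than constructing a fresh DiscreteVariable from the name. A minimal before/after sketch, with names as in the snippets above:

# Broken: a newly constructed DiscreteVariable is not the node object held by the
# score's graph, so the graph's adjacency lookup likely returns null and the
# search throws the NullPointerException.
# target_node = td.DiscreteVariable(Target)

# Working: fetch the existing variable object from the dataset instead.
target_node = dataSet.getVariable(Target)
target_list = ArrayList()
target_list.add(target_node)
graph = alg.search(target_list)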