AlexWorldD / NetEmbs

Framework for Representation Learning on Financial Statement Networks
Apache License 2.0

bug random walk #3

Closed boersmamarcel closed 5 years ago

boersmamarcel commented 5 years ago

when I run

from NetEmbs.FSN import *
randomWalk(fsn, 1, length=10, direction="COMBI")

I get

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-18-95ca35af2a0e> in <module>()
      1 from NetEmbs.FSN import *
----> 2 randomWalk(fsn, 1, length=10, direction="COMBI")

/Users/mboersma/Documents/phd/students/alex/NetEmbs-master/NetEmbs/FSN/utils.py in randomWalk(G, vertex, length, direction, version, return_full_path, debug)
    255             elif version == "MetaDiff":
    256                 if direction is "COMBI":
--> 257                     new_v = step(G, cur_v, cur_direction, mode=2, return_full_step=return_full_path, debug=debug)
    258                     cur_direction = mask[cur_direction]
    259                 else:

/Users/mboersma/Documents/phd/students/alex/NetEmbs-master/NetEmbs/FSN/utils.py in step(G, vertex, direction, mode, allow_back, return_full_step, pressure, debug)
    146         return vertex
    147     elif not G.has_node(vertex):
--> 148         raise ValueError("Vertex {!r} is not in FSN!".format(vertex))
    149     if direction == "IN":
    150         ins = G.in_edges(vertex, data=True)

ValueError: Vertex 1 is not in FSN!

Any idea how I can fix this? If you need more info, please let me know.

AlexWorldD commented 5 years ago

What are your BP IDs? Your data might simply contain IDs like {2, 3, 5, etc.} and no BP with an ID equal to 1, which is why you are getting the error. The arguments of the function are the following:

def randomWalk(G, vertex=None, length=3, direction="IN", version="MetaDiff", return_full_path=False, debug=False):
    """
    RandomWalk function for sampling the sequence of nodes from given graph and initial node
    :param G: Bipartite graph, an instance of networkx
    :param vertex: initial node
    :param length: the maximum length of RandomWalk
    :param direction: The direction of walking. IN - go via source financial accounts, OUT - go via target financial accounts
    :param version: Version of step:
    "DefUniform" - Pure RandomWalk (uniform probabilities, follows the direction),
    "DefWeighted" - RandomWalk (weighted probabilities, follows the direction),
    "MetaUniform" - Default Metapath-version (uniform probabilities, change directions),
    "MetaWeighted" - Weighted Metapath version (weighted probabilities "rich gets richer", change directions),
    "MetaDiff" - Modified Metapath version (probabilities depend on the differences between edges, change directions)
    :param return_full_path: If True, return the full path with FA nodes
    :param debug: Debug boolean flag, print intermediate steps
    :return: Sampled sequence of nodes
    """

AlexWorldD commented 5 years ago

The set of BP nodes can be obtained with the following method of the FSN class:

fsn.get_BP()
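
A minimal sketch of checking the start vertex before calling randomWalk. It assumes fsn.get_BP() returns an iterable of node IDs; the helper name pick_start_vertex and the fallback that matches IDs by their string form are hypothetical and not part of NetEmbs.

```python
def pick_start_vertex(bp_ids, wanted=None):
    """Return a vertex ID that really exists among bp_ids.

    bp_ids: iterable of business-process node IDs (assumed to be what
            fsn.get_BP() returns).
    wanted: the ID the caller would like to start from, or None.
    """
    ids = list(bp_ids)
    if not ids:
        raise ValueError("No BP nodes available")
    if wanted is None:
        return ids[0]
    if wanted in ids:
        return wanted
    # IDs loaded from files are often strings, so 1 and "1" differ.
    by_text = {str(i): i for i in ids}
    if str(wanted) in by_text:
        return by_text[str(wanted)]
    raise ValueError("Vertex {!r} is not among the BP nodes".format(wanted))


# Toy IDs standing in for fsn.get_BP(): a mix of numbers and strings.
toy_ids = [2, "3", "my_long_name123"]
start = pick_start_vertex(toy_ids, 3)
print(start)
```

The returned value could then be passed as the vertex argument of randomWalk instead of a hard-coded 1.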

boersmamarcel commented 5 years ago

Yes, I have many weird business process IDs: some are numbers and some are combinations of numbers and letters.

AlexWorldD commented 5 years ago

Then you need to pass one of your actual BP IDs as the vertex argument (vertex=None in the signature) when sampling a sequence from fsn, something like

from NetEmbs.FSN import *
randomWalk(fsn, "my_long_name123", length=10, direction="COMBI")

boersmamarcel commented 5 years ago

OK, so I just need to pass the first item of the list.

boersmamarcel commented 5 years ago

it seems to work! :)

boersmamarcel commented 5 years ago

I did a couple of test runs and at a quick glance things look good. I think the combi strategy gives the best results, because the others only match on input or output; I haven't evaluated the all strategy thoroughly yet. How is building the skipgram going?

boersmamarcel commented 5 years ago

I get the following

Fatal ValueError during step
Traceback (most recent call last):
  File "/Users/mboersma/Documents/phd/students/alex/NetEmbs-dev-4/NetEmbs/FSN/utils.py", line 203, in step
    tmp_vertex = np.random.choice(outs, p=probas)
  File "mtrand.pyx", line 1144, in mtrand.RandomState.choice
ValueError: probabilities contain NaN
Fatal ValueError during step
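
For context, this error can be reproduced outside NetEmbs. The following is a minimal sketch, assuming the step probabilities come from dividing edge weights by their sum; the helper safe_probas and its uniform fallback are illustrative assumptions, not the library's behavior.

```python
import numpy as np


def safe_probas(weights):
    """Turn edge weights into probabilities, falling back to uniform
    when the weights cannot be normalized (all zero, NaN or infinite)."""
    ws = np.asarray(weights, dtype=float)
    total = ws.sum()
    if not np.isfinite(total) or total <= 0:
        return np.full(len(ws), 1.0 / len(ws))
    return ws / total


outs = ["a", "b"]
zero_weights = np.array([0.0, 0.0])

# Naive normalization: 0/0 gives NaN, which np.random.choice rejects.
with np.errstate(invalid="ignore"):
    naive = zero_weights / zero_weights.sum()
try:
    np.random.choice(outs, p=naive)
    naive_failed = False
except ValueError:
    naive_failed = True

# Guarded normalization yields a valid distribution for the same input.
picked = np.random.choice(outs, p=safe_probas(zero_weights))
```

Whether a uniform fallback is appropriate for the walk is a modelling choice; raising a clearer error that names the offending node would be an equally reasonable guard.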

boersmamarcel commented 5 years ago

@AlexWorldD any idea?

AlexWorldD commented 5 years ago

Hi! What are the last rows in the log file? It should write the current node etc. when it hits an exception.

        except Exception as e:
            if LOG:
                snapshot = {"CurrentNode": tmp_vertex, "CurrentWeight": tmp_weight,
                            "NextCandidates": list(zip(outs, ws)), "Probas": probas}
                local_logger = logging.getLogger("NetEmbs.Utils.step")
                local_logger.error("Fatal ValueError during step", exc_info=True)
                local_logger.info("Snapshot" + str(snapshot))
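
A self-contained sketch of the same logging pattern using only the standard library, to show what ends up in the log. The logger name is taken from the snippet above; the in-memory ListHandler and the toy snapshot values are hypothetical and only there to make the output inspectable.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical handler that keeps formatted records in memory."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))


logger = logging.getLogger("NetEmbs.Utils.step")
logger.setLevel(logging.INFO)
handler = ListHandler()
logger.addHandler(handler)

try:
    raise ValueError("probabilities contain NaN")
except ValueError:
    snapshot = {"CurrentNode": "bp_1", "Probas": [float("nan")]}
    # exc_info=True appends the active traceback to the log record.
    logger.error("Fatal ValueError during step", exc_info=True)
    logger.info("Snapshot" + str(snapshot))

for line in handler.lines:
    print(line)
```

The first record carries the message plus the full traceback, and the second carries the snapshot, so the last rows of the log should identify the node where the step failed.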

AlexWorldD commented 5 years ago

Right now we fill NA values inside the split_to_debit_credit() function, so it might be better to do it separately? df.fillna(0.0, inplace=True)
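
A minimal sketch of doing the fill as its own step, so that it runs whether or not the data is split. The column names "Debit" and "Credit" and the helper name fill_missing_amounts are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def fill_missing_amounts(df, columns=("Debit", "Credit")):
    """Return a copy of df where missing amounts are replaced by 0.0."""
    out = df.copy()
    out[list(columns)] = out[list(columns)].fillna(0.0)
    return out


raw = pd.DataFrame({
    "Name": ["a", "b"],
    "Debit": [100.0, np.nan],
    "Credit": [np.nan, 100.0],
})
clean = fill_missing_amounts(raw)
print(clean)
```

Returning a copy instead of using inplace=True keeps the caller's DataFrame untouched, which makes the step easier to test in isolation.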

AlexWorldD commented 5 years ago

Yes, that's the problem. You don't split the data, so that part of the code is never executed. One moment.

AlexWorldD commented 5 years ago

Ok, at least now the input DataFrame is preprocessed with fillna() method. So, I guess the error has been fixed.

Alex

boersmamarcel commented 5 years ago

Awesome! Then I can run some more analytics tonight. Another question, I keep on getting the same colors when I have more than 8 categories or so. Can we fix this? Now it is hard to distinguish between the categories.

Kind regards,

Marcel Boersma

On May 3, 2019, at 8:43 AM, Alex Malyutin notifications@github.com wrote:

Ok, at least now the input DataFrame is preprocessed with fillna() method. So, I guess the error has been fixed.

Alex

AlexWorldD commented 5 years ago

Hmm, it should be fixed already. Now it combines different colors with different markers.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def plot_tSNE(fsn_embs, title="tSNE", rand_state=1, manual=False):
    import os
    # Let the process continue when more than one OpenMP runtime is loaded
    os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'
    from sklearn.manifold import TSNE
    import seaborn as sns
    tsne = TSNE(random_state=rand_state)
    # Second column holds the embedding vectors; flatten each into one row
    embdf = pd.DataFrame(list(map(np.ravel, fsn_embs.iloc[:, 1])))
    embed_tsne = tsne.fit_transform(embdf)
    fsn_embs["x"] = pd.Series(embed_tsne[:, 0])
    fsn_embs["y"] = pd.Series(embed_tsne[:, 1])
    markers = ["o", "v", "s"]
    cur_m = 0
    if manual:
        plt.clf()
        n_gr = 0
        for name, group in fsn_embs.groupby("FA_Name"):
            n_gr += 1
            # Move on to the next marker every few groups, wrapping around
            if n_gr > 3:
                cur_m = cur_m + 1 if len(markers) - 1 > cur_m else 0
                n_gr = 0
            plt.scatter(group["x"].values, group["y"].values, s=150, marker=markers[cur_m], label=name)
        # sns.scatterplot(data=fsn_embs, x="x", y="y", hue="FA_Name", s=150)
        plt.legend(bbox_to_anchor=(1.3, 1), loc="upper right", frameon=False, markerscale=1)
    else:
        fg = sns.FacetGrid(data=fsn_embs, hue='FA_Name', aspect=1.61, height=6, legend_out=True)
        fg.map(plt.scatter, 'x', 'y')
        fg.add_legend()
    if title is not None and isinstance(title, str):
        plt.tight_layout()
        plt.savefig("img/" + title, dpi=140, pad_inches=0.01)
    plt.show()
    return fsn_embs


def set_font(s, reset=False):
    if reset:
        plt.rcParams.update(plt.rcParamsDefault)
    plt.rcParams["figure.figsize"] = [20, 10]
    # plt.rcParams['font.family'] = 'serif'
    # plt.rcParams['font.serif'] = ['Times New Roman'] + plt.rcParams['font.serif']
    plt.rc('font', size=s)             # controls default text sizes
    plt.rc('axes', titlesize=s)        # fontsize of the axes title
    plt.rc('axes', labelsize=s)        # fontsize of the x and y labels
    plt.rc('xtick', labelsize=s - 2)   # fontsize of the tick labels
    plt.rc('ytick', labelsize=s - 2)   # fontsize of the tick labels
    plt.rc('legend', fontsize=s)       # legend fontsize
    plt.rc('figure', titlesize=s)      # fontsize of the figure title


rand_seed = 2
set_font(20)
_ = plot_tSNE(res, "FastTrain10k", rand_seed, manual=True)
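
One way to make the color and marker combination explicit is to precompute the pairs. This is only a sketch in plain Python: the helper name style_cycle is hypothetical, and "C0" to "C9" are matplotlib's shorthand names for the colors of its default cycle.

```python
from itertools import cycle, islice, product


def style_cycle(n, colors, markers):
    """Return n (color, marker) pairs; a pair repeats only after
    len(colors) * len(markers) categories."""
    # Marker is the outer loop, so the color changes fastest.
    combos = ((c, m) for m, c in product(markers, colors))
    return list(islice(cycle(combos), n))


colors = ["C{}".format(i) for i in range(10)]
markers = ["o", "v", "s"]
styles = style_cycle(12, colors, markers)
print(styles[9], styles[10])
```

Each pair could be passed to plt.scatter as color=c, marker=m inside the groupby loop, which gives 30 distinguishable categories with these lists.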

boersmamarcel commented 5 years ago

@AlexWorldD I tried again but I keep receiving:

Traceback (most recent call last):
  File "/Users/mboersma/Documents/phd/students/alex/NetEmbs-dev-4/NetEmbs/FSN/utils.py", line 203, in step
    tmp_vertex = np.random.choice(outs, p=probas)
  File "mtrand.pyx", line 1144, in mtrand.RandomState.choice
ValueError: probabilities contain NaN
Fatal ValueError during step

AlexWorldD commented 5 years ago

OK, that's weird. What is in the log file? It should be logs.log in your project directory.

AlexWorldD commented 5 years ago

from NetEmbs.Logs.custom_logger import log_me
MAIN_LOGGER = log_me()
MAIN_LOGGER.info("Started..")

boersmamarcel commented 5 years ago

I found one such entry again and noticed the following:

A single journal entry A->B where all amounts in that entry are zero. Thus

name  debit  credit
a     0      0
b     0      0

These are small errors in the data itself; we can filter out these transactions in the data preparation step.
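
A minimal sketch of such a filter with pandas. The column names "ID", "Name", "Debit" and "Credit" and the helper name drop_zero_entries are assumptions for illustration; the real data may be laid out differently.

```python
import pandas as pd


def drop_zero_entries(df, id_col="ID", amount_cols=("Debit", "Credit")):
    """Remove every journal entry whose amounts are all zero."""
    amounts = df[list(amount_cols)].abs().sum(axis=1)
    # Total absolute amount per journal entry, broadcast back to its rows.
    entry_total = amounts.groupby(df[id_col]).transform("sum")
    return df[entry_total > 0].reset_index(drop=True)


journal = pd.DataFrame({
    "ID": [1, 1, 2, 2],
    "Name": ["a", "b", "a", "b"],
    "Debit": [0.0, 0.0, 50.0, 0.0],
    "Credit": [0.0, 0.0, 0.0, 50.0],
})
filtered = drop_zero_entries(journal)
print(filtered)
```

Filtering per entry rather than per row keeps valid rows that legitimately have a zero on one side, such as the credit line of entry 2.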

AlexWorldD commented 5 years ago

That case should show up as NaNs during the normalization procedure (division by zero), so the current version of the _preparedata function should handle it:

    if norm:
        original_df = normalize(original_df)
    # Remove rows with NaN values after normalization (e.g. when all values were 0.0 -> something/zero leads to NaN)
    original_df.dropna(subset=["Debit", "Credit"], inplace=True)

But again, it works for my test cases... Hope it'll be also OK for real data...
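
To illustrate the mechanism on toy data, here is a sketch in which an all-zero entry turns into NaN rows during normalization and is then dropped. The function normalize_per_entry is a hypothetical stand-in for NetEmbs' normalize(), and the "ID" column is an assumption.

```python
import pandas as pd


def normalize_per_entry(df, id_col="ID"):
    """Divide each amount by its journal entry's total for that column."""
    out = df.copy()
    for col in ("Debit", "Credit"):
        totals = out.groupby(id_col)[col].transform("sum")
        out[col] = out[col] / totals  # 0.0 / 0.0 gives NaN
    return out


journal = pd.DataFrame({
    "ID": [1, 1, 2, 2],
    "Name": ["a", "b", "a", "b"],
    "Debit": [0.0, 0.0, 50.0, 0.0],
    "Credit": [0.0, 0.0, 0.0, 50.0],
})
normalized = normalize_per_entry(journal)
kept = normalized.dropna(subset=["Debit", "Credit"])
print(kept)
```

If the NaN error still appears on real data, comparing the row count before and after dropna would show whether this code path is actually reached.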