AlexWorldD / NetEmbs

Framework for Representation Learning on Financial Statement Networks
Apache License 2.0

generation of noise #9

Open boersmamarcel opened 5 years ago

boersmamarcel commented 5 years ago

Hi Aleksei,

Just a question about the noise generation in the toy example. I see that you have core processes and then add random noise accounts. For example:

0.5 A + 0.49 B + 0.01X -> C

My question is: are the noise accounts (X in the example) unique, or can there be a second process where X is either part of the core or also participates as noise?

Thank you for clarifying my understanding.

Kind regards,

Marcel Boersma

AlexWorldD commented 5 years ago

Hi, Marcel!

Currently it's just noise_name = randomString(6) for the following BPs: Sales, GoodsDelivery, Depreciation, Purchase, Payroll. So, with very low probability, the second case is also possible for these FAs, but from a statistical point of view, no. BUT if you are talking about the sequential example, Sales+Collection, then for Collection BPs the noisy financial accounts from the right part of the Sales process become noisy FAs on the left side of the Collection process:

self.addRecord("TradeReceivables_" + u_id, "TradeReceivables", -self.trade_rec, cur_transaction)
for key, item in noise["right"].items():
    self.addRecord(key, key, -item, cur_transaction)

self.addRecord("Cash_" + str(unique_id), "Cash", self.cash, cur_transaction)
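
To put a rough number on "very low probability", here is a minimal sketch of the chance that two independently drawn noise names coincide. It is not taken from NetEmbs: random_string is a hypothetical stand-in for randomString(6), and the alphabet (ASCII upper- and lower-case letters) is an assumption, so the real helper may behave differently.

```python
import math
import random
import string

# Assumption: names are drawn uniformly from ASCII letters; the real
# randomString helper in NetEmbs may use a different alphabet.
ALPHABET = string.ascii_letters


def random_string(length=6, rng=random):
    """Hypothetical stand-in for randomString(6)."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def collision_probability(n_names, length=6, alphabet_size=len(ALPHABET)):
    """Birthday-bound approximation of P(at least two of n_names coincide)."""
    space = alphabet_size ** length
    # 1 - exp(-n(n-1) / (2 * space)), written with expm1 for small values
    return -math.expm1(-n_names * (n_names - 1) / (2.0 * space))


print(random_string())
print(collision_probability(10_000))
```

Under these assumptions, even ten thousand noise names give a collision probability well below one percent, so treating the noise accounts as unique in the non-sequential BPs is a reasonable approximation.
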
AlexWorldD commented 5 years ago

Hi, Marcel! Below is code for building a histogram, from a DataFrame, of the number of LH and RH financial accounts per business process. It might be helpful.

import matplotlib.pyplot as plt
import pandas as pd
from collections import Counter
from matplotlib.ticker import MaxNLocator

def get_left_right(df):
    """
    Helper function for counting left-hand and right-hand accounts of a BP
    :param df: rows of a single business process (one group of the groupby)
    :return: Series with the number of FAs on the left side and on the right side
    """
    # "from" is True for left-hand accounts and False for right-hand accounts
    return pd.Series({"Left": int((df["from"] == True).sum()), "Right": int((df["from"] == False).sum())})

def getHistCounts(df):
    # One row per business process (ID) with its Left/Right account counts
    stat_here = df.groupby("ID", as_index=False).apply(get_left_right)
    res = dict()
    # Only the two count columns; newer pandas may also return the "ID" column
    for n in ("Left", "Right"):
        res[n] = Counter(stat_here[n])
    return res

def plotHist(df, title="Histogram"):
    stat_here = getHistCounts(df)
    for k, d in stat_here.items():
        ax = plt.figure().gca()
        # Bar height = number of BPs with that many FAs on side k
        ax.bar(list(d.keys()), list(d.values()))
        ax.xaxis.set_major_locator(MaxNLocator(integer=True))
        plt.title(k + "-side number of FAs")
        if title is not None and isinstance(title, str):
            plt.tight_layout()
            # The "img/" directory must already exist
            plt.savefig("img/" + title + k, dpi=140, pad_inches=0.01)
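
As a quick check of what the counting step produces, here is a small self-contained sketch on invented data. Only the column names "ID" and "from" follow the code above; the FA_Name column, the account names, and the values are assumptions made up for illustration, and the counts are computed directly with groupby rather than through the helper functions.

```python
from collections import Counter

import pandas as pd

# Toy journal entries (invented): two business processes, the second one
# carries an extra noise account "abcdef" on its left side.
toy = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2, 2],
    "FA_Name": ["Revenue", "Tax", "TradeReceivables",
                "Revenue", "Tax", "abcdef", "TradeReceivables"],
    "from": [True, True, False, True, True, True, False],
})

grouped = toy.groupby("ID")["from"]
left = grouped.sum().astype(int)   # True counts as 1: left-hand accounts per BP
right = grouped.size() - left      # remaining rows are right-hand accounts

# Histogram data: how many BPs have k accounts on each side
hist = {"Left": Counter(left.tolist()), "Right": Counter(right.tolist())}
print(hist)
```

Here one BP has two left-hand accounts, one has three, and both have a single right-hand account, which is the kind of dictionary that plotHist turns into bar charts.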

There is an example with simulated data in the Sandbox file.

Alex