AlexWorldD / NetEmbs

Framework for Representation Learning on Financial Statement Networks
Apache License 2.0

Data Cleaning #11

Open AlexWorldD opened 5 years ago

boersmamarcel commented 5 years ago

see #3 for additional cleaning step

AlexWorldD commented 5 years ago

Hi, Marcel! I'm digging deeper into our data cleaning, but I'm a bit confused about the possible source of the wrong values. According to your code

df.amount = df.amount.apply(lambda x: x.replace(",", "."))
df.amount = df.amount.astype(float)

the second line should raise an error if it gets a str that it cannot convert to float, but you don't get any error when running your part of the code.

Could you run the following function (after the code above) to check whether that is really the case: countDirtyData(df)

def countStrings(df, col=["amount"]):
    # Count the str-typed entries in each listed column
    output = dict()
    for title in col:
        output[title] = int(df[title].map(lambda x: isinstance(x, str)).sum())
    return output

def countNaN(df, col=["amount"]):
    # Count the NaN/None entries in each listed column
    output = dict()
    for title in col:
        output[title] = int(df[title].isnull().sum())
    return output

def countDirtyData(df, col=["amount"]):
    print("Strings in numeric columns: ", countStrings(df, col))
    print("NaN in numeric columns: ", countNaN(df, col))
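
For reference, a minimal sketch of the behaviour discussed above, assuming a small made-up `amount` column (the sample values are hypothetical, not taken from the real dataset). `astype(float)` raises on a string it cannot parse, while `pd.to_numeric(..., errors="coerce")` turns such entries into NaN so they can be counted instead of stopping the run:

```python
import pandas as pd

# Hypothetical sample: comma decimals plus one entry that is not a number.
df = pd.DataFrame({"amount": ["1,5", "2,25", "n/a"]})
normalized = df["amount"].str.replace(",", ".", regex=False)

try:
    normalized.astype(float)
    raised = False
except ValueError:
    # astype(float) refuses strings it cannot parse as a float
    raised = True

# errors="coerce" replaces unparseable entries with NaN instead of raising
amount = pd.to_numeric(normalized, errors="coerce")
n_bad = int(amount.isna().sum())
print("astype raised:", raised, "| coerced to NaN:", n_bad)
```

Which of the two is preferable depends on whether a bad value should stop the pipeline or just be counted and dropped later.
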
boersmamarcel commented 5 years ago

I think the nan had to do with entries that are all zero as mentioned in the other issue. I’ll give your code a try but that will be tomorrow.

Kind regards,

Marcel Boersma

Sent by email in reply to Alex Malyutin's comment of May 4, 2019, 6:29 PM.

AlexWorldD commented 5 years ago

Hi!

This is no longer an issue: I've added cleaning for NaNs and strings, as well as for the case you described in the other issue. At least in my experiment, it works fine. Feel free to look at the notebook.

I did my best with the Python script. To work with your data, uncomment the line # MODE = "RealData", but first I recommend testing it once more with the simulated data on your own laptop.

Alex

AlexWorldD commented 5 years ago

PS - Negative values in Credit/Debit should be fixed with the following part of the data pre-processing procedure:

row["Credit"] = abs(row["Value"]) if row["type"] == "credit" else 0.0
row["Debit"] = abs(row["Value"]) if row["type"] == "debit" else 0.0

That's why I've written a modified function, prepare_dataMarcel, which operates on your initial DataFrame. It is important that the DataFrame has the following columns:

["transactionID", "accountID", "BR", "amount", "type"]

For renaming, I recommend using rename_columns() from NetEmbs.DataProcessing:

rename_columns(d, names={"transactionID": "ID", "accountID": "FA_Name", "BR": "GroundTruth", "amount": "Value"})
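
As an illustration of the Credit/Debit rule above, here is a small column-wise sketch. It assumes the renamed columns "Value" and "type" and uses made-up rows; it is not the code from prepare_dataMarcel, only the same rule written with pandas `Series.where`:

```python
import pandas as pd

# Hypothetical rows, using the column names after renaming.
d = pd.DataFrame({
    "ID": [1, 1],
    "Value": [-100.0, 100.0],
    "type": ["credit", "debit"],
})

# Keep the magnitude in the matching column and 0.0 in the other one,
# so a negative Value can never end up as a negative Credit or Debit.
magnitude = d["Value"].abs()
d["Credit"] = magnitude.where(d["type"] == "credit", 0.0)
d["Debit"] = magnitude.where(d["type"] == "debit", 0.0)
print(d)
```
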
boersmamarcel commented 5 years ago

Thanks! The computer analysed data for 48 hours and most datasets went through fine:)

Kind regards,

Marcel Boersma

Sent by email in reply to Alex Malyutin's comment of May 5, 2019, 2:36 AM.

boersmamarcel commented 5 years ago

@AlexWorldD everything seems to work but I do get this message:

--- Logging error ---
Final shape of DataFrame is  (36211, 7)
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/logging/__init__.py", line 983, in emit
    msg = self.format(record)
  File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/logging/__init__.py", line 829, in format
    return fmt.format(record)
  File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/logging/__init__.py", line 569, in format
    record.message = record.getMessage()
  File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/logging/__init__.py", line 331, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "/Applications/PyCharm CE.app/Contents/helpers/pydev/pydevconsole.py", line 511, in <module>
    pydevconsole.start_server(host, int(port), int(client_port), client_host)
  File "/Applications/PyCharm CE.app/Contents/helpers/pydev/pydevconsole.py", line 336, in start_server
    process_exec_queue(interpreter)
  File "/Applications/PyCharm CE.app/Contents/helpers/pydev/pydevconsole.py", line 192, in process_exec_queue
    more = interpreter.add_exec(code_fragment)
  File "/Applications/PyCharm CE.app/Contents/helpers/pydev/_pydev_bundle/pydev_console_utils.py", line 281, in add_exec
    more = self.do_add_exec(code_fragment)
  File "/Applications/PyCharm CE.app/Contents/helpers/pydev/_pydev_bundle/pydev_ipython_console.py", line 41, in do_add_exec
    res = bool(self.interpreter.add_exec(code_fragment.text))
  File "/Applications/PyCharm CE.app/Contents/helpers/pydev/_pydev_bundle/pydev_ipython_console_011.py", line 442, in add_exec
    self.ipython.run_cell(line, store_history=True)
  File "/Users/mboersma/PycharmProjects/networkembedding/venv/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 2705, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/Users/mboersma/PycharmProjects/networkembedding/venv/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 2815, in run_ast_nodes
    if self.run_code(code, result):
  File "/Users/mboersma/PycharmProjects/networkembedding/venv/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 2869, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-3a01a08030d8>", line 1, in <module>
    runfile('/Users/mboersma/Documents/phd/students/alex/NetEmbs-dev-4/model/MarcelExperiments.py', wdir='/Users/mboersma/Documents/phd/students/alex/NetEmbs-dev-4/model')
  File "/Applications/PyCharm CE.app/Contents/helpers/pydev/_pydev_bundle/pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "/Applications/PyCharm CE.app/Contents/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/Users/mboersma/Documents/phd/students/alex/NetEmbs-dev-4/model/MarcelExperiments.py", line 48, in <module>
    d = prepare_dataMarcel(d)
  File "/Users/mboersma/Documents/phd/students/alex/NetEmbs-dev-4/NetEmbs/DataProcessing/prepare_data.py", line 115, in prepare_dataMarcel
    local_logger.info("Final shape of DataFrame is ", original_df.shape)
Message: 'Final shape of DataFrame is '
Arguments: ((36211, 7),)
AlexWorldD commented 5 years ago

Ohh, my mistake, sorry. The logging module treats extra positional arguments as %-style format arguments for the message, and my message had no placeholder for the shape tuple. Simply change line 115 in prepare_data.py to

        local_logger.info("Final shape of DataFrame is "+str(original_df.shape))

or pull from git.
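
An alternative to the string concatenation, sketched here with a stand-in logger (the logger name and the in-memory stream are assumptions for the example, not part of NetEmbs): give the message a `%s` placeholder and pass the tuple as the argument, so that logging does the formatting itself.

```python
import io
import logging

# Stand-in logger writing to an in-memory stream, for illustration only.
stream = io.StringIO()
local_logger = logging.getLogger("netembs_example")  # hypothetical name
local_logger.setLevel(logging.INFO)
local_logger.addHandler(logging.StreamHandler(stream))
local_logger.propagate = False

shape = (36211, 7)
# The %s placeholder consumes the extra argument, so "msg % args" succeeds.
local_logger.info("Final shape of DataFrame is %s", shape)
print(stream.getvalue(), end="")
```

With the placeholder form, the message is only formatted if the record is actually emitted.
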

AlexWorldD commented 5 years ago

For the first run, try executing only the following part of the given script:

if __name__ == '__main__':
    print("Welcome to NetEmbs application!")
    MAIN_LOGGER = log_me()
    MAIN_LOGGER.info("Started..")
    if MODE == "SimulatedData":
        d = upload_data("../Simulation/FSN_Data.db", limit=1000)
        d = prepare_data(d)

    if MODE == "RealData":
        # //////// UPLOAD your data HERE \\\\\\\\\\
        d = bData()
        # //////// END  \\\\\\\\\\
        d = rename_columns(d,
                           names={"transactionID": "ID", "accountID": "FA_Name", "BR": "GroundTruth",
                                  "amount": "Value"})
        d = prepare_dataMarcel(d)
    # Now we should have good and clean dataset
    # let's check it
    countDirtyData(d, ["Debit", "Credit"])

The final output in the console should look like this:

Strings in numeric columns:  {'Debit': 0, 'Credit': 0}
NaN in numeric columns:  {'Debit': 0, 'Credit': 0}

After that, we can be sure that no string or NaN values will be used as weights in the build() method.
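
If a hard stop is preferred over reading the printed counts, the same check can be sketched as a guard. `assert_clean` is a hypothetical helper name, not something that exists in NetEmbs, and the sample frame is made up:

```python
import pandas as pd

def assert_clean(df, cols):
    """Raise ValueError if any listed column holds str or NaN values."""
    for title in cols:
        n_str = int(df[title].map(lambda x: isinstance(x, str)).sum())
        n_nan = int(df[title].isnull().sum())
        if n_str or n_nan:
            raise ValueError(f"{title}: {n_str} strings, {n_nan} NaN values")

# Hypothetical clean data: the guard passes silently.
clean = pd.DataFrame({"Debit": [100.0, 0.0], "Credit": [0.0, 100.0]})
assert_clean(clean, ["Debit", "Credit"])
```
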