Open AlexWorldD opened 5 years ago
Hi, Marcel! I'm digging deeper into our data cleaning, but I'm a bit confused about the possible source of the wrong values. According to your code
df.amount = df.amount.apply(lambda x: x.replace(",", "."))
df.amount = df.amount.astype(float)
the second line should raise an error if it encounters a string that cannot be parsed as a float, yet you don't get any error when running your part of the code.
Could you run the following function (after the code mentioned above) to check whether that is really the case: countDirtyData(df)
def countStrings(df, col=["amount"]):
    # Count the values that are still of type str in each given column
    output = dict()
    for title in col:
        output[title] = df[title].map(lambda x: 1 if isinstance(x, str) else 0).sum()
    return output

def countNaN(df, col=["amount"]):
    # Count the missing (NaN/None) values in each given column
    output = dict()
    for title in col:
        output[title] = df[title].isnull().sum()
    return output

def countDirtyData(df, col=["amount"]):
    # Print both counts for the given columns
    print("Strings in numeric columns: ", countStrings(df, col))
    print("NaN in numeric columns: ", countNaN(df, col))
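As an illustration of where such values can come from, here is a minimal sketch of a more forgiving conversion. It is not the code from the repository: the helper name clean_amount and the sample values are assumptions made for this example. It uses pandas.to_numeric with errors="coerce", so entries that cannot be parsed become NaN instead of raising, and the two counts show what is left afterwards.

```python
import pandas as pd


def clean_amount(series: pd.Series) -> pd.Series:
    """Convert decimal-comma strings to floats; unparseable entries become NaN.

    Hypothetical helper for illustration, not part of NetEmbs.
    """
    # Cast everything to text first so that .str.replace works on every entry
    as_text = series.astype(str).str.replace(",", ".", regex=False)
    # errors="coerce" turns anything that is not a number into NaN
    return pd.to_numeric(as_text, errors="coerce")


# Made-up sample: decimal comma, decimal point, garbage, missing value, plain int
raw = pd.Series(["1,5", "2.25", "abc", None, 3])
cleaned = clean_amount(raw)

n_strings = int(cleaned.map(lambda x: isinstance(x, str)).sum())
n_nan = int(cleaned.isnull().sum())
print("Strings left:", n_strings)
print("NaN after cleaning:", n_nan)
```

With this approach no strings survive the conversion, but the NaN count tells you how many entries were not valid numbers in the first place.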
I think the NaN values had to do with entries that are all zero, as mentioned in the other issue. I'll give your code a try, but that will be tomorrow.
Kind regards,
Marcel Boersma
Hi!
This is no longer relevant, because I've added NaN and string cleaning, as well as handling for the case you described in the other issue. At least in my experiment, it works fine. Feel free to look at the notebook.
I did my best with the Python script. To work with your data, uncomment the following line: # MODE = "RealData"
But first of all, I recommend testing it with the simulated data one more time on your own laptop.
Alex
PS - Negative values in Credit/Debit should be fixed by the following part of the data pre-processing procedure:
row["Credit"] = abs(row["Value"]) if row["type"] == "credit" else 0.0
row["Debit"] = abs(row["Value"]) if row["type"] == "debit" else 0.0
That's why I've written a modified function, prepare_dataMarcel, which operates on your initial DataFrame. It is important that the DataFrame has the following columns:
["transactionID", "accountID", "BR", "amount", "type"]
For renaming, I recommend using rename_columns() from NetEmbs.DataProcessing:
rename_columns(d, names={"transactionID": "ID", "accountID": "FA_Name", "BR": "GroundTruth", "amount": "Value"})
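For reference, the Credit/Debit split above can also be written column-wise. The following is only a sketch with made-up sample rows, assuming the DataFrame already has the "Value" and "type" columns after renaming; it is not the implementation inside prepare_dataMarcel.

```python
import numpy as np
import pandas as pd

# Made-up sample rows; "Value" may be negative in the raw data
df = pd.DataFrame({
    "Value": [-100.0, 250.0, -30.0],
    "type": ["credit", "debit", "debit"],
})

# Put the absolute amount into the matching column, 0.0 into the other one
df["Credit"] = np.where(df["type"] == "credit", df["Value"].abs(), 0.0)
df["Debit"] = np.where(df["type"] == "debit", df["Value"].abs(), 0.0)

print(df)
```

Taking the absolute value before the split is what guarantees that neither column ends up with a negative entry.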
Thanks! The computer analysed data for 48 hours and most datasets went through fine:)
Kind regards,
Marcel Boersma
@AlexWorldD everything seems to work but I do get this message:
--- Logging error ---
Final shape of DataFrame is (36211, 7)
Traceback (most recent call last):
File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/logging/__init__.py", line 983, in emit
msg = self.format(record)
File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/logging/__init__.py", line 829, in format
return fmt.format(record)
File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/logging/__init__.py", line 569, in format
record.message = record.getMessage()
File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/logging/__init__.py", line 331, in getMessage
msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
File "/Applications/PyCharm CE.app/Contents/helpers/pydev/pydevconsole.py", line 511, in <module>
pydevconsole.start_server(host, int(port), int(client_port), client_host)
File "/Applications/PyCharm CE.app/Contents/helpers/pydev/pydevconsole.py", line 336, in start_server
process_exec_queue(interpreter)
File "/Applications/PyCharm CE.app/Contents/helpers/pydev/pydevconsole.py", line 192, in process_exec_queue
more = interpreter.add_exec(code_fragment)
File "/Applications/PyCharm CE.app/Contents/helpers/pydev/_pydev_bundle/pydev_console_utils.py", line 281, in add_exec
more = self.do_add_exec(code_fragment)
File "/Applications/PyCharm CE.app/Contents/helpers/pydev/_pydev_bundle/pydev_ipython_console.py", line 41, in do_add_exec
res = bool(self.interpreter.add_exec(code_fragment.text))
File "/Applications/PyCharm CE.app/Contents/helpers/pydev/_pydev_bundle/pydev_ipython_console_011.py", line 442, in add_exec
self.ipython.run_cell(line, store_history=True)
File "/Users/mboersma/PycharmProjects/networkembedding/venv/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 2705, in run_cell
interactivity=interactivity, compiler=compiler, result=result)
File "/Users/mboersma/PycharmProjects/networkembedding/venv/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 2815, in run_ast_nodes
if self.run_code(code, result):
File "/Users/mboersma/PycharmProjects/networkembedding/venv/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 2869, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-2-3a01a08030d8>", line 1, in <module>
runfile('/Users/mboersma/Documents/phd/students/alex/NetEmbs-dev-4/model/MarcelExperiments.py', wdir='/Users/mboersma/Documents/phd/students/alex/NetEmbs-dev-4/model')
File "/Applications/PyCharm CE.app/Contents/helpers/pydev/_pydev_bundle/pydev_umd.py", line 197, in runfile
pydev_imports.execfile(filename, global_vars, local_vars) # execute the script
File "/Applications/PyCharm CE.app/Contents/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/Users/mboersma/Documents/phd/students/alex/NetEmbs-dev-4/model/MarcelExperiments.py", line 48, in <module>
d = prepare_dataMarcel(d)
File "/Users/mboersma/Documents/phd/students/alex/NetEmbs-dev-4/NetEmbs/DataProcessing/prepare_data.py", line 115, in prepare_dataMarcel
local_logger.info("Final shape of DataFrame is ", original_df.shape)
Message: 'Final shape of DataFrame is '
Arguments: ((36211, 7),)
Ohh, my mistake, sorry. The logging package treats extra positional arguments as %-format arguments for the message, and my message had no placeholder for the shape tuple. Simply change line 115 in prepare_data.py to
local_logger.info("Final shape of DataFrame is "+str(original_df.shape))
or pull from git.
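An alternative to the string concatenation is to give the message a %s placeholder and pass the shape as an argument, which is how the standard logging module expects arguments to be used. A minimal sketch, where the logger name and the in-memory stream are assumptions made only so the example is self-contained:

```python
import io
import logging

# Log into an in-memory stream so the result can be inspected
stream = io.StringIO()
logger = logging.getLogger("netembs_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(logging.StreamHandler(stream))

shape = (36211, 7)  # shape reported in the traceback above

# The tuple is formatted into the %s placeholder only when the record is emitted
logger.info("Final shape of DataFrame is %s", shape)

logged = stream.getvalue()
print(logged, end="")
```

Both variants produce the same message; the placeholder form just defers the formatting until the record is actually written.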
For the first run, try executing only the following part of the given script:
if __name__ == '__main__':
    print("Welcome to NetEmbs application!")
    MAIN_LOGGER = log_me()
    MAIN_LOGGER.info("Started..")
    if MODE == "SimulatedData":
        d = upload_data("../Simulation/FSN_Data.db", limit=1000)
        d = prepare_data(d)
    if MODE == "RealData":
        # //////// UPLOAD your data HERE \\\\\\\\\\
        d = bData()
        # //////// END \\\\\\\\\\
        d = rename_columns(d,
                           names={"transactionID": "ID", "accountID": "FA_Name", "BR": "GroundTruth",
                                  "amount": "Value"})
        d = prepare_dataMarcel(d)
    # Now we should have good and clean dataset
    # let's check it
    countDirtyData(d, ["Debit", "Credit"])
The final output in the console should look like this:
Strings in numeric columns: {'Debit': 0, 'Credit': 0}
NaN in numeric columns: {'Debit': 0, 'Credit': 0}
After that, we can be sure that no string or NaN values will be used as weights in the build() method.
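If you prefer the script to stop instead of only printing counts, a guard like the one below could be called before build(). This is a hypothetical helper, not something that exists in NetEmbs; it only relies on pandas' is_numeric_dtype and isnull.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def assert_clean(df: pd.DataFrame, columns) -> None:
    """Raise ValueError if a column is non-numeric or contains NaN (hypothetical helper)."""
    for name in columns:
        if not is_numeric_dtype(df[name]):
            raise ValueError(f"Column {name!r} is not numeric: {df[name].dtype}")
        n_missing = int(df[name].isnull().sum())
        if n_missing:
            raise ValueError(f"Column {name!r} contains {n_missing} NaN value(s)")


# Made-up frames for illustration
good = pd.DataFrame({"Debit": [0.0, 10.0], "Credit": [5.0, 0.0]})
bad = pd.DataFrame({"Debit": [0.0, float("nan")], "Credit": [5.0, 0.0]})

assert_clean(good, ["Debit", "Credit"])  # passes silently

try:
    assert_clean(bad, ["Debit", "Credit"])
    bad_rejected = False
except ValueError as err:
    bad_rejected = True
    print("Rejected:", err)
```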
See #3 for an additional cleaning step.