Closed andreasmeid closed 4 years ago
Hi @andreasmeid -- thanks for raising the issue (but maybe edit the question [an option under the '...' menu in the top right corner] to be in English).
To be truly reproducible we need to know how y.npy and y.rds were created. See what I did in tests/ which includes createFiles.py and its use in the initial vignette.
Lastly, also see the reticulate vignette for an alternative way to get to your data. If that works then we likely miss a transpose or something.
Using reticulate:
R> library(reticulate)
R> np <- import("numpy")
R> b <- np$load("y.npy")
R> b[0:20]
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
R>
It is possible that something is wrong with integers.
Looks good and indeed appears to be the reason. Because of my proxy restriction, I am not able to import Python modules when using reticulate, so I cannot fully reproduce it with the reticulate package.
I still do not know how you created y.npy and y.rds.
The numpy file has the following origin:
df = pd.read_csv("data.csv")
y = np.array(df["cvd"]).astype(np.int32)
It is finally saved with np.save(base_dir + "/y.npy", y)
The rds R file has the following origin:
df = read.table("data.csv", header=TRUE, sep=",")
y = as.numeric(df[,"cvd"])
It is finally saved with saveRDS(y, "y.rds")
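Since the question here is which type actually ends up on disk, one quick check is to load the file back in Python and look at its dtype. The following is only a minimal sketch: the 0/1 values are made up (data.csv is not available), and the file is written to a temporary directory rather than to base_dir.

```python
import os
import tempfile

import numpy as np

# Made-up 0/1 values standing in for the "cvd" column; not the real data.
values = [0, 0, 1, 0, 1]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "y.npy")
    # Same cast as described above: 32-bit integers.
    np.save(path, np.array(values).astype(np.int32))
    stored = np.load(path)

# Show the stored element type and the number of bytes per element.
print(stored.dtype, stored.itemsize)
```

Whatever reads the file on the R side has to expect the same element type and width that this check reports.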
=> Maybe the conversion from Boolean to integer is an issue?
From: Dirk Eddelbuettel [mailto:notifications@github.com] Sent: Friday, 19 July 2019 17:28 To: eddelbuettel/rcppcnpy Cc: Meid, Andreas; Mention Subject: Re: [eddelbuettel/rcppcnpy] Different Result via numpy Read (#25)
That STILL hasn't made it reproducible as data.csv is not available.
Also, in the code you show you use as.numeric() on the R side. Maybe you meant as.integer() if you wanted integer on both sides?
as.numeric() might be a reason, let’s see, but that was the original code. In addition, the first 100 lines of the source data are now provided at https://github.com/andreasmeid/RcppCNPy
Please do me a favour and post complete and reproducible steps. It is, quite frankly, a little tedious that I still have to beg you about that even after five or so emails or messages.
edd@rob:/tmp/rcppcnpy-bugreport(master)$ ./createData.py
Traceback (most recent call last):
File "./createData.py", line 11, in <module>
y = np.array(df["cvd"]).astype(np.int32)
File "/usr/lib/python2.7/dist-packages/pandas/core/frame.py", line 2688, in __getitem__
return self._getitem_column(key)
File "/usr/lib/python2.7/dist-packages/pandas/core/frame.py", line 2695, in _getitem_column
return self._get_item_cache(key)
File "/usr/lib/python2.7/dist-packages/pandas/core/generic.py", line 2491, in _get_item_cache
values = self._data.get(item)
File "/usr/lib/python2.7/dist-packages/pandas/core/internals.py", line 4115, in get
loc = self.items.get_loc(item)
File "/usr/lib/python2.7/dist-packages/pandas/core/indexes/base.py", line 3080, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'cvd'
edd@rob:/tmp/rcppcnpy-bugreport(master)$
Code does not write itself and I still cannot reproduce your issue. My patience and willingness to debug your problem go down with each attempt.
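The KeyError in the traceback means that the CSV file being read has no column named 'cvd'. A minimal sketch of checking the header before indexing is shown below. It uses the standard-library csv module instead of pandas to stay self-contained, and both the inline CSV text and the helper name require_column are made up for illustration.

```python
import csv
import io


def require_column(csv_text, name):
    """Return the values of column `name`, or raise a KeyError listing the columns found."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    if name not in columns:
        raise KeyError(f"column {name!r} not found; available columns: {columns}")
    return [row[name] for row in reader]


# Made-up file content with a single column called "x".
sample = "x\n0\n1\n0\n"

print(require_column(sample, "x"))
try:
    require_column(sample, "cvd")
except KeyError as err:
    print("KeyError:", err)
```

Listing the available columns in the error message makes a mismatch between the script and the data file visible right away.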
Phew, sorry that I'm not that familiar with Python; I took the code lines from another project and, honestly, was happy enough that the Python code ran at all. In R, there seems to be no problem comparing the output with the source file. So, pragmatically, I'll avoid Python for this step.
For closure: This was operator error. The Python file we compared against was written the wrong way. Even though R only has 32-bit integers, we need 64-bit integers from Python. So something like the following worked:
#!/usr/bin/python
import pandas as pd
import numpy as np
base_dir = "newdata"
df = pd.read_csv("data.csv")
## NB 1: data file does not correspond to code example, column is called 'x'
## NB 2: cast to int64 is important
y = np.array(df["x"]).astype(np.int64)
#print(y.dtype)
np.save(base_dir + "/y.npy", y)
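To see why the integer width matters, it may help to look at what happens when bytes written as 32-bit integers are interpreted as 64-bit integers. The sketch below only mimics a reader that assumes int64; it is not RcppCNPy's actual code, and the values are made up.

```python
import numpy as np

# Made-up values stored as little-endian 32-bit integers (4 bytes each).
y32 = np.array([0, 1, 0, 0, 1, 0], dtype="<i4")

# Reinterpret the very same bytes as little-endian 64-bit integers (8 bytes each).
# Each pair of 32-bit values is fused into one 64-bit value.
misread = y32.view("<i8")

# Casting before saving keeps the values intact at 64-bit width.
y64 = y32.astype("<i8")

print(misread.tolist())
print(y64.tolist())
```

The reinterpreted array has half as many elements and different values, whereas the explicit cast preserves every value, which is consistent with the int64 cast in the script above being the important step.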
I noticed that reading a numpy file gave a result that differs from the corresponding source file in rds format:
[1] "R head"
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0
[1] "numpy head"
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0
The files, together with the R code, are at https://github.com/andreasmeid/RcppCNPy
[Edits by @eddelbuettel, who continues to point out that this is not really reproducible as we do not know what created y.npy and y.rds.]