eddelbuettel / rcppcnpy

Rcpp bindings for NumPy files
GNU General Public License v2.0

Different Result via numpy Read #25

Closed andreasmeid closed 4 years ago

andreasmeid commented 4 years ago

I noticed that reading a numpy file gave a result that differs from the corresponding source file in rds format:

[1] "R head"
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0
[1] "numpy head"
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0

The files, together with the R code, are at https://github.com/andreasmeid/RcppCNPy

[Edits by @eddelbuettel, who continues to point out that this is not really reproducible as we do not know what created y.npy and y.rds.]

eddelbuettel commented 4 years ago

Hi @andreasmeid -- thanks for raising the issue (but maybe edit the question [an option under the '...' menu in the top right corner] to be in English).

To be truly reproducible we need to know how y.npy and y.rds were created. See what I did in tests/ which includes createFiles.py and its use in the initial vignette.

Lastly, also see the reticulate vignette for an alternative way to get to your data. If that works then we likely miss a transpose or something.

eddelbuettel commented 4 years ago

Using reticulate:

R> library(reticulate)
R> np <- import("numpy")
R> b <- np$load("y.npy")
R> b[0:20]
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
R>

It is possible that something is wrong with integers.

andreasmeid commented 4 years ago

Looks good and appears to be the reason indeed. Due to my proxy-restriction, I am not able to import python modules when using reticulate, so I cannot fully reproduce it with the reticulate package.

eddelbuettel commented 4 years ago

I still do not know how you created y.npy and y.rds.

andreasmeid commented 4 years ago

The numpy file has the following origin:

df = pd.read_csv("data.csv")

where data.csv begins:

"","cvd"
"1",FALSE
"2",FALSE

y = np.array(df["cvd"]).astype(np.int32)

It is finally saved with np.save(base_dir + "/y.npy", y)

The rds file has the following origin:

df = read.table("data.csv", header=TRUE, sep=",")
y = as.numeric(df[,"cvd"])

It is finally saved with saveRDS(y, "y.rds")

Maybe the conversion from Boolean to integer is an issue?

eddelbuettel commented 4 years ago

That STILL hasn't made it reproducible as data.csv is not available.

Also in the code you show you use as.numeric() on the R side. Maybe you meant as.integer if you wanted integer on both sides?
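One way to see which integer type actually ended up in a .npy file is to load it back in Python and look at its dtype. The sketch below is only an illustration: the file names, the helper `describe_npy`, and the three-element boolean array are made up for the example and are not the files from this issue.

```python
import os
import tempfile

import numpy as np


def describe_npy(path):
    """Return the dtype name and shape of the array stored in a .npy file."""
    arr = np.load(path)
    return arr.dtype.name, arr.shape


# Hypothetical stand-in for a boolean column such as 'cvd'.
flags = np.array([False, False, True])

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names, written to a throwaway directory.
    path32 = os.path.join(tmp, "y_int32.npy")
    path64 = os.path.join(tmp, "y_int64.npy")
    np.save(path32, flags.astype(np.int32))
    np.save(path64, flags.astype(np.int64))
    info32 = describe_npy(path32)
    info64 = describe_npy(path64)

print(info32)  # ('int32', (3,))
print(info64)  # ('int64', (3,))
```

Running the same check on the real y.npy would show whether the values were stored as 32-bit or 64-bit integers before comparing them with the R side.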

andreasmeid commented 4 years ago

as.numeric() might be a reason, let’s see, but that was the original code. In addition, the first 100 lines of the source data are now provided at https://github.com/andreasmeid/RcppCNPy

eddelbuettel commented 4 years ago

Please do me a favour and post complete and reproducible steps. It is, quite frankly, a little tedious that I still have to beg you about that even after five or so emails or messages.

edd@rob:/tmp/rcppcnpy-bugreport(master)$ ./createData.py 
Traceback (most recent call last):
  File "./createData.py", line 11, in <module>
    y = np.array(df["cvd"]).astype(np.int32)
  File "/usr/lib/python2.7/dist-packages/pandas/core/frame.py", line 2688, in __getitem__
    return self._getitem_column(key)
  File "/usr/lib/python2.7/dist-packages/pandas/core/frame.py", line 2695, in _getitem_column
    return self._get_item_cache(key)
  File "/usr/lib/python2.7/dist-packages/pandas/core/generic.py", line 2491, in _get_item_cache
    values = self._data.get(item)
  File "/usr/lib/python2.7/dist-packages/pandas/core/internals.py", line 4115, in get
    loc = self.items.get_loc(item)
  File "/usr/lib/python2.7/dist-packages/pandas/core/indexes/base.py", line 3080, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'cvd'
edd@rob:/tmp/rcppcnpy-bugreport(master)$ 

Code does not write itself and I still cannot reproduce your issue. My patience and willingness to debug your problem go down with each attempt.
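The `KeyError: 'cvd'` in the traceback means the loaded table has no column with that name. A quick way to confirm this before indexing is to print the header row of the CSV file. The sketch below uses only the standard library's csv module rather than pandas, and the in-memory sample file and the helper `column_names` are hypothetical stand-ins, not the actual data.csv.

```python
import csv
import io

# Hypothetical sample: a CSV whose data column is not named 'cvd'.
sample_csv = io.StringIO('"","x"\n"1",FALSE\n"2",FALSE\n')


def column_names(handle):
    """Return the header row of an open CSV file as a list of strings."""
    reader = csv.reader(handle)
    return next(reader)


names = column_names(sample_csv)
wanted = "cvd"
has_wanted = wanted in names

print(names)       # ['', 'x']
print(has_wanted)  # False
```

With pandas, looking at `df.columns` after `pd.read_csv` gives the same information.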

andreasmeid commented 4 years ago

Phew, sorry that I'm not that familiar with Python; I took the code lines from another project and, honestly, was happy enough that the Python code ran at all. In R, there seems to be no problem comparing the output with the source file. So, pragmatically, I'll avoid Python for this step.

eddelbuettel commented 4 years ago

For closure: This was operator error. The Python file we compared against was written the wrong way. Even though R only has 32-bit integers, we need 64-bit integers from Python. So something like the following worked:

#!/usr/bin/python

import pandas as pd
import numpy as np

base_dir = "newdata"
df = pd.read_csv("data.csv")
## NB 1: data file does not correspond to code example, column is called 'x'
## NB 2: cast to int64 is important
y = np.array(df["x"]).astype(np.int64)
#print(y.dtype)
np.save(base_dir + "/y.npy", y)
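To illustrate why the integer width matters, here is a minimal sketch of what happens in general when a buffer of 32-bit integers is interpreted as 64-bit integers. It assumes little-endian data and a reader that expects 64-bit values. It is not RcppCNPy's actual code path, and the helper name `reinterpret_int32_as_int64` and the sample values are invented for the example.

```python
import struct


def reinterpret_int32_as_int64(values):
    """Pack values as little-endian int32, then read the bytes back as int64."""
    if len(values) % 2:
        raise ValueError("need an even number of 32-bit values")
    raw = struct.pack("<%di" % len(values), *values)
    return list(struct.unpack("<%dq" % (len(values) // 2), raw))


# Hypothetical 0/1 data, stored as eight 32-bit integers.
original = [0, 0, 1, 0, 0, 1, 0, 0]
misread = reinterpret_int32_as_int64(original)

print(misread)  # [0, 1, 4294967296, 0]
```

Each pair of 32-bit values collapses into one 64-bit value, so the result has half as many elements and the ones no longer line up with the original positions. Writing the array as int64 in the first place avoids the mismatch.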