dedupeio / dedupe

A Python library for accurate and scalable fuzzy matching, record deduplication and entity resolution.
https://docs.dedupe.io
MIT License

Package for conda #407

Closed fgregg closed 8 years ago

fgregg commented 8 years ago

Dedupe is pretty hard to install for Windows users. It would be great to create a conda package.

Shotgunosine commented 8 years ago

I'm a Windows user running Python through conda, and I have not been able to get dedupe installed.

When I try to pip install, I get errors building the wheels for pylbfgs and pyhacrf.

pylbfgs:

  In file included from liblbfgs/lbfgs.c:68:0:
  compat/win32/stdint.h:33:2: error: #error "Use this header only with Microsoft Visual C++ compilers!"
  error: command 'C:\Users\User\AppData\Local\Continuum\Anaconda\Scripts\gcc.bat' failed with exit status 1

pyhacrf:

  In file included from liblbfgs/lbfgs.c:68:0:
  compat/win32/stdint.h:33:2: error: #error "Use this header only with Microsoft Visual C++ compilers!"
  error: command 'C:\Users\User\AppData\Local\Continuum\Anaconda\Scripts\gcc.bat' failed with exit status 1

Do you have any ideas how I might circumvent that?
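
The error text suggests the build was driven by the gcc.bat wrapper in the Anaconda Scripts directory, while the bundled stdint.h expects Microsoft Visual C++. As a rough diagnostic (a sketch only, not part of dedupe or pylbfgs), the following lists which compiler executables are visible on PATH. The function name `find_compilers` is made up for illustration, and `shutil.which` needs Python 3.3 or later.

```python
import shutil
import sys


def find_compilers(names=("gcc", "cl")):
    """Map each compiler executable name to its full path, or None if absent.

    "gcc" is the MinGW/GCC driver; "cl" is the Microsoft Visual C++ compiler.
    """
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    print("platform:", sys.platform)
    for name, path in find_compilers().items():
        print(name, "->", path or "not found on PATH")
```

If only gcc shows up, the build will probably keep hitting the "Use this header only with Microsoft Visual C++ compilers!" guard unless the package is patched or built with MSVC.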

Shotgunosine commented 8 years ago

Your change in pylbfgs fixed that, but pyhacrf still gives an error:

  compile options: '-D__MSVCRT_VERSION__=0x0900 -IC:\\Users\\User\\AppData\\Local\\Continuum\\Anaconda\\lib\\site-packages\\numpy\\core\\include -IC:\Users\User\AppData\Local\Continuum\Anaconda\lib\site-packages\numpy\core\include -IC:\Users\User\AppData\Local\Continuum\Anaconda\include -IC:\Users\User\AppData\Local\Continuum\Anaconda\PC -c'
  gcc -m64 -g -DNDEBUG -DMS_WIN64 -O2 -Wall -Wstrict-prototypes -D__MSVCRT_VERSION__=0x0900 -IC:\\Users\\User\\AppData\\Local\\Continuum\\Anaconda\\lib\\site-packages\\numpy\\core\\include -IC:\Users\User\AppData\Local\Continuum\Anaconda\lib\site-packages\numpy\core\include -IC:\Users\User\AppData\Local\Continuum\Anaconda\include -IC:\Users\User\AppData\Local\Continuum\Anaconda\PC -c pyhacrf/algorithms.c -o build\temp.win-amd64-2.7\Release\pyhacrf\algorithms.o
  Found executable C:\Users\User\AppData\Local\Continuum\Anaconda\Scripts\gcc.bat
  gcc -m64 -g -shared build\temp.win-amd64-2.7\Release\pyhacrf\algorithms.o -LC:\\Users\\User\\AppData\\Local\\Continuum\\Anaconda\\lib\\site-packages\\numpy\\core\\lib -LC:\Users\User\AppData\Local\Continuum\Anaconda\libs -LC:\Users\User\AppData\Local\Continuum\Anaconda\PCbuild\amd64 -lnpymath -lpython27 -lmsvcr90 -o build\lib.win-amd64-2.7\pyhacrf\algorithms.pyd
  Warning: .drectve `/manifestdependency:"type='win32' name='Microsoft.VC90.CRT' version='9.0.21022.8' processorArchitecture='amd64' publicKeyToken='1fc8b3b9a1e18e3b'" /DEFAULTLIB:"python27.lib" /DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized
  Warning: .drectve `/manifestdependency:"type='win32' name='Microsoft.VC90.CRT' version='9.0.21022.8' processorArchitecture='amd64' publicKeyToken='1fc8b3b9a1e18e3b'" /DEFAULTLIB:"python27.lib" /DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized
  C:\\Users\\User\\AppData\\Local\\Continuum\\Anaconda\\lib\\site-packages\\numpy\\core\\lib/npymath.lib(build/temp.win-amd64-2.7/build/src.win-amd64-2.7/numpy/core/src/npymath/npy_math.obj):(.text+0x2e3): undefined reference to `__imp_modff'
  collect2.exe: error: ld returned 1 exit status
  error: Command "gcc -m64 -g -shared build\temp.win-amd64-2.7\Release\pyhacrf\algorithms.o -LC:\\Users\\User\\AppData\\Local\\Continuum\\Anaconda\\lib\\site-packages\\numpy\\core\\lib -LC:\Users\User\AppData\Local\Continuum\Anaconda\libs -LC:\Users\User\AppData\Local\Continuum\Anaconda\PCbuild\amd64 -lnpymath -lpython27 -lmsvcr90 -o build\lib.win-amd64-2.7\pyhacrf\algorithms.pyd" failed with exit status 1

  ----------------------------------------
  Failed building wheel for pyhacrf

fgregg commented 8 years ago

Hmmm... could you move that error over to the pyhacrf issue tracker? Pyhacrf is not a dependency for dedupe. Can you now build a wheel for dedupe?

Shotgunosine commented 8 years ago

Still no luck building dedupe. pyhacrf is a dependency for highered, which seems to be a dependency for dedupe. I'll post the issue in pyhacrf, though. Thanks again for your help.

fgregg commented 8 years ago

Hmm... well, let's work on getting that solved. In the meantime, I think I'll make highered an optional dependency.
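
For readers unfamiliar with the idea, an optional dependency is usually handled by attempting the import and degrading gracefully when it fails. The following is a minimal sketch of that general pattern, not dedupe's actual code; the helper name `optional_import` and the `HAS_HIGHERED` flag are hypothetical.

```python
import importlib
import warnings


def optional_import(module_name):
    """Return the named module, or None (with a warning) if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        warnings.warn(
            "%s is not installed; features that need it are disabled" % module_name
        )
        return None


# "highered" is the package discussed above. This lookup is illustrative only
# and does not reflect how dedupe itself imports it.
highered = optional_import("highered")
HAS_HIGHERED = highered is not None
```

Code that needs the optional feature can then check `HAS_HIGHERED` and raise a clear error, while everything else keeps working on machines where the package cannot be built.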

Shotgunosine commented 8 years ago

As per the discussion at dirko/pyhacrf#22, I'm still having some issues there, but even with pyhacrf not fully functioning I was able to get dedupe installed, and it seems to be working:

Are there any other tests I should do? What trouble will I run into with pyhacrf not working?

C:\Users\User\Documents\code\dedupe\dedupe\backport.py:22: UserWarning: Dedupe does not currently support multiprocessing on Windows
  warnings.warn("Dedupe does not currently support multiprocessing on Windows")
......C:\Users\User\Documents\code\dedupe\dedupe\core.py:67: UserWarning: Requested sample of size 25, only returning 21 possible pairs
  warnings.warn("Requested sample of size %d, only returning %d possible pairs" % (sample_size, n))
...............C:\Users\User\Documents\code\dedupe\dedupe\crossvalidation.py:83: UserWarning: Only providing 4 folds out of 10 requested
  (i, k))
.....................................
----------------------------------------------------------------------
Ran 58 tests in 11.294s

OK

C:\Users\User\Documents\code\dedupe>python tests\\canonical_test.py
C:\Users\User\AppData\Local\Continuum\Anaconda\lib\site-packages\dedupe\backport.py:22: UserWarning: Dedupe does not currently support multiprocessing on Windows
  warnings.warn("Dedupe does not currently support multiprocessing on Windows")
number of known duplicate pairs 112
clustering...
Evaluate Clustering
found duplicate
104
precision
0.971153846154
recall
0.901785714286
ran in  23.4309999943 seconds

C:\Users\User\Documents\code\dedupe>python tests\\canonical_test_matching.py
C:\Users\User\AppData\Local\Continuum\Anaconda\lib\site-packages\dedupe\backport.py:22: UserWarning: Dedupe does not currently support multiprocessing on Windows
  warnings.warn("Dedupe does not currently support multiprocessing on Windows")
number of known duplicate pairs 112
clustering...
Evaluate Clustering
found duplicate
106
precision
0.990566037736
recall
0.9375
ran in  57.361000061 seconds
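
As a sanity check on the numbers in the two runs above, precision and recall can be recomputed from the printed counts. This is a sketch, not the test scripts' own code. The true-positive counts (101 and 105) are an assumption: the tests do not print them, and they are back-calculated here from the reported precision and the "found duplicate" count.

```python
def precision_recall(true_positives, found, known):
    """Return (precision, recall) for a set of predicted duplicate pairs.

    precision = correct pairs / pairs the clustering reported
    recall    = correct pairs / pairs known to be duplicates
    """
    precision = true_positives / float(found)
    recall = true_positives / float(known)
    return precision, recall


# Both runs report 112 known duplicate pairs.
# canonical_test.py: 104 pairs found, assumed 101 of them correct.
print(precision_recall(101, 104, 112))
# canonical_test_matching.py: 106 pairs found, assumed 105 of them correct.
print(precision_recall(105, 106, 112))
```

With those assumed counts, the results agree with the reported precision and recall to the printed precision, which is consistent with dedupe behaving normally despite the pyhacrf build problems.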

fgregg commented 8 years ago

No, it should work fine. We use pyhacrf for an optional string distance, so just don't choose that option. Once you get this working, would you be willing to help set up recipes for https://conda.binstar.org/?

Shotgunosine commented 8 years ago

I'd be happy to help, though I've not set up any recipes there before. Is there a walkthrough for the process somewhere?

fgregg commented 8 years ago

Yes, there's documentation here: http://docs.binstar.org/draft/examples.html#SubmitYourFirstBuild

On Mon, Sep 14, 2015 at 10:30 AM, Dylan Nielson notifications@github.com wrote:

I'd be happy to help, though I've not set up any recipes there before. Is there a walkthrough for the process somewhere?

fgregg commented 8 years ago

I'm going to just build Windows wheels instead: #453