bitextor / bicleaner

Bicleaner is a parallel corpus classifier/cleaner that aims to detect noisy sentence pairs in a parallel corpus.
GNU General Public License v3.0

scipy needs to be upgraded to 1.7.1, and joblib to 1.0.1 #56

Closed lpla closed 3 years ago

lpla commented 3 years ago

Hi. I tried to install bicleaner on Fedora 34 (Python 3.9.6) and got this error:

$ pip3 install bicleaner
Collecting bicleaner
  Downloading bicleaner-0.14-py3-none-any.whl (70 kB)
     |████████████████████████████████| 70 kB 3.6 MB/s 
Collecting pycld2==0.31
  Using cached pycld2-0.31.tar.gz (14.3 MB)
Collecting toolwrapper==0.4.1
  Using cached toolwrapper-0.4.1.tar.gz (2.7 kB)
Collecting sacremoses==0.0.43
  Using cached sacremoses-0.0.43.tar.gz (883 kB)
Requirement already satisfied: fasttext==0.9.2 in /home/lpla/bitextorment/lib/python3.9/site-packages (from bicleaner) (0.9.2)
Collecting joblib==0.14.1
  Using cached joblib-0.14.1-py2.py3-none-any.whl (294 kB)
Collecting PyYAML==5.1.2
  Downloading PyYAML-5.1.2.tar.gz (265 kB)
     |████████████████████████████████| 265 kB 8.2 MB/s 
Requirement already satisfied: scikit-learn==0.22.1 in /home/lpla/bitextorment/lib/python3.9/site-packages (from bicleaner) (0.22.1)
Collecting pytest==5.1.2
  Using cached pytest-5.1.2-py3-none-any.whl (224 kB)
Requirement already satisfied: numpy>=1.18.1 in /home/lpla/bitextorment/lib/python3.9/site-packages (from bicleaner) (1.21.2)
Collecting regex==2019.08.19
  Using cached regex-2019.08.19.tar.gz (654 kB)
Collecting scipy==1.4.1
  Using cached scipy-1.4.1.tar.gz (24.6 MB)
  Installing build dependencies ... /
. 
.
.
[lots of gfortran errors]
.
.
.

  Warning: Unused dummy argument ‘itry’ at (1) [-Wunused-dummy-argument]                                                                                                                                   
  stat.h:8:35:                                                                                                                                                                                             

  Warning: Unused variable ‘t4’ declared at (1) [-Wunused-variable]                                                                                                                                        
  stat.h:8:39:                                                                                                                                                                                             

  Warning: Unused variable ‘t5’ declared at (1) [-Wunused-variable]                                                                                                                                        
  gfortran:f77: scipy/sparse/linalg/eigen/arpack/ARPACK/UTIL/icnteq.f                                                                                                                                      
  gfortran:f77: scipy/sparse/linalg/eigen/arpack/ARPACK/UTIL/zmout.f                                                                                                                                       
  gfortran:f77: scipy/sparse/linalg/eigen/arpack/ARPACK/UTIL/ivout.f                                                                                                                                       
  gfortran:f77: scipy/sparse/linalg/eigen/arpack/ARPACK/UTIL/icopy.f                                                                                                                                       
  gfortran:f77: scipy/sparse/linalg/eigen/arpack/ARPACK/SRC/ssgets.f                                                                                                                                       
  stat.h:8:27:                                                                                                                                                                                             

  Warning: Unused variable ‘t2’ declared at (1) [-Wunused-variable]                                                                                                                                        
  stat.h:8:31:                                                                                                                                                                                             

  Warning: Unused variable ‘t3’ declared at (1) [-Wunused-variable]                                                                                                                                        
  stat.h:8:35:                                                                                                                                                                                             

  Warning: Unused variable ‘t4’ declared at (1) [-Wunused-variable]                                                                                                                                        
  stat.h:8:39:                                                                                                                                                                                             

  Warning: Unused variable ‘t5’ declared at (1) [-Wunused-variable]                                                                                                                                        
  gfortran:f77: scipy/sparse/linalg/eigen/arpack/ARPACK/UTIL/cvout.f                                                                                                                                       
  gfortran:f77: scipy/sparse/linalg/eigen/arpack/ARPACK/UTIL/svout.f                                                                                                                                       
  gfortran:f77: scipy/sparse/linalg/eigen/arpack/ARPACK/UTIL/zvout.f                                                                                                                                       
  gfortran:f77: scipy/sparse/linalg/eigen/arpack/ARPACK/UTIL/dmout.f                                                                                                                                       
  gfortran:f77: /tmp/pip-install-d0kuxk2u/scipy_5d8cd29663cd4698ab5d552d2e8f9474/scipy/_build_utils/src/wrap_dummy_g77_abi.f                                                                               
  error: Command "/usr/bin/gfortran -Wall -g -ffixed-form -fno-second-underscore -fPIC -O3 -funroll-loops -Iscipy/sparse/linalg/eigen/arpack/ARPACK/SRC -I/tmp/pip-build-env-36ar03rx/overlay/lib64/python3.9/site-packages/numpy/core/include -c -c scipy/sparse/linalg/eigen/arpack/ARPACK/SRC/dsaup2.f -o build/temp.linux-x86_64-3.9/scipy/sparse/linalg/eigen/arpack/ARPACK/SRC/dsaup2.o" failed with exit status 1                                                                                                                                                                                                         
  ----------------------------------------                                                                                                                                                                 
  ERROR: Failed building wheel for scipy                                                                                                                                                                   
Failed to build scipy                                                                                                                                                                                      
ERROR: Could not build wheels for scipy which use PEP 517 and cannot be installed directly

Manually upgrading to scipy==1.7.1 in requirements.txt and reinstalling bicleaner fixed it, at least for my Bitextor tests. Please bump the scipy version if that causes no regressions or other side effects.
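For reference, the manual edit to requirements.txt can be sketched in Python. This is only a minimal sketch, assuming the file uses exact `name==version` pins like the ones pip printed above; `bump_pins` is a hypothetical helper written for illustration, not part of bicleaner.

```python
import re

# Matches exact pins such as "scipy==1.4.1"; lines like "numpy>=1.18.1" do not match.
PIN_RE = re.compile(r"^\s*([A-Za-z0-9_.-]+)\s*==\s*\S+\s*$")


def bump_pins(requirements_text, new_pins):
    """Return the requirements text with the selected exact pins rewritten.

    new_pins maps lowercase package names to the version to pin,
    e.g. {"scipy": "1.7.1"}. All other lines are kept as they are.
    """
    out = []
    for line in requirements_text.splitlines():
        match = PIN_RE.match(line)
        if match and match.group(1).lower() in new_pins:
            name = match.group(1)
            line = f"{name}=={new_pins[name.lower()]}"
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Illustrative input built from the pins shown in the pip output above.
    original = "scipy==1.4.1\njoblib==0.14.1\nnumpy>=1.18.1\n"
    print(bump_pins(original, {"scipy": "1.7.1"}), end="")
```

Only lines with an exact `==` pin for a listed package are touched, so range requirements such as the numpy one are left alone.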

EDIT: it seems joblib also needs to be updated for training to work. Running bicleaner-train in Bitextor crashed until I updated joblib to 1.0.1. Logs:

$ time bicleaner-train $training -S "/home/lpla/bitextorment/lib64/python3.9/site-packages/bitextor/data/moses/tokenizer/tokenizer.perl -q -b -a -l en" -T "/home/lpla/bitextorment/lib64/python3.9/site-packages/bitextor/data/moses/tokenizer/tokenizer.perl -q -b -a -l fr" --treat_oovs --normalize_by_length -s en -t fr -d /home/lpla/permanent/en-fr.dic.generated.lex.e2f.gz -D /home/lpla/permanent/en-fr.dic.generated.lex.f2e.gz -f /home/lpla/transient-genbicleaner-en-fr/tempgizamodel.en-fr/corpus.en.filtered.vcb.gz -F /home/lpla/transient-genbicleaner-en-fr/tempgizamodel.en-fr/corpus.fr.filtered.vcb.gz -c $DIR/en-fr.classifier -m /home/lpla/bicleaner-model/new/new-en-fr.yaml --classifier_type random_forest

--------------------------------------------------------------------------------
LokyProcess-66 failed with traceback: 
--------------------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/lpla/bitextorment/lib64/python3.9/site-packages/joblib/externals/loky/backend/popen_loky_posix.py", line 195, in <module>
    process_obj = pickle.load(from_parent)
  File "/home/lpla/bitextorment/lib64/python3.9/site-packages/joblib/externals/loky/backend/queues.py", line 75, in __setstate__
    self._after_fork()
  File "/usr/lib64/python3.9/multiprocessing/queues.py", line 69, in _after_fork
    self._reset(after_fork=True)
  File "/usr/lib64/python3.9/multiprocessing/queues.py", line 73, in _reset
    self._notempty._at_fork_reinit()
AttributeError: '_SafeQueue' object has no attribute '_notempty'

--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
LokyProcess-65 failed with traceback: 
--------------------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/lpla/bitextorment/lib64/python3.9/site-packages/joblib/externals/loky/backend/popen_loky_posix.py", line 195, in <module>
    process_obj = pickle.load(from_parent)
  File "/home/lpla/bitextorment/lib64/python3.9/site-packages/joblib/externals/loky/backend/queues.py", line 75, in __setstate__
    self._after_fork()
  File "/usr/lib64/python3.9/multiprocessing/queues.py", line 69, in _after_fork
    self._reset(after_fork=True)
  File "/usr/lib64/python3.9/multiprocessing/queues.py", line 73, in _reset
    self._notempty._at_fork_reinit()
AttributeError: '_SafeQueue' object has no attribute '_notempty'

.
.
.
[20 more like those]
.
.
.

--------------------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/lpla/bitextorment/bin/bicleaner-train", line 11, in <module>
    main(sys.argv[1:])
  File "/home/lpla/bitextorment/bin/bicleaner-train", line 8, in main
    train.main(args)
  File "/home/lpla/bitextorment/lib64/python3.9/site-packages/bicleaner/bicleaner_train.py", line 519, in main
    perform_training(args)
  File "/home/lpla/bitextorment/lib64/python3.9/site-packages/bicleaner/bicleaner_train.py", line 501, in perform_training
    hgood, hwrong = train_classifier(features_train, features_test, args.classifier_type, args.classifier, Features(None, args.disable_features_quest, args.disable_lang_ident).titles)
  File "/home/lpla/bitextorment/lib64/python3.9/site-packages/bicleaner/bicleaner_train.py", line 205, in train_classifier
    clf.fit(dataset['data'], dataset['target'])
  File "/home/lpla/bitextorment/lib64/python3.9/site-packages/sklearn/model_selection/_search.py", line 710, in fit
    self._run_search(evaluate_candidates)
  File "/home/lpla/bitextorment/lib64/python3.9/site-packages/sklearn/model_selection/_search.py", line 1151, in _run_search
    evaluate_candidates(ParameterGrid(self.param_grid))
  File "/home/lpla/bitextorment/lib64/python3.9/site-packages/sklearn/model_selection/_search.py", line 682, in evaluate_candidates
    out = parallel(delayed(_fit_and_score)(clone(base_estimator),
  File "/home/lpla/bitextorment/lib64/python3.9/site-packages/joblib/parallel.py", line 1017, in __call__
    self.retrieve()
  File "/home/lpla/bitextorment/lib64/python3.9/site-packages/joblib/parallel.py", line 909, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/lpla/bitextorment/lib64/python3.9/site-packages/joblib/_parallel_backends.py", line 562, in wrap_future_result
    return future.result(timeout=timeout)
  File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 445, in result
    return self.__get_result()
  File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 390, in __get_result
    raise self._exception
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1)}

Thank you.

ZJaume commented 3 years ago

there you go

lpla commented 3 years ago

@ZJaume it is likely that joblib needs a version bump too. I updated the issue with more info.

ZJaume commented 3 years ago

done! :smile:

lpla commented 3 years ago

Tested with a clean install of bicleaner using pip3 install ./bicleaner and it worked perfectly on Fedora 34. Thank you!
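For anyone who wants to confirm that an existing environment already has the versions discussed in this issue, the check below may help. It is a rough sketch that uses only the standard library (`importlib.metadata`, Python 3.8+). The `MINIMUMS` table and the helper names are assumptions made for illustration, and the version comparison is deliberately simplified (numeric dotted components only).

```python
from importlib.metadata import PackageNotFoundError, version

# Versions reported as working in this issue (assumed minimums for this check).
MINIMUMS = {"scipy": "1.7.1", "joblib": "1.0.1"}


def as_tuple(version_string):
    """Turn '1.7.1' into (1, 7, 1), stopping at the first non-numeric component.

    Simplified on purpose: suffixes such as 'rc1' or '.post1' are ignored,
    and '1.7' compares lower than '1.7.0'.
    """
    parts = []
    for piece in version_string.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def check(minimums):
    """Return {name: (installed_version_or_None, meets_minimum)}."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = (None, False)
            continue
        report[name] = (installed, as_tuple(installed) >= as_tuple(minimum))
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check(MINIMUMS).items():
        status = "ok" if ok else "needs upgrade"
        print(f"{name}: installed={installed} minimum={MINIMUMS[name]} -> {status}")
```

A dedicated parser such as `packaging.version.Version` handles pre-release and post-release versions more robustly; the tuple comparison here just keeps the sketch dependency-free.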