ccdmb / predector

Effector prediction pipeline based on protein properties.
Apache License 2.0

XGBoost BUG: #93

Open MmasterT opened 1 year ago

MmasterT commented 1 year ago

Describe the bug

The final step of the pipeline is failing, apparently because of a dtype/parameter problem like the one described in this Stack Overflow question:

https://stackoverflow.com/questions/66491801/i-got-this-error-dataframe-dtypes-for-data-must-be-int-float-bool-or-categori

To Reproduce I've cloned the repo and changed some of the configs to run in a Slurm context with no internet access. Everything is created and analyzed as expected except the final file.

sbatch -p ei-cb -J predector_test -o predector_test.%j.log -c 1 --mem 10G --wrap " source nextflow-22.04.0_CBG && nextflow run ~/singularity/predector/predector/main.nf --phibase /ei/cb/common/Databases/predector/phi-base_current.fas --pfam_hmm /ei/cb/common/Databases/predector/Pfam-A.hmm.gz --pfam_dat /ei/cb/common/Databases/predector/Pfam-A.hmm.dat.gz --dbcan /ei/cb/common/Databases/predector/dbCAN-HMMdb-V11.txt --effectordb /ei/cb/common/Databases/predector/effectordb.hmm.gz -profile test -with-singularity ~/singularity/predector/predector-1.2.7.sif -resume ~/singularity/predector/predector/ -c ~/singularity/predector/predector/nextflow.config -with-report"

Expected behavior Expected to get the *rank_result.tsv file of the test run.

Error Log Error executing process > 'rank_results (test_set)'

Caused by: Process rank_results (test_set) terminated with an error exit status (2)

Command executed:

predutils load_db --mem "2" tmp.db results.ldjson

predutils rank --mem "2" --dbcan dbcan.txt --pfam pfam.txt --outfile "test_set-ranked.tsv" --secreted-weight "2" --sigpep-good-weight "0.003" --sigpep-ok-weight "0.0001" --single-transmembrane-weight "-0.7" --multiple-transmembrane-weight "-1.0" --deeploc-extracellular-weight "1.3" --deeploc-intracellular-weight "-1.3" --deeploc-membrane-weight "-0.25" --targetp-mitochondrial-weight "-0.5" --effectorp1-weight "0.5" --effectorp2-weight "2.5" --effectorp3-apoplastic-weight "0.5" --effectorp3-cytoplasmic-weight "0.5" --effectorp3-noneffector-weight "-2.5" --deepredeff-fungi-weight "0.1" --deepredeff-oomycete-weight "0.0" --effector-homology-weight "2" --virulence-homology-weight "0.5" --lethal-homology-weight "-2" --tmhmm-first-60-threshold "10" tmp.db

rm -f tmp.db

Command exit status: 2

Command output: (empty)

DataFrame.dtypes for data must be int, float, bool or category. When categorical type is supplied, DMatrix parameter enable_categorical must be set to True. Invalid columns:signalp3_nn_d
Traceback (most recent call last):
  File "/opt/conda/envs/predector/lib/python3.9/site-packages/predectorutils/main.py", line 253, in main
    rank_runner(args)
  File "/opt/conda/envs/predector/lib/python3.9/site-packages/predectorutils/subcommands/rank.py", line 1577, in runner
    raise e
  File "/opt/conda/envs/predector/lib/python3.9/site-packages/predectorutils/subcommands/rank.py", line 1575, in runner
    inner(con, cur, args)
  File "/opt/conda/envs/predector/lib/python3.9/site-packages/predectorutils/subcommands/rank.py", line 1561, in inner
    df["effector_score"] = run_ltr(df)
  File "/opt/conda/envs/predector/lib/python3.9/site-packages/predectorutils/subcommands/rank.py", line 1503, in run_ltr
    dmat = xgb.DMatrix(df_features)
  File "/opt/conda/envs/predector/lib/python3.9/site-packages/xgboost/core.py", line 532, in inner_f
    return f(**kwargs)
  File "/opt/conda/envs/predector/lib/python3.9/site-packages/xgboost/core.py", line 643, in __init__
    handle, feature_names, feature_types = dispatch_data_backend(
  File "/opt/conda/envs/predector/lib/python3.9/site-packages/xgboost/data.py", line 896, in dispatch_data_backend
    return _from_pandas_df(data, enable_categorical, missing, threads,
  File "/opt/conda/envs/predector/lib/python3.9/site-packages/xgboost/data.py", line 345, in _from_pandas_df
    data, feature_names, feature_types = _transform_pandas_df(
  File "/opt/conda/envs/predector/lib/python3.9/site-packages/xgboost/data.py", line 283, in _transform_pandas_df
    _invalid_dataframe_dtype(data)
  File "/opt/conda/envs/predector/lib/python3.9/site-packages/xgboost/data.py", line 247, in _invalid_dataframe_dtype
    raise ValueError(msg)
ValueError: DataFrame.dtypes for data must be int, float, bool or category. When categorical type is supplied, DMatrix parameter enable_categorical must be set to True.
Invalid columns:signalp3_nn_d


Additional context I think changing xgb.DMatrix(df_features) to xgb.DMatrix(df_features, enable_categorical=True) should fix it.
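For what it's worth, the error can be reproduced with pandas alone: a column whose values are all missing ends up with object dtype, which is exactly what DMatrix rejects. This is only an illustrative sketch (the column name is taken from the log; the coercion shown is a workaround idea, not the pipeline's actual code):

```python
import pandas as pd

# Simulate the failing case: every SignalP3 value is missing, so pandas
# stores the column with object dtype, which xgboost's DMatrix rejects
# with the "Invalid columns" error seen in the log.
df = pd.DataFrame({
    "effectorp2": [0.9, 0.1, 0.5],        # a normal float feature
    "signalp3_nn_d": [None, None, None],  # all SignalP3 runs failed
})
assert str(df["signalp3_nn_d"].dtype) == "object"

# Rather than enable_categorical=True, coercing the column to numeric
# turns the missing values into float NaNs, which DMatrix treats as
# ordinary missing data instead of an invalid dtype.
df["signalp3_nn_d"] = pd.to_numeric(df["signalp3_nn_d"], errors="coerce")
print(df["signalp3_nn_d"].dtype)  # float64
```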

darcyabjones commented 1 year ago

Hey MmasterT,

Sorry for the major delay responding to you.

I've finally had some time to come back to this. The issue you're experiencing seems like it might be caused by SignalP3-NN failing. It's old software, and when something goes wrong it doesn't tell you, so it's hard to raise an error. It seems like the column "signalp3_nn_d" (which should be a float) is probably all missing values (because the SignalP3 runs all failed), so xgboost can't determine the right data type.
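One quick way to confirm this diagnosis before the DMatrix call would be to scan the feature frame for columns that are entirely missing or non-numeric. This is a hypothetical helper for illustration, not part of predutils:

```python
import pandas as pd

def find_suspect_columns(df: pd.DataFrame) -> dict:
    """Flag columns that would make xgboost's DMatrix raise:
    object-dtype columns, and columns where every value is missing
    (usually a sign an upstream tool, e.g. SignalP3, silently failed)."""
    return {
        "object_dtype": [c for c in df.columns if df[c].dtype == object],
        "all_missing": [c for c in df.columns if df[c].isna().all()],
    }

# Example mirroring the failing run: signalp3_nn_d has no values at all.
features = pd.DataFrame({
    "effectorp2": [0.9, 0.1],
    "signalp3_nn_d": [None, None],
})
print(find_suspect_columns(features))
# {'object_dtype': ['signalp3_nn_d'], 'all_missing': ['signalp3_nn_d']}
```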

It's hard to tell without really delving into the run, but that's the cause I've encountered before for errors like this. I've seen issues with the SignalP3 neural network model in the past, caused by a compiled component segfaulting and the Perl scripts catching the error without reporting it.

It's possibly related to another issue people have reported with SignalP4. I suspect the glibc libraries in some Linux distributions have changed. I'll keep you updated.

Rowena-h commented 1 year ago

Hi @darcyabjones, have you made any progress with this issue? Alternatively, is it possible to run SignalP3 independently and then integrate the results into the Predector run? Thanks!

kevynaguirre commented 3 months ago

Hi, did you manage to solve this bug?

darcyabjones commented 3 months ago

Hi everyone,

Again apologies for my lateness. I did look into it further last year but couldn't reproduce the problem.

I'll look at it again tomorrow while I'm updating the install scripts.

Regarding running SignalP3 separately: yes, that's absolutely possible. I'll add some documentation when I can, but basically you'd process the SignalP3 results (like this: https://github.com/ccdmb/predector/blob/3d2a591fadbe7c398c1ac398371b3b2610a60d46/modules/processes.nf#L496-L500) and then provide them as precomputed results (https://github.com/ccdmb/predector/wiki#providing-pre-computed-results-to-skip-already-processed-proteins).

A+