bgruening / galaxytools

:microscope::books: Galaxy Tool wrappers

Failure running my ML workflows #1115

Open kxk302 opened 3 years ago

kxk302 commented 3 years ago

I have 3 workflows that use Galaxy's ML tools (namely Keras for neural networks). They all worked fine last time I ran them (maybe a month ago?).

These 3 workflows are used in 3 neural network tutorials that I am presenting at GCC 2021. I decided to re-run them to make sure all is good. All 3 workflows fail now. Here is the error message for the first 2 workflows:

Traceback (most recent call last):
  File "/data/share/staging/21069371/tool_files/keras_train_and_eval.py", line 491, in <module>
    targets=args.targets, fasta_path=args.fasta_path)
  File "/data/share/staging/21069371/tool_files/keras_train_and_eval.py", line 405, in main
    estimator.fit(X_train, y_train)
  File "/data/share/tools/_conda/envs/mulled-v1-26f90eb9c8055941081cb6eaef4d0dffb23aadd383641e5d6e58562e0bb08f59/lib/python3.6/site-packages/galaxy_ml/keras_galaxy_models.py", line 911, in fit
    return super(KerasGRegressor, self)._fit(X, y, **kwargs)
  File "/data/share/tools/_conda/envs/mulled-v1-26f90eb9c8055941081cb6eaef4d0dffb23aadd383641e5d6e58562e0bb08f59/lib/python3.6/site-packages/galaxy_ml/keras_galaxy_models.py", line 644, in _fit
    validation_data = self.validation_data

Here are the histories:

  1. https://usegalaxy.eu/u/kaivan/h/dlfnn
  2. https://usegalaxy.eu/u/kaivan/h/dlrnn
  3. https://usegalaxy.eu/u/kaivan/h/dlcnn

Per @anuprulez's suggestion, I downgraded the tool versions, and the first and second workflows work now. Below is the downgrade:

  1. Create a deep learning model architecture: downgraded to 0.4.2
  2. Create a deep learning model with an optimizer, loss function and fit parameters: downgraded to 0.4.2
  3. Deep learning training and evaluation (conduct deep training and evaluation either implicitly or explicitly): downgraded to 1.0.8.2

The third workflow still fails. BTW, it requires the most recent version of the third tool.

I started writing unit tests in galaxytools (https://github.com/kxk302/galaxytools/tree/nn_tests) so that these workflows are run as part of the unit tests. They would serve as regression tests and guarantee that future changes do not break old code. However, I ran into another issue: models saved to file cannot be loaded and error out. I am not sure if this is related to the workflow error above. Here is the error message:

unzip cnn.zip
Archive: cnn.zip
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of cnn.zip or
cnn.zip.zip, and cannot find cnn.zip.ZIP, period.
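
As a quick local sanity check (just a sketch, assuming cnn.zip sits in the working directory), I can verify whether the downloaded file is actually a zip archive at all:

import zipfile

# Check whether cnn.zip carries a valid zip central directory.
path = "cnn.zip"
print("is zipfile:", zipfile.is_zipfile(path))

# Peek at the first bytes; a real zip archive starts with b'PK'.
with open(path, "rb") as fh:
    print("magic bytes:", fh.read(4))

If is_zipfile() returns False, the saved model is probably not a zip archive at all (for example a raw HDF5 file or an error page saved under the .zip name), which would explain the unzip failure above.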

kxk302 commented 3 years ago

@anuprulez I just ran my third workflow (CNN workflow) on galaxy.eu and it failed. Could you please check the log to see what error message we get? Thanks.

kxk302 commented 3 years ago

I only see "Failed to communicate with remote job server."

anuprulez commented 3 years ago

@kxk302 in the first and third histories, I don't have permission to see those datasets. Can you unlock those?

kxk302 commented 3 years ago

Update: I re-ran the third history after the initial failure and it completed successfully.

@anuprulez how do I unlock the datasets? I don't see an option for that when sharing the history. If you want, we can use Gitter to resolve this. Thx

anuprulez commented 3 years ago

I see that some changes were made very recently to https://github.com/goeckslab/Galaxy-ML/tree/master/galaxy_ml

kxk302 commented 3 years ago

https://github.com/goeckslab/Galaxy-ML/tree/master/galaxy_ml

Yes, there was a bug fix in Galaxy-ML that was pushed recently.

kxk302 commented 3 years ago

Here are the links to all workflows and datasets for histories:

First history:

https://training.galaxyproject.org/training-material/topics/statistics/tutorials/FNN/workflows/
https://zenodo.org/record/4660497/files/X_test.tsv
https://zenodo.org/record/4660497/files/X_train.tsv
https://zenodo.org/record/4660497/files/y_test.tsv
https://zenodo.org/record/4660497/files/y_train.tsv

kxk302 commented 3 years ago

Second history:

https://training.galaxyproject.org/training-material/topics/statistics/tutorials/RNN/workflows/
https://zenodo.org/record/4477881/files/X_test.tsv
https://zenodo.org/record/4477881/files/X_train.tsv
https://zenodo.org/record/4477881/files/y_test.tsv
https://zenodo.org/record/4477881/files/y_train.tsv

kxk302 commented 3 years ago

Third history:

https://training.galaxyproject.org/training-material/topics/statistics/tutorials/CNN/workflows/
https://zenodo.org/record/4697906/files/X_train.tsv
https://zenodo.org/record/4697906/files/y_train.tsv
https://zenodo.org/record/4697906/files/X_test.tsv
https://zenodo.org/record/4697906/files/y_test.tsv

kxk302 commented 3 years ago

You need to rename the uploaded files and change their datatype to tabular before running the workflows. Thx.

anuprulez commented 3 years ago

Second history:

https://training.galaxyproject.org/training-material/topics/statistics/tutorials/RNN/workflows/
https://zenodo.org/record/4477881/files/X_test.tsv
https://zenodo.org/record/4477881/files/X_train.tsv
https://zenodo.org/record/4477881/files/y_test.tsv
https://zenodo.org/record/4477881/files/y_train.tsv

I get these errors while running this workflow

[Screenshots attached: galaxy_error, rnn_error_1]

kxk302 commented 3 years ago

@anuprulez did you downgrade the tool versions in the RNN workflow?

anuprulez commented 3 years ago

No, I just ran it

kxk302 commented 3 years ago

If you downgrade the tool versions as I documented, it will work.

kxk302 commented 3 years ago

I guess the question is why it stopped working with the new versions of those tools.

qiagu commented 3 years ago

Try checking the versions of the packages in the Conda environment, including the Python version (make sure it is Python 3.6). The Conda environment contains many packages, and errors are likely when a newer package joins the mix.
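
For example (just a sketch, run from inside the tool's Conda environment; adjust the package list as needed), something like this records the versions in use:

import importlib
import sys

# Print the Python version and the versions of the packages most likely involved.
print("python:", sys.version)
for name in ("numpy", "scipy", "sklearn", "tensorflow", "keras", "galaxy_ml"):
    try:
        module = importlib.import_module(name)
        print(name, getattr(module, "__version__", "unknown"))
    except ImportError:
        print(name, "not installed")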

kxk302 commented 3 years ago

Thanks @qiagu,

Could you please provide more info on how to do that?

qiagu commented 3 years ago

Sorry, I was just describing a general debugging process, not anything specific to the issues mentioned in this thread. From the stderr report @anuprulez provided, I suspect the errors could be cleared up by re-cleaning the input TSVs.

qiagu commented 3 years ago

Try to ensure the classification targets are integers, not floats.
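
For example (a rough sketch with pandas, assuming the targets are in the uploaded y_train.tsv; the output filename is just a placeholder), you can check and fix the dtype before uploading:

import pandas as pd

# Load the targets and check whether the labels were parsed as floats.
y = pd.read_csv("y_train.tsv", sep="\t")
print(y.dtypes)

# If the labels are floats (e.g. 0.0/1.0), cast them to int and re-save as tabular.
y = y.astype(int)
y.to_csv("y_train_int.tsv", sep="\t", index=False)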

kxk302 commented 3 years ago

I do not see the errors that Anup sees. I guess the first step would be to get these workflows working with the older versions of the tools. Then we can use the new versions to reproduce the problem. @anuprulez not sure what your internet connectivity is like, but we could possibly have a Zoom meeting to discuss this tomorrow (Friday). I'm free from 8:00 am to 10:00 am EST.

mvdbeek commented 3 years ago

I only see "Failed to communicate with remote job server."

That's a job running error, not a tool error; you'll want to check it with Nate.

kxk302 commented 3 years ago

I only see "Failed to communicate with remote job server."

That's a job running error, not a tool error; you'll want to check it with Nate.

This was run on EU. I vaguely remember Bjorn saying that some jobs are configured to run on GPU, that this error can show up in that case, and that it goes away when the job is run on CPU. Am I right @bgruening?