Problem with Retrain DL in Quickannotator

anuradhakar49 commented 3 years ago

Hi, the installation of the tool runs smoothly as described in the GitHub repository, but I am encountering problems with retraining the deep learning model. For example, after adding 2 pairs of images to a new project, making patches and annotations, and uploading them as training and test images, clicking "Retrain model" on the Project page gives me ERROR: train_autoencoder (job N) failed. On the Annotations page, clicking the "Retrain DL" button displays an HTML error.

Please provide suggestions on how to resolve these errors. Anuradha Kar

choosehappy commented 3 years ago

Can you please provide the log files showing what the exact error is?

Unfortunately, this information is too high-level for us to provide any insights.

mariokreutzfeldt commented 2 years ago

Hi @anuradhakar49 and @choosehappy

Were you able to solve the issue? I'm having the same problem:

2021-11-25 13:54:10,872 [INFO] (THREAD 18304) About to train a new transfer model for try2
2021-11-25 13:54:10,887 INFO sqlalchemy.engine.base.Engine ROLLBACK
2021-11-25 13:54:10,887 [INFO] (THREAD 18304) ROLLBACK
2021-11-25 13:54:10,888 [INFO] (THREAD 18304) 127.0.0.1 - - [25/Nov/2021 13:54:10] "GET /api/try2/retrain_dl?frommodelid=0 HTTP/1.1" 404 -

System: Win10, python 3.8, cuda 10.2

Best regards, Mario
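For anyone hitting the same thing, a quick way to pull the failing requests out of a long console log is sketched below. It assumes the access lines follow the format shown in the log above; the function name `failed_requests` is made up for this illustration and is not part of QuickAnnotator.

```python
import re

# Matches access-log entries of the form seen in the log above, e.g.
#   "GET /api/try2/retrain_dl?frommodelid=0 HTTP/1.1" 404 -
REQUEST_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})'
)


def failed_requests(log_text):
    """Return (method, path, status) for every logged request with status >= 400."""
    failures = []
    for match in REQUEST_RE.finditer(log_text):
        status = int(match.group("status"))
        if status >= 400:
            failures.append((match.group("method"), match.group("path"), status))
    return failures


sample = (
    '2021-11-25 13:54:10,888 [INFO] (THREAD 18304) 127.0.0.1 - - '
    '[25/Nov/2021 13:54:10] "GET /api/try2/retrain_dl?frommodelid=0 HTTP/1.1" 404 -'
)
print(failed_requests(sample))
```

This only locates the failing endpoint; the lines logged just before it (here the ROLLBACK) are usually where to look next.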

choosehappy commented 2 years ago

Sorry to hear this Mario!

Is this information you're putting here from the command line itself, or is it coming from the log file?

If you can send over the entire associated log file that would be appreciated

In the end, we were able to fix anuradhakar49's problem; it was environmental. If I remember correctly, it was an incompatible CUDA driver + CUDA version? @tasvora may have additional info.

tasvora commented 2 years ago

Yes, it was an environment issue related to CUDA, but we did not get to look at it in detail, as Anuradha decided to use Linux and it worked fine there.

Regards, Tasneem

anuradhakar49 commented 2 years ago

Yes, this issue is solved; it was linked to the CUDA + torch versions. @mariokreutzfeldt Please check that you have a CUDA-compatible GPU and that your code is able to access it (i.e., the GPU is not busy with another task). Also make sure the PyTorch version is compatible with CUDA 10.2 (https://pytorch.org/get-started/previous-versions/). Otherwise, try a reinstall with the CPU-only version of torch as a test.
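The checks suggested above can be run in one go with something like the following sketch. It only uses standard PyTorch attributes (`torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()`); the helper name `describe_torch_cuda` is an assumption for this example. It does not tell you whether the GPU is busy with another task.

```python
import importlib


def describe_torch_cuda():
    """Summarize whether PyTorch imports and whether it can see a CUDA GPU."""
    report = {
        "torch_installed": False,
        "torch_version": None,
        "built_for_cuda": None,
        "cuda_available": False,
        "devices": [],
    }
    try:
        torch = importlib.import_module("torch")
    except (ImportError, OSError):
        # ImportError: torch is missing; OSError: a native library failed to load.
        return report

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    report["built_for_cuda"] = torch.version.cuda  # None for CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["devices"] = [
            torch.cuda.get_device_name(index)
            for index in range(torch.cuda.device_count())
        ]
    return report


print(describe_torch_cuda())
```

If `built_for_cuda` is None, the installed wheel is a CPU-only build, regardless of what `nvcc --version` reports.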

mariokreutzfeldt commented 2 years ago

Dear all, thank you for your fast replies!!

I have verified the CUDA installation via nvcc --version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:32:27_Pacific_Daylight_Time_2019
Cuda compilation tools, release 10.2, V10.2.89

and the PyTorch installation via torch.cuda.is_available(), which returns True.

During installation of QA I ran into many unresolvable version issues. So I ended up installing the following.

numpy==1.17.3
Flask_SQLAlchemy==2.4.0
scikit_image==0.16.2
scikit_learn==0.24.0
opencv_python_headless==4.1.2.30
scipy==1.4.1
requests==2.22.0
SQLAlchemy==1.3.5
tensorboard==2.4.1
ttach==0.0.2
albumentations==0.4.3
config==0.4.2
Flask==1.0.3
Pillow==8.1.2
llvmlite==0.34.0
numba
umap-learn
Flask_Restless==0.17.0
python-openslide==1.1.2

For PyTorch, the automatic installation had already failed for another project, so I downloaded the packages manually:

torch 1.8.1+cu102
torchaudio 0.10.0+cu102
torchvision 0.9.1+cu102

I installed torch first. When I then installed torchaudio and torchvision, it would uninstall torch and replace it with a non-CUDA version. So I installed torch+cu102 again after installing torchaudio and torchvision.

@choosehappy, the complete log is here

Best regards, Mario
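Since a later install can silently swap in a non-CUDA torch build, as described above, it may help to re-check the installed versions afterwards. The sketch below assumes the wheels carry a local version suffix such as `+cu102`, as the ones listed above do; both helper names are made up for illustration.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version string, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def missing_cuda_tag(versions, tag="cu102"):
    """Return installed packages whose version lacks the '+<tag>' suffix."""
    return sorted(
        name
        for name, version in versions.items()
        if version is not None and not version.endswith("+" + tag)
    )


versions = installed_versions(["torch", "torchvision", "torchaudio"])
print(versions)
print("Not tagged cu102:", missing_cuda_tag(versions))
```

A package showing up in the second line was most likely replaced by a different build during a later install step.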

mariokreutzfeldt commented 2 years ago

Quick additional info: replacing the CUDA with CPU versions of pytorch did not solve it. Still getting ERROR 404.

choosehappy commented 2 years ago

It does look like the environment is really going to be the issue. Those libraries have been tested to work together and are what is used to create, e.g., our Docker files.

Unfortunately, this log file doesn't appear to contain anything interesting. Can you also upload all data.* files? There might be up to 3 of them:

data.db, data.db-shm, data.db-wal
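Before uploading, it can be worth checking whether the database actually contains anything. The file names above suggest a SQLite database in WAL mode, so a minimal sketch with the standard `sqlite3` module is shown here. The helper name `table_row_counts` is hypothetical, and nothing is assumed about which tables QuickAnnotator creates.

```python
import sqlite3


def table_row_counts(db_path):
    """Return {table_name: row_count} for every user table in a SQLite file."""
    connection = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in connection.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        counts = {}
        for name in names:
            # Table names cannot be bound as parameters, so quote them instead.
            query = 'SELECT COUNT(*) FROM "{}"'.format(name.replace('"', '""'))
            counts[name] = connection.execute(query).fetchone()[0]
        return counts
    finally:
        connection.close()
```

Calling `table_row_counts("data.db")` from the directory that holds the file prints nothing by itself, so wrap it in `print(...)`. Note that `sqlite3.connect` creates an empty file if the path does not exist, so double-check the path first.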

mariokreutzfeldt commented 2 years ago

@choosehappy here you go.

It doesn't contain data.db-wal because the file was 0 KB.

choosehappy commented 2 years ago

Okay, this database looks like it was cleaned out

It looks like you restarted quick annotator after you had the error, which by default goes through and clears out old jobs

Can you set this line:

https://github.com/choosehappy/QuickAnnotator/blob/7cf55b1939fc9ad73ccf6d5435b613bfb697c74c/config/config.ini#L7

to False, reproduce your error, and send the files back over?
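If editing the file by hand is inconvenient, a line-based rewrite keeps the comments and ordering of config.ini intact. This is only a sketch: `example_flag` below is a placeholder, not the real option name on the linked line, and the function changes the first matching option regardless of section.

```python
import re


def set_ini_option(text, option, value):
    """Return INI text with the first `option = ...` line set to the new value.

    Works line by line so comments and ordering are preserved.
    Raises KeyError if the option is not present.
    """
    pattern = re.compile(r"^(\s*" + re.escape(option) + r"\s*[=:]\s*).*$")
    lines = text.splitlines()
    for index, line in enumerate(lines):
        match = pattern.match(line)
        if match:
            lines[index] = match.group(1) + value
            return "\n".join(lines) + "\n"
    raise KeyError(option)


# 'example_flag' is a placeholder name used only for this demonstration.
sample = "[section]\n# a comment that should survive\nexample_flag = True\n"
print(set_ini_option(sample, "example_flag", "False"))
```

To apply it to a real file, read the file into a string, pass it through the function with the actual option name, and write the result back.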

tasvora commented 2 years ago

In addition to that:

If you could copy everything you see on the console from which you start the Quick Annotator application, save it as a text file, and send that too, it would help. There may be a specific library error that we are missing.

Regards Tasneem

mariokreutzfeldt commented 2 years ago

Here are the log files and the data.db after changing the config. I am using the CPU version of PyTorch now and have seen that one project is giving me a "not enough training/test images" message, which makes sense. The second project is still giving error 400.

choosehappy commented 2 years ago

hmm...i think we'll have to jump on a call, these log files and database seem to indicate that things are working as expected : )

mariokreutzfeldt commented 2 years ago

Thank you @choosehappy and @tasvora for helping solve this issue! In case someone else has this problem: it turned out that I had a broken svml_dispmd.dll (730 KB instead of 18 MB). Also, make sure scikit-image==0.18.1 is installed.

Best regards, Mario
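To check for a truncated DLL like the one described above, one option is to search the active Python environment for the file and look at its size. This is a rough sketch: the 10 MB cutoff is an arbitrary value between the two sizes reported above, and the helper names are made up for illustration.

```python
import os
import sys


def find_files(root, filename):
    """Yield (path, size_in_bytes) for every file under root with the given name."""
    target = filename.lower()
    for directory, _subdirs, files in os.walk(root):
        for name in files:
            if name.lower() != target:
                continue
            path = os.path.join(directory, name)
            try:
                yield path, os.path.getsize(path)
            except OSError:
                continue  # unreadable entry, e.g. a broken link


def suspiciously_small(root, filename, min_bytes):
    """Return the matches whose size is below min_bytes."""
    return [
        (path, size)
        for path, size in find_files(root, filename)
        if size < min_bytes
    ]


# sys.prefix is the root of the active environment (conda env or virtualenv).
for path, size in suspiciously_small(sys.prefix, "svml_dispmd.dll", 10 * 1024 * 1024):
    print("{} is only {:.0f} KB".format(path, size / 1024))
```

No output means either that no copy of the file was found under the environment or that every copy is above the cutoff.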

stellaqu123 commented 2 years ago

Hi @choosehappy and @mariokreutzfeldt, I have the same problem with Retrain DL in QuickAnnotator. After annotating a patch, when I ran Retrain DL - From base, I got an error message like "ERROR 404: (Unknown error)". The screenshot is as below. The console log is:

2022-06-09 08:49:11,130 INFO sqlalchemy.engine.base.Engine BEGIN (implicit)
2022-06-09 08:49:11,130 [INFO] (THREAD 139621868058368) BEGIN (implicit)
2022-06-09 08:49:11,131 INFO sqlalchemy.engine.base.Engine SELECT project.id AS project_id, project.name AS project_name, project.description AS project_description, project.date AS project_date, project.train_ae_time AS project_train_ae_time, project.make_patches_time AS project_make_patches_time, project.iteration AS project_iteration, project.embed_iteration AS project_embed_iteration FROM project WHERE project.name = ? LIMIT ? OFFSET ?
2022-06-09 08:49:11,131 [INFO] (THREAD 139621868058368) SELECT project.id AS project_id, project.name AS project_name, project.description AS project_description, project.date AS project_date, project.train_ae_time AS project_train_ae_time, project.make_patches_time AS project_make_patches_time, project.iteration AS project_iteration, project.embed_iteration AS project_embed_iteration FROM project WHERE project.name = ? LIMIT ? OFFSET ?
2022-06-09 08:49:11,131 INFO sqlalchemy.engine.base.Engine ('test1', 1, 0)
2022-06-09 08:49:11,131 [INFO] (THREAD 139621868058368) ('test1', 1, 0)
2022-06-09 08:49:11,131 [INFO] (THREAD 139621868058368) About to train a new transfer model for test1
2022-06-09 08:49:11,131 [INFO] (THREAD 139621868058368) About to train a new transfer model for test1
2022-06-09 08:49:11,132 INFO sqlalchemy.engine.base.Engine ROLLBACK
2022-06-09 08:49:11,132 [INFO] (THREAD 139621868058368) ROLLBACK
2022-06-09 08:49:11,132 [INFO] (THREAD 139621868058368) 124.126.17.86 - - [09/Jun/2022 08:49:11] "GET /api/test1/retrain_dl?frommodelid=0 HTTP/1.1" 404 -

Following the earlier discussion in this thread, I checked my CUDA and PyTorch versions, which are compatible, and torch.cuda.is_available() returns True. I hope I can get help with this issue. Best regards, Xiaoping

choosehappy commented 2 years ago

We can start by collecting more information:

1) operating system + version
2) Python version
3) pip freeze output
4) CUDA version
5) NVIDIA GPU version
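Items 1 to 3 of this list can be gathered from inside Python with the standard library alone, as in the sketch below. The function name `collect_environment` is made up for this example. The CUDA and GPU details (items 4 and 5) are not covered here; those still come from `nvcc --version` and `nvidia-smi`.

```python
import platform
import sys
from importlib import metadata


def collect_environment():
    """Gather OS, Python, and installed-package details into one dict."""
    packages = sorted(
        "{}=={}".format(dist.metadata["Name"], dist.version)
        for dist in metadata.distributions()
        if dist.metadata["Name"]
    )
    return {
        "os": platform.platform(),
        "python": sys.version.split()[0],
        "packages": packages,  # roughly what `pip freeze` lists
    }


report = collect_environment()
print("OS:", report["os"])
print("Python:", report["python"])
print("Installed packages:", len(report["packages"]))
```

Run it with the same interpreter that starts QuickAnnotator, otherwise the package list describes a different environment.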

stellaqu123 commented 2 years ago

Sure.

1) Operating system + version: I use an Amazon EC2 Linux system. The output of "cat /proc/version" is "Linux version 4.14.238-125.422.amzn1.x86_64 (mockbuild@koji-pdx-corp-builder-64004) (gcc version 7.2.1 20170915 (Red Hat 7.2.1-2) (GCC)) #1 SMP Tue Jul 20 20:51:46 UTC 2021".
2) Python version: Python 3.8.13
3) pip freeze output: the output is here, pip_freeze_output.txt
4) CUDA version: "nvcc --version" reports:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
5) NVIDIA GPU version: NVIDIA-SMI 450.142.00, Driver Version: 450.142.00, CUDA Version: 11.0, T4
6) torch version and CUDA availability: torch version 1.8.1+cu111; torch.cuda.is_available() returns True

choosehappy commented 2 years ago

hmmm!! this all looks very reasonable!

is there any additional information in the console window at the top of the screen on the right?

In looking at the API itself and the console information you provided, the only 404 message that seems reasonable is here:

https://github.com/choosehappy/QuickAnnotator/blob/cafc757f51b48a2a1048bf7afded8d10a7d58637/QA_api.py#L147

This would seem to suggest that you don't have a base model already trained? is that the case?

if you look here:

https://github.com/choosehappy/QuickAnnotator/wiki/Image-List-Page

did you use the "3. (re)train model 0" button?

this step is needed to give good default weights

stellaqu123 commented 2 years ago

Thanks @choosehappy. I hadn't used the "3. (re)train model 0" button before. When I used it, I got this error message in the console:

TypeError: Descriptors cannot not be created directly. If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0. If you cannot immediately regenerate your protos, some other possible workarounds are:

1. Downgrade the protobuf package to 3.20.x or lower.
2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

After downgrading the protobuf package to 3.19.1, both "3. (re)train model 0" and Retrain DL work. The problem is solved. Thanks for your help! 👍
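A quick way to see whether an environment is exposed to this error is to compare the installed protobuf version against the 3.20.x limit named in the message. The sketch below does that with the standard library; the helper name `exceeds_3_20` is made up, and the suggested pin simply repeats the 3.19.1 version reported to work above.

```python
from importlib import metadata


def exceeds_3_20(version):
    """True if a protobuf version string is newer than the 3.20.x series."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) > (3, 20)


try:
    installed = metadata.version("protobuf")
except metadata.PackageNotFoundError:
    installed = None

if installed is None:
    print("protobuf is not installed")
elif exceeds_3_20(installed):
    print("protobuf {} is newer than 3.20.x; try: pip install protobuf==3.19.1".format(installed))
else:
    print("protobuf {} is within the range the error message suggests".format(installed))
```

The other workaround from the message, setting PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python in the environment before starting the application, avoids the downgrade at the cost of slower parsing.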

choosehappy commented 2 years ago

Fantastic! so you're all set?

Did you encounter this problem when using the provided Docker file, or were you running it on your own base operating system?

stellaqu123 commented 2 years ago

Yes, I can now use the QuickAnnotator Retrain DL function. I didn't use Docker; I just installed the package on my operating system.

choosehappy commented 2 years ago

Got it, thanks

Yes, protobuf can be a tricky one to maintain at the os level :)

choosehappy / QuickAnnotator

Problem with Retrain DL in Quickannotator #13