About _epoch_train and _epoch_val

fireholder commented 5 years ago

While I was training, the progress came to a standstill. I found that the functions _epoch_train and _epoch_val were stopping it, because they raise NotImplementedError. I wonder why, and how to fix it.
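For context, a method that only raises NotImplementedError usually marks a base-class hook that a subclass is expected to override. Below is a minimal sketch of that pattern. It assumes the trainer has a base class whose `_epoch_train`/`_epoch_val` are placeholders. The class names `TrainerBase` and `ToyTrainer` and the loss values are hypothetical, not the project's code.

```python
class TrainerBase:
    """Base class: owns the epoch loop, leaves the per-epoch work abstract."""

    def __init__(self, epochs):
        self.epochs = epochs

    def train(self):
        history = []
        for epoch in range(self.epochs):
            train_loss = self._epoch_train()
            val_loss = self._epoch_val()
            history.append((epoch, train_loss, val_loss))
        return history

    def _epoch_train(self):
        # Placeholder hook: a subclass must override this.
        raise NotImplementedError

    def _epoch_val(self):
        # Placeholder hook: a subclass must override this.
        raise NotImplementedError


class ToyTrainer(TrainerBase):
    """Hypothetical subclass that supplies the per-epoch logic."""

    def _epoch_train(self):
        return 1.0  # stand-in for a real training loss

    def _epoch_val(self):
        return 0.5  # stand-in for a real validation loss
```

Calling `TrainerBase(epochs=1).train()` raises NotImplementedError, while `ToyTrainer(epochs=2).train()` returns `[(0, 1.0, 0.5), (1, 1.0, 0.5)]`. If you see this error, one thing to check is whether the script calls `train()` on an object whose class actually overrides both hooks.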

Ike-yang commented 5 years ago

Hi, bro, I am trying to run trainer.py, but I don't understand the argument "--load_model_path". There is nothing in the current folder, and I am not sure what kind of pretrained model needs to be loaded here. Any advice?

fireholder commented 5 years ago

I think '--load_model_path' is only used when loading a pretrained model, but log.txt shows errors when no model files are loaded.

Ike-yang commented 5 years ago

Exactly, I got something like this in the logs.txt file:
Vocab Size:1173
[Load Model Failed] [Errno 2] No such file or directory: ''
[Load Model Failed] [Errno 21] Is a directory: '.'
[Load MLC Failed [Errno 21] Is a directory: '.'!]
[Load Co-attention Failed [Errno 21] Is a directory: '.'!]
[Load Sentence model Failed [Errno 21] Is a directory: '.'!]
[Load Word model Failed [Errno 21] Is a directory: '.'!]
Namespace(attention_version='v4', batch_size=16, caption_json='./data/new_data/.......

I thought the program had stopped here because of the error messages. So can I just ignore them and keep training? Are there other places that need to be modified?
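The log shows `[Errno 2] No such file or directory: ''` (an empty path) and `[Errno 21] Is a directory: '.'`. In other words, the loader was handed an empty string or a directory instead of a checkpoint file. Below is a minimal sketch of guarding the load so that an empty `--load_model_path` simply means "train from scratch". The `resolve_checkpoint` helper is hypothetical and uses only `os.path`. How the project builds the paths it actually opens is not shown in this thread.

```python
import os


def resolve_checkpoint(path):
    """Return a checkpoint file path to load, or None to train from scratch.

    Hypothetical helper: an empty path means "no pretrained model";
    a directory or a missing file is reported instead of being opened.
    """
    if not path:
        return None  # nothing requested, so skip loading entirely
    if os.path.isdir(path):
        raise IsADirectoryError(f"expected a checkpoint file, got directory: {path!r}")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"checkpoint not found: {path!r}")
    return path
```

The caller would then load weights only when the result is not None, so the "[Load ... Failed]" messages never appear for a fresh run.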

fireholder commented 5 years ago

I found that it hasn't stopped; the progress just isn't printed.

Ike-yang commented 5 years ago

Yeah, I left it running all night, but val_loss is always 0 in logs.txt. Something must be wrong and needs to be modified.

fireholder commented 5 years ago

That's because '_epoch_val' sets all validation losses to 0; you can try uncommenting the code in '_epoch_val'. My train loss is very large, though. Is it the same for you? By the way, have you tried the tester?
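A validation pass that is not stubbed out runs each batch through the model without updating it and accumulates the same weighted loss terms as training. Below is a minimal, framework-free sketch. The `evaluate_batch` callable is a hypothetical stand-in for the model's forward pass and loss computation. The weight names mirror the `lambda_tag`/`lambda_stop`/`lambda_word` arguments quoted elsewhere in this thread. In PyTorch the loop would also run under `torch.no_grad()` with the model in `eval()` mode.

```python
def epoch_val(batches, evaluate_batch, lambda_tag=1.0, lambda_stop=1.0, lambda_word=1.0):
    """Accumulate weighted validation losses over all batches.

    evaluate_batch(batch) is a hypothetical callable returning
    (tag_loss, stop_loss, word_loss) as floats for one batch.
    """
    tag_loss = stop_loss = word_loss = 0.0
    for batch in batches:
        batch_tag, batch_stop, batch_word = evaluate_batch(batch)
        tag_loss += lambda_tag * batch_tag
        stop_loss += lambda_stop * batch_stop
        word_loss += lambda_word * batch_word
    total = tag_loss + stop_loss + word_loss
    return tag_loss, stop_loss, word_loss, total
```

With real per-batch losses plugged in, val_loss in the log would track the model instead of staying at 0.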

Ike-yang commented 5 years ago

Yes, extremely large train loss. Haven't tried the tester yet

Ike-yang commented 5 years ago

I have tried tester.py and it doesn't work: some places need to convert tensors with tensor.cpu(). Have you run tester.py all the way through?

fireholder commented 5 years ago

Yes, just call tensor.cpu() as the error message suggests.
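The `.cpu()` fix is needed because `Tensor.numpy()` only works on CPU tensors that do not require gradients. Below is a small helper. It is duck-typed, so it relies only on the standard `torch.Tensor` methods `detach()`, `cpu()` and `numpy()`. The name `to_numpy` is hypothetical.

```python
def to_numpy(value):
    """Convert a (possibly CUDA, possibly grad-tracking) tensor to a NumPy array.

    Anything without a .detach() method (e.g. an existing NumPy array or a
    plain Python number) is returned unchanged.
    """
    if hasattr(value, "detach"):
        # detach() drops the autograd graph, cpu() copies GPU memory to the
        # host, and numpy() returns an ndarray view of that CPU tensor.
        return value.detach().cpu().numpy()
    return value
```

Applying it wherever tester.py converts model outputs avoids scattering `.cpu()` calls across the file.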

fireholder commented 5 years ago

However, my test results are all the same: all my predicted captions are identical.
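One quick way to quantify this symptom is to count how many distinct captions the tester produced and how often the most common one appears. Below is a minimal standard-library sketch. It assumes `captions` is a list of predicted report strings, however tester.py happens to store them.

```python
from collections import Counter


def caption_diversity(captions):
    """Summarise how varied a list of predicted captions is.

    Returns (number of distinct captions, most common caption,
    fraction of predictions equal to that most common caption).
    """
    if not captions:
        return 0, None, 0.0
    counts = Counter(captions)
    top_caption, top_count = counts.most_common(1)[0]
    return len(counts), top_caption, top_count / len(captions)
```

A result like `(1, "...", 1.0)` confirms that every prediction collapsed onto a single string. Tracking this number across checkpoints shows whether a change to training helps.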


Cao-Shuang commented 5 years ago

I get the same caption too. Have you found the reason?


fireholder commented 5 years ago

not yet

ShivamPanchal commented 5 years ago

When I run `python tester.py`, I get:

FileNotFoundError: [Errno 2] No such file or directory: './data/new_data/debug_vocab.pkl'
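This error only says that the vocabulary pickle does not exist at the configured path, so the file presumably has to be created (or copied) there before tester.py can run. Below is a minimal sketch of loading a pickled vocabulary with a clearer message. The `load_vocab` name is hypothetical. The path is whatever your vocab argument points to.

```python
import os
import pickle


def load_vocab(path):
    """Load a pickled vocabulary, failing with an actionable message if absent."""
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"vocabulary file {path!r} not found; build it first or point "
            "the vocab argument at an existing .pkl file"
        )
    with open(path, "rb") as f:
        return pickle.load(f)
```

Note that pickle can only restore a custom vocabulary class if that class is importable under the same module path that was used when it was saved.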

CinKKKyo commented 4 years ago

Have you guys run into a problem like this?

WARNING:tensorflow:From /content/drive/Shared drives/shared drive-zma/ACL18/utils/logger.py:15: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.

Traceback (most recent call last):
File "/content/drive/Shared drives/shared drive-zma/ACL18/trainer.py", line 662, in <module>
debugger.train()
File "/content/drive/Shared drives/shared drive-zma/ACL18/trainer.py", line 60, in train
train_tag_loss, train_stop_loss, train_word_loss, train_loss = self._epoch_train() #???
File "/content/drive/Shared drives/shared drive-zma/ACL18/trainer.py", line 402, in _epoch_train
batch_tag_loss = self.mse_criterion(tags, self._to_var(label, requires_grad=False)).sum() # ???
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/loss.py", line 431, in forward
return F.mse_loss(input, target, reduction=self.reduction)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 2203, in mse_loss
expanded_input, expanded_target = torch.broadcast_tensors(input, target)
File "/usr/local/lib/python3.6/dist-packages/torch/functional.py", line 52, in broadcast_tensors
return torch._C._VariableFunctions.broadcast_tensors(tensors)

RuntimeError: The size of tensor a (210) must match the size of tensor b (0) at non-singleton dimension 1

It really confuses me. Could anyone help me out? Thanks!
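The message says the predicted tag tensor has 210 columns while the label tensor has 0, so the labels reaching `mse_criterion` are empty. One possible cause, not confirmed in this thread, is an empty or missing tag list when the labels are built. Below is a minimal standard-library sketch of building tag labels and checking their width before computing MSE. It assumes tags are encoded as a multi-hot vector over a tag vocabulary. The names `multi_hot` and `checked_mse` are hypothetical.

```python
def multi_hot(tags, tag_vocab):
    """Encode a list of tag strings as a 0/1 vector over tag_vocab (assumed encoding)."""
    index = {tag: i for i, tag in enumerate(tag_vocab)}
    vector = [0.0] * len(tag_vocab)
    for tag in tags:
        if tag in index:
            vector[index[tag]] = 1.0
    return vector


def checked_mse(pred, target):
    """Mean squared error that fails early and reports both widths."""
    if len(pred) != len(target):
        raise ValueError(
            f"prediction has {len(pred)} tags but label has {len(target)}; "
            "check that the tag list used to build the labels is loaded and non-empty"
        )
    if not pred:
        raise ValueError("both prediction and label are empty")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
```

With an empty tag vocabulary, `multi_hot` returns `[]`, which reproduces the "210 vs 0" mismatch in a form that names the likely culprit.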

mfilipav commented 4 years ago


Hi @fireholder! Did you eventually give up trying to solve the issue? Were all the predicted captions always identical?

yangyan22 commented 4 years ago

My train loss is also very large, and all my predicted captions are the same: "No acute cardiopulmonary abnormality". Could anyone help? Thanks! Could it be a Python 2 vs. Python 3 issue, since I used Python 3?

AnkitMalviya commented 4 years ago


Hi, were you able to decrease the loss? I am also facing the same issue.

AnkitMalviya commented 4 years ago


I am also facing the same issue. Were you able to solve it?

Alsalivan commented 4 years ago


I guess the train loss is large because the author uses MSELoss for predicting tags. If there are 156 different tags, the squared error is on the order of (156 - 0)^2 = 24336. That is why the loss is so big.

You can change it to L1Loss or decrease the lambda argument for the tag loss (if you find that reasonable).
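To make the arithmetic above concrete: squared error grows quadratically with the gap, absolute error grows linearly, and a weight such as `lambda_tag` scales whichever one is used. Below is a standard-library sketch that assumes a single prediction is off by 156. These are plain reimplementations, not `torch.nn.MSELoss`/`torch.nn.L1Loss` themselves, though they follow the same definitions as those losses with their default `reduction='mean'`.

```python
def mse(pred, target):
    """Mean of squared differences (MSELoss definition with reduction='mean')."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def l1(pred, target):
    """Mean of absolute differences (L1Loss definition with reduction='mean')."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


gap_pred, gap_target = [156.0], [0.0]
print(mse(gap_pred, gap_target))  # 24336.0
print(l1(gap_pred, gap_target))   # 156.0
```

Multiplying the MSE value by a smaller `lambda_tag` (e.g. 0.01) shrinks its contribution by the same factor. Switching to L1 changes how fast the loss grows with the error.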

Hareem1997 commented 4 years ago

In the debugger.py and tester.py files of this project, I'm getting an error on the third-to-last line of the following section of code:
`tag_loss += self.args.lambda_tag * batch_tag_loss.data
stop_loss += self.args.lambda_stop * batch_stop_loss.data
word_loss += self.args.lambda_word * batch_word_loss.data
loss += batch_loss.data
return tag_loss, stop_loss, word_loss, loss`

The error is:
File "D:/Hareem/Auto_report/debugger.py", line 61, in train
train_tag_loss, train_stop_loss, train_word_loss, train_loss = self._epoch_train()
File "D:/Hareem/Auto_report/debugger.py", line 424, in _epoch_train
word_loss += self.args.lambda_word * batch_word_loss.data
AttributeError: 'int' object has no attribute 'data'
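`AttributeError: 'int' object has no attribute 'data'` means `batch_word_loss` was still a plain Python number when that line ran. That would happen if it starts as `0` and the loop that adds tensor terms to it never executes for some batch, though this is an inference from the error and has not been checked against the project. Below is a minimal sketch of accumulating losses so that either case works. The names `to_float` and `accumulate` are hypothetical. The sketch relies on `Tensor.item()`, which returns a Python number from a one-element tensor.

```python
def to_float(loss):
    """Return a Python float from a one-element tensor or a plain number."""
    if hasattr(loss, "item"):
        return float(loss.item())  # torch.Tensor (and NumPy scalars) expose .item()
    return float(loss)  # an int/float placeholder such as an initial 0


def accumulate(batch_losses, weight=1.0):
    """Sum weighted per-batch losses, tolerating int placeholders."""
    total = 0.0
    for loss in batch_losses:
        total += weight * to_float(loss)
    return total
```

Using `to_float(batch_word_loss)` in place of `batch_word_loss.data` makes the line work whether or not any word loss was added for that batch.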

domyown commented 2 years ago

Has anybody solved the problem of all predicted captions being the same?

ZexinYan / Medical-Report-Generation

About _epoch_train and _epoch_val #7