antoine77340 / Youtube-8M-WILLOW

Kaggle Youtube 8M WILLOW approach
Apache License 2.0

Issues when test video/frame feature #8

Open feiyun1265 opened 6 years ago

feiyun1265 commented 6 years ago

Hi @antoine77340. I have downloaded the youtube-8m dataset and used the video/frame test folder to test your pretrained model, but I get an error when testing. The error output is as follows:

INFO:tensorflow:number of input files: 4096
INFO:tensorflow:loading meta-graph: pretrainedmodel/model.ckpt-310001.meta
INFO:tensorflow:restoring variables from pretrainedmodel/model.ckpt-310001
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.NotFoundError'>, ../YT8M/youtube-8m/features/validatelN.tfrecord
[[Node: train_input/ReaderReadV2_1 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/cpu:0"](train_input/TFRecordReaderV2_1, train_input/input_producer)]]

Caused by op u'train_input/ReaderReadV2_1', defined at:
  File "inference.py", line 203, in <module>
    app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "inference.py", line 199, in main
    FLAGS.output_file, FLAGS.batch_size, FLAGS.top_k)
  File "inference.py", line 128, in inference
    saver = tf.train.import_meta_graph(meta_graph_location, clear_devices=True)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1577, in import_meta_graph
    **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/meta_graph.py", line 498, in import_scoped_meta_graph
    producer_op_list=producer_op_list)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/importer.py", line 287, in import_graph_def
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2395, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1264, in __init__
    self._traceback = _extract_stack()

NotFoundError (see above for traceback): ../YT8M/youtube-8m/features/validatelN.tfrecord
[[Node: train_input/ReaderReadV2_1 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/cpu:0"](train_input/TFRecordReaderV2_1, train_input/input_producer)]]

In addition, I used the following command to start testing:

python inference.py --output_file=test_video_v1.csv --input_data_pattern="video_test/test*.tfrecord" --model=NetVLADModelLF --train_dir=pretrainedmodel --frame_features=false --batch_size=1024 --base_learning_rate=0.0002 --netvlad_cluster_size=256 --netvlad_hidden_size=1024 --moe_l2=1e-6 --iterations=300 --learning_rate_decay=0.8 --netvlad_relu=False --gating=True --moe_prob_gating=True --run_once=True --top_k=50

Looking forward to your reply, thank you!

antoine77340 commented 6 years ago

Hi, could you please try deleting the text file graph.pbtxt (in the pretrainedmodel folder)? Thank you.

feiyun1265 commented 6 years ago

Hi @antoine77340. I have deleted graph.pbtxt, but the problem is still there:

INFO:tensorflow:number of input files: 4096
INFO:tensorflow:loading meta-graph: pretrainedmodel/model.ckpt-310001.meta
INFO:tensorflow:restoring variables from pretrainedmodel/model.ckpt-310001
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.NotFoundError'>, ../YT8M/youtube-8m/features/validatevr.tfrecord
[[Node: train_input/ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/cpu:0"](train_input/TFRecordReaderV2, train_input/input_producer)]]

antoine77340 commented 6 years ago

Hmm, very strange. I don't understand why it tries to look in the ../YT8M/youtube-8m/features/ folder (this is where I store all my tfrecord files). I don't think this is the source of the error, but can you also try deleting all events.out.* files? What happens if you put all the validation and train tfrecord files in ../YT8M/youtube-8m/features/?

feiyun1265 commented 6 years ago

Hi @antoine77340. First I deleted all events.out.* files, then put the test tfrecord files in ../YT8M/youtube-8m/features/. But I get the following error:

NotFoundError (see above for traceback): ../YT8M/youtube-8m/features/trainMr.tfrecord
[[Node: train_input/ReaderReadV2_1 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/cpu:0"](train_input/TFRecordReaderV2_1, train_input/input_producer)]]

wincle commented 6 years ago

@feiyun1265 @antoine77340 I ran into this problem too and solved it. I'm still working on it, but I now have a complete function that takes a video as input and outputs a label. I can share it, though there are still some difficult problems: I think the results I get are not entirely right. It's strange, and I don't know why it happens.

```python
import csv
import os

import numpy
import tensorflow as tf


class Vinfer():
    def __init__(self, model_path='public/'):
        self.train_dir = model_path
        self.batch_size = 1024
        self.top_k = 5
        self.check_point = -1  # negative value: use the latest checkpoint

        self.vocabulary = self.load_vocabulary()
        self.sess = self.load_model()

    def load_vocabulary(self):
        # Read the label names and vertical columns from vocabulary.csv.
        vc = []
        with open(os.path.join(self.train_dir, 'vocabulary.csv')) as f:
            for row in csv.reader(f):
                single = {}
                single['Name'] = row[3]
                single['V1'] = row[5]
                single['V2'] = row[6]
                single['V3'] = row[7]
                vc.append(single)
        vc = vc[1:]  # drop the header row
        return vc

    def load_model(self):
        tf_config = tf.ConfigProto()
        tf_config.gpu_options.allow_growth = True
        sess = tf.Session(config=tf_config)
        latest_checkpoint = tf.train.latest_checkpoint(self.train_dir)
        if latest_checkpoint is None:
            raise Exception("unable to find a checkpoint at location: %s" % self.train_dir)
        else:
            if self.check_point < 0:
                meta_graph_location = latest_checkpoint + ".meta"
            else:
                meta_graph_location = self.train_dir + "/model.ckpt-" + str(self.check_point) + ".meta"
                latest_checkpoint = self.train_dir + "/model.ckpt-" + str(self.check_point)
        saver = tf.train.import_meta_graph(meta_graph_location, clear_devices=True)
        saver.restore(sess, latest_checkpoint)
        self.input_tensor = tf.get_collection("input_batch_raw")[0]
        self.num_frames_tensor = tf.get_collection("num_frames")[0]
        self.predictions_tensor = tf.get_collection("predictions")[0]

        def set_up_init_ops(variables):
            # Local variables that belong to the training input pipeline are
            # assigned 1 and taken out of the list; the rest are initialized normally.
            init_op_list = []
            for variable in list(variables):
                if "train_input" in variable.name:
                    init_op_list.append(tf.assign(variable, 1))
                    variables.remove(variable)
            init_op_list.append(tf.variables_initializer(variables))
            return init_op_list

        sess.run(set_up_init_ops(tf.get_collection_ref(tf.GraphKeys.LOCAL_VARIABLES)))
        return sess

    def format_res(self, predictions):
        batch_size = len(predictions)
        res = []
        for video_index in range(batch_size):
            # Indices of the top_k highest scores (unordered), sorted below.
            top_indices = numpy.argpartition(predictions[video_index], -self.top_k)[-self.top_k:]
            line = [[self.vocabulary[class_index], predictions[video_index][class_index]]
                    for class_index in top_indices]
            line = sorted(line, key=lambda p: -p[1])
            res.append(line)
        return res[0]  # only the first video of the batch is returned

    def inference(self, video_batch_val, num_frames_batch_val):
        predictions_val, = self.sess.run(
            [self.predictions_tensor],
            {self.input_tensor: video_batch_val,
             self.num_frames_tensor: num_frames_batch_val})

        return self.format_res(predictions_val)
```

antoine77340 commented 6 years ago

@feiyun1265: I mean you should try moving the validation AND training tfrecord files into this directory (not the test tfrecord files). Could you please try that? I am sorry for all of the problems; I am actually not really a TensorFlow expert :(.

feiyun1265 commented 6 years ago

Thanks for the analysis, @antoine77340 @wincle. I copied the validation and training tfrecord files to "../YT8M/youtube-8m/features/". The previous error is gone, but now I get the following error:

File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1096, in _run
    % (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (160, 1024) for Tensor u'train_input/shuffle_batch_join:1', which has shape '(?, 300, 1152)'

wincle commented 6 years ago

@feiyun1265 the shapes don't match; just expand your input to the expected shape.

wincle commented 6 years ago

@feiyun1265 And you should use VGGish to extract the audio features, concatenate them with the image features, and then send the result to the model.

feiyun1265 commented 6 years ago

I use the youtube-8m dataset. Does the dataset have audio features, @wincle? And how do I use VGGish? Can you send me a link to some documentation? Thanks.

wincle commented 6 years ago

https://github.com/tensorflow/models/tree/master/research/audioset If you are using youtube-8m, you don't need it. The image feature has 1024 dimensions and the audio feature has 128 dimensions; you should use them both.
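A minimal sketch of how the shapes in this exchange fit together, assuming per-frame image features of size 1024 and audio features of size 128 are concatenated to 1152 and the frame axis is zero-padded to 300, which matches the '(?, 300, 1152)' shape in the error above. The function name `build_model_input` and the choice of zero padding are assumptions for illustration, not code from this repository.

```python
import numpy as np

MAX_FRAMES = 300   # second dimension of the tensor shape reported in the error
RGB_DIM = 1024     # image feature size mentioned in this thread
AUDIO_DIM = 128    # audio feature size mentioned in this thread


def build_model_input(rgb, audio, max_frames=MAX_FRAMES):
    """Concatenate per-frame features and zero-pad the frame axis (hypothetical helper).

    rgb:   array of shape (num_frames, 1024)
    audio: array of shape (num_frames, 128)
    Returns a batch of shape (1, max_frames, 1152) and the frame counts of shape (1,).
    """
    rgb = np.asarray(rgb, dtype=np.float32)
    audio = np.asarray(audio, dtype=np.float32)
    if rgb.shape[0] != audio.shape[0]:
        raise ValueError("rgb and audio must have the same number of frames")
    # Join image and audio features per frame, truncating overly long videos.
    features = np.concatenate([rgb, audio], axis=1)[:max_frames]
    num_frames = features.shape[0]
    # Assumption: frames beyond the real ones are filled with zeros.
    padded = np.zeros((max_frames, features.shape[1]), dtype=np.float32)
    padded[:num_frames] = features
    return padded[np.newaxis], np.array([num_frames], dtype=np.int32)


video_batch, num_frames_batch = build_model_input(
    np.random.rand(160, RGB_DIM), np.random.rand(160, AUDIO_DIM))
print(video_batch.shape, num_frames_batch)  # (1, 300, 1152) [160]
```

The two returned arrays correspond to the video batch and the frame-count batch that the inference code feeds to the model.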

feiyun1265 commented 6 years ago

I will have a try, thank you very much. @wincle.

qingwa1990 commented 6 years ago

In model.ckpt-0.meta I got this picture: [image: _15175524282125]

jiangzidong commented 6 years ago

met the same problem

jiangzidong commented 6 years ago

I think it's a bug in the youtube8m inference code. It should either 1) create the model the way eval does, or 2) directly use the saved model.

SharoneDayan commented 6 years ago

Dear @antoine77340, thank you for releasing your code! Very interesting!! I encountered the same issue as @feiyun1265:

tensorflow.python.framework.errors_impl.NotFoundError: ../YT8M/youtube-8m/features/trainGh.tfrecord; No such file or directory

I do not want to download all the train and validation tfrecords, since I use your released version of gatednetvladLF. I run the inference on my local machine and do not have 1 TB of space to store the frame-level features. Do you have an idea of how I can proceed?

wincle commented 6 years ago

@SharoneDayan I think maybe you should try freezing the model.

speculaas commented 6 years ago

Dear All, I encountered the same issue : tensorflow.python.framework.errors_impl.NotFoundError: ../YT8M/youtube-8m/features/trainGh.tfrecord; No such file or directory

After tracing inference.py, I think the inference process stops in the try-except-finally block in "def inference".

More specifically, I think the exception happens at coord.should_stop().

Does anyone know for sure whether the exception happened in coord.should_stop()?

BR, JimmyYS

estathop commented 6 years ago

I also have the same problem when I try to execute the inference code. To be precise, the output csv file only has 1 or 2 videos with labels, then nothing. My error message follows:

INFO:tensorflow:num examples processed: 2 elapsed seconds: 2.13
Traceback (most recent call last):
  File "inference.py", line 203, in <module>
    app.run()
  File "/home/estathop/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "inference.py", line 199, in main
    FLAGS.output_file, FLAGS.batch_size, FLAGS.top_k)
  File "inference.py", line 172, in inference
    coord.join(threads)
  File "/home/estathop/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/home/estathop/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/training/queue_runner_impl.py", line 252, in _run
    enqueue_callable()
  File "/home/estathop/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1254, in _single_tensor_run
    results = self._call_tf_sessionrun(None, {}, fetch_list, [], None)
  File "/home/estathop/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: ../YT8M/youtube-8m/features/validatenp.tfrecord; No such file or directory
[[Node: train_input/ReaderReadV2_4 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/device:CPU:0"](train_input/TFRecordReaderV2_4, train_input/input_producer)]]

estathop commented 6 years ago

the question is why is it trying to look for tfrecords in that particular folder "/YT8M/youtube-8m/features/"? I can't trace where this happens so I can delete it

speculaas commented 6 years ago

Dear All, I tried removing the code snippet related to training, as shown in the following screenshot:

[image: remove_tf-train, screenshot from 2018-05-17 22-28-41]

The result is that the error no longer happens, but the program did not terminate, and the output file remained empty:

--output_file=test-gatednetvladLF-256k-1024-80-0002-300iter-norelu-basic-gatedmoe.csv

Not even the header written by

out_file.write("VideoId,LabelConfidencePairs\n")

appears in it.

I hope I or someone else can see why.

BR, JimmyYS

estathop commented 6 years ago

num_examples_processed is initialized to 0. The output of out_file.write("VideoId,LabelConfidencePairs\n") doesn't need to be printed; just check whether the CSV file contains those strings in two columns. You need the threads to perform the evaluations and the "while not" statement to iterate through every batch given. I suppose that by eliminating the "while not" statement and coord.request_stop() you end up doing no calculations at all.

Maybe tf.GraphKeys.LOCAL_VARIABLES is doing the damage. I read in the documentation that this is about objects that are local to each machine. Maybe this is where the paths are saved?

What if we used tf.GraphKeys.MODEL_VARIABLES instead? What would that do?

speculaas commented 6 years ago

Dear Estathop, embarrassingly, I don't understand TensorFlow well enough and need to study more. For now, after adding some "out_file.flush()" calls, I get classification results in the specified output file. I tested this YouTube video: https://www.youtube.com/watch?v=3VUiz10w-aw and the result is:

$ cat prediction.2.csv
VideoId,LabelConfidencePairs
Hokuriku_E7_shinkansen.mkv,1 0.589722 62 0.283732 4 0.269821 11 0.041280 155 0.036597 10 0.020198 23 0.013597 121 0.012935 72 0.012433 12 0.009783 29 0.005635 17 0.004996 64 0.004927 50 0.004772 103 0.004416 20 0.004155 6 0.003413 633 0.003395 3764 0.002889 0 0.002232 1117 0.002042 28 0.002025 263 0.001938 152 0.001888 9 0.001886 68 0.001816 1348 0.001554 248 0.001477 82 0.001275 139 0.001250 330 0.001085 74 0.001053 337 0.000995 47 0.000988 92 0.000988 162 0.000959 2 0.000935 27 0.000923 70 0.000892 8 0.000811 25 0.000794 154 0.000785 286 0.000763 101 0.000738 69 0.000712 126 0.000711 148 0.000704 67 0.000664 234 0.000655 176 0.000643

where, looking the labels up in https://research.google.com/youtube8m/csv/vocabulary.csv:

1 0.589722 (Vehicle)
62 0.283732 (Train)
4 0.269821 (Car)
11 0.041280 (Motorsport)
155 0.036597 (Camera)
10 0.020198 (Animal)
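The lookup described above can be sketched in a few lines. This assumes the output format shown in the comment (a video id, a comma, then alternating label index and confidence separated by spaces) and that the video id itself contains no comma. The function name and the hard-coded `LABEL_NAMES` dictionary are illustrative only; a full mapping would be read from vocabulary.csv.

```python
def parse_label_confidence_pairs(csv_line):
    """Split 'video_id,label conf label conf ...' into (video_id, [(label, conf), ...])."""
    video_id, pairs = csv_line.strip().split(",", 1)
    tokens = pairs.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating label/confidence tokens")
    return video_id, [(int(tokens[i]), float(tokens[i + 1]))
                      for i in range(0, len(tokens), 2)]


# Label names quoted in the comment above (hypothetical stand-in for vocabulary.csv).
LABEL_NAMES = {1: "Vehicle", 62: "Train", 4: "Car",
               11: "Motorsport", 155: "Camera", 10: "Animal"}

line = "Hokuriku_E7_shinkansen.mkv,1 0.589722 62 0.283732 4 0.269821 11 0.041280"
video_id, predictions = parse_label_confidence_pairs(line)
for label, confidence in predictions:
    # Unknown indices are shown as "?" instead of raising a KeyError.
    print(label, LABEL_NAMES.get(label, "?"), confidence)
```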

However, even though my output csv now has results, the exception still happens as before:

INFO:tensorflow:restoring variables from /public/model.ckpt-310001
INFO:tensorflow:Restoring parameters from /public/model.ckpt-310001
2018-05-18 14:09:11.212298: W tensorflow/core/framework/allocator.cc:101] Allocation of 1140850688 exceeds 10% of system memory.
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.NotFoundError'>, ../YT8M/youtube-8m/features/trainkj.tfrecord; No such file or directory
[[Node: train_input/ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/device:CPU:0"](train_input/TFRecordReaderV2, train_input/input_producer)]]
2018-05-18 14:09:13.653464: W tensorflow/core/framework/allocator.cc:101] Allocation of 1415577600 exceeds 10% of system memory.
INFO:tensorflow:num examples processed: 1 elapsed seconds: 1.45
Traceback (most recent call last):
  File "inference_no_train.py", line 208, in <module>
    app.run()
  local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: ../YT8M/youtube-8m/features/trainkj.tfrecord; No such file or directory
[[Node: train_input/ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/device:CPU:0"](train_input/TFRecordReaderV2, train_input/input_producer)]]

BR, JimmyYS

estathop commented 6 years ago

No worries, I started using tensorflow recently also, I am not an expert.

The problem is in the "try:" block. I managed to retrieve the same tfrecord indefinitely; the program didn't crash, it just continued to run forever. The problem occurs when trying to parse the second tfrecord file. The root of the error is in the second iteration of the "while not coord.should_stop():" block, specifically at video_id_batch_val, video_batch_val,num_frames_batch_val = sess.run([video_id_batch, video_batch, num_frames_batch]), but I don't understand why. I tried to print video_id_batch, for example with its .eval() method since it is a tensor object, but nothing happened and the command prompt crashed.

puneetiitian commented 6 years ago

Hi All,

I am also getting the similar error when I try to run inference.py over pretrained model released by Antoine. It says: tensorflow.python.framework.errors_impl.NotFoundError: ../YT8M/youtube-8m/features/train-C.tfrecord; No such file or directory

Has anyone solved this problem yet? Thanks in advance

suhmily commented 6 years ago

Hi all, I met the same problem. Beg for solutions, please.

CC10010 commented 6 years ago

@antoine77340 thanks for your work. When I use the pretrained model, this error occurs:

tensorflow.python.framework.errors_impl.NotFoundError: ../YT8M/youtube-8m/features/trainGh.tfrecord; No such file or directory

As far as I know, in your workspace there is a file named /YT8M/youtube-8m/features/trainGh.tfrecord. Please upload the file so that inference.py can run normally.

wenching33 commented 6 years ago

Dears, I met the same problem when using the pretrained model. Has anyone tried training their own model without the problem mentioned above occurring? I am hesitant about training the model myself. Thanks in advance.

wenching33 commented 6 years ago

Dears, Just share some information. When I tried to train my own model using scripts given on this git (https://github.com/antoine77340/Youtube-8M-WILLOW), there were error messages like "InvalidArgumentError: Name: , Context feature 'video_id' is required but could not be found". I changed readers.py to solve the problem.

Then I used my own trained model to run inference successfully (XXX.csv is produced and prediction results are printed). I found a graph.pbtxt in the directory of my trained model, which also appears in the released pretrained model (if you download the pretrained WILLOW model and extract its contents, you can find it). Inside graph.pbtxt there is a node named "train_input/input_producer/Const" that includes an attribute with key: "value". There I found string_vals like

string_val: "/dataset/SP/Phil/frame/train/train0111.tfrecord"
string_val: "/dataset/SP/Phil/frame/train/train0580.tfrecord"

These are where I put my training data, which means the trained model is tied to the training data paths. So I guess the released pretrained model can only be used successfully with the training data the author used. That is not reasonable, because those .tfrecord files are so big. Maybe one should modify the graph so that the node "train_input/input_producer/Const" does not contain training-data-related information (I'm not sure if that is feasible), or just release a frozen .pb model.
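To check which data paths a given graph.pbtxt carries, a plain text scan may be enough. The sketch below assumes the paths appear as string_val: "..." lines ending in .tfrecord, as quoted above. The sample text is a simplified, hypothetical excerpt and not the real file; for an actual check, the contents of graph.pbtxt would be read from disk and passed to the same function.

```python
import re

# Matches lines such as: string_val: "/some/dir/train0111.tfrecord"
TFRECORD_STRING_VAL = re.compile(r'string_val:\s*"([^"]*\.tfrecord)"')


def find_tfrecord_paths(pbtxt_text):
    """Return every .tfrecord path that appears as a string_val in the text."""
    return TFRECORD_STRING_VAL.findall(pbtxt_text)


# Simplified, hypothetical excerpt in the style of the node described above.
sample = '''
node {
  name: "train_input/input_producer/Const"
  attr {
    key: "value"
    value {
      tensor {
        string_val: "/dataset/SP/Phil/frame/train/train0111.tfrecord"
        string_val: "/dataset/SP/Phil/frame/train/train0580.tfrecord"
      }
    }
  }
}
'''

paths = find_tfrecord_paths(sample)
for path in paths:
    print(path)
```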

chendengshuai commented 6 years ago

Hi wenching33, how did you solve the problem "InvalidArgumentError: Name: , Context feature 'video_id' is required but could not be found"? Can you give some details? Thanks in advance!

estathop commented 6 years ago

Just don't mix the dataset, code, etc. of 2017 with those of 2018, because some class names changed due to GDPR issues.

wenching33 commented 5 years ago

@chendengshuai I remember that I modified the code in readers.py. You can find "video_id" in readers.py; try changing it to "id".

orprager commented 5 years ago

@antoine77340 thank you for publishing it. I believe I might have found a solution for the error mentioned here.

Both the process of training (train.py) and the process of evaluation (inference.py) use a queue containing runners that are used to produce and shuffle inputs. Those runners are part of the graph, and are stored in checkpoints. This means that when the model is restored, they are again included in the graph. If the training process is stopped after reaching a satisfying accuracy level, those runners are interrupted, which causes them to search for train set files when the model is restored. However, if we wish to restore the model and use it only for evaluation (inference.py), there is no actual need in the train set files.

To be able to run the process without the files, in inference.py, in the inference function, before the line "threads = tf.train.start_queue_runners(sess=sess, coord=coord)", I added the following:

runners_to_remove = [q_runner for q_runner in sess.graph.get_collection_ref("queue_runners")
                     if q_runner.name == "train_input/input_producer"
                     or q_runner.name == "train_input/shuffle_batch_join/random_shuffle_queue"]
for runner in runners_to_remove:
    sess.graph.get_collection_ref("queue_runners").remove(runner)

This removes the runners that were saved during the training process, but are no longer needed for evaluation from a pre-trained model. By adding those lines we can run inference.py for evaluation without the need of train set files.
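As a variation on the same idea, the filtering can be written as a small helper that removes every runner under a name prefix instead of listing names one by one. This is only a sketch. The prefix match is an assumption that goes beyond the two names given above, and `FakeRunner` and the name "input/input_producer" are hypothetical stand-ins so the logic can be run without TensorFlow. In inference.py the list would be the one returned by sess.graph.get_collection_ref("queue_runners").

```python
from collections import namedtuple


def remove_runners_by_prefix(runners, prefix="train_input/"):
    """Remove, in place, every runner whose name starts with prefix.

    The list is mutated rather than rebuilt, because a collection reference
    must keep pointing at the same list object for the change to take effect.
    Returns the removed runners.
    """
    removed = [runner for runner in runners if runner.name.startswith(prefix)]
    for runner in removed:
        runners.remove(runner)
    return removed


# Hypothetical stand-ins for queue runner objects, which expose a .name attribute.
FakeRunner = namedtuple("FakeRunner", ["name"])
runners = [
    FakeRunner("train_input/input_producer"),
    FakeRunner("train_input/shuffle_batch_join/random_shuffle_queue"),
    FakeRunner("input/input_producer"),  # hypothetical inference-time runner
]

removed = remove_runners_by_prefix(runners)
print([runner.name for runner in runners])  # ['input/input_producer']
```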

Hope that helps.