Closed ssanjiv closed 6 years ago
Can you insert a print statement before that line to print both the value and type of guess_df.fold and fold?
type(fold): <class 'str'>
fold: guessdev
type(guess_df.fold): <class 'pandas.core.series.Series'>
guess_df.fold: Series([], Name: fold, dtype: float64)
It looks like the fold column of the guess dataframe is empty, so you need to trace back and figure out why it's empty, since that is unusual.
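For illustration, here is a minimal sketch of a guard that would surface this problem with a clearer message. The printed output shows an empty column with dtype float64, and comparing that against a str is what triggers the failure in the pandas version from this report. The helper name select_fold is hypothetical and is not part of qanta; it only assumes the dataframe has a column named fold.

```python
import pandas as pd


def select_fold(guess_df, fold):
    """Return the rows of guess_df for one fold, failing loudly if there are none.

    Hypothetical helper for debugging; not part of the qanta codebase.
    """
    if "fold" not in guess_df.columns:
        raise KeyError("guess dataframe has no 'fold' column")
    if guess_df.empty:
        # An empty column defaults to float64, so comparing it with a str
        # fold name is where the confusing comparison error comes from.
        raise ValueError(
            "guess dataframe is empty; no guesses were produced for fold %r" % fold
        )
    return guess_df[guess_df["fold"] == fold]
```

Calling select_fold(guess_df, fold) in place of the bare comparison would make the empty-dataframe case fail with a message that points at guess generation rather than at pandas.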
According to my previous logs, the guess dataframe did generate successfully, though. I'll look into this further and let you know if I can find any anomalies.
Here's the full stack trace. Is there anything pertinent in it?
DEBUG: Checking if AllSingleGuesserReports() is complete
DEBUG: Checking if GuesserReport(guesser_module=qanta.guesser.elasticsearch, guesser_class=ElasticSearchGuesser, dependency_module=qanta.pipeline.guesser, dependency_class=EmptyTask) is complete
INFO: Informed scheduler that task AllSingleGuesserReports__99914b932b has status PENDING
DEBUG: Checking if GenerateGuesses(guesser_module=qanta.guesser.elasticsearch, guesser_class=ElasticSearchGuesser, dependency_module=qanta.pipeline.guesser, dependency_class=EmptyTask, n_guesses=50, fold=guessdev) is complete
INFO: Informed scheduler that task GuesserReport_EmptyTask_qanta_pipeline_g_ElasticSearchGue_c4f6c539c0 has status PENDING
DEBUG: Checking if TrainGuesser(guesser_module=qanta.guesser.elasticsearch, guesser_class=ElasticSearchGuesser, dependency_module=qanta.pipeline.guesser, dependency_class=EmptyTask) is complete
INFO: Informed scheduler that task GenerateGuesses_EmptyTask_qanta_pipeline_g_guessdev_6c194102b1 has status PENDING
INFO: Informed scheduler that task TrainGuesser_EmptyTask_qanta_pipeline_g_ElasticSearchGue_c4f6c539c0 has status DONE
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 3
INFO: [pid 60357] Worker Worker(salt=080904910, workers=1, host=Shravan, username=ogg, pid=60357) running GenerateGuesses(guesser_module=qanta.guesser.elasticsearch, guesser_class=ElasticSearchGuesser, dependency_module=qanta.pipeline.guesser, dependency_class=EmptyTask, n_guesses=50, fold=guessdev)
2017-08-08 00:10:47,126 - qanta.pipeline.guesser - INFO - Generating and saving guesses for guessdev fold with word_skip=-1...
2017-08-08 00:10:51,051 - qanta.spark - INFO - Requested 15 cores when the machine only has 4 cores, reducing number of cores to 4
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/08/08 00:10:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/08/08 00:10:53 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
2017-08-08 00:11:05,691 - qanta.guesser.abstract - INFO - Creating guess dataframe from guesses...
2017-08-08 00:11:05,829 - qanta.pipeline.guesser - INFO - Guessing on guessdev fold took 18.702142000198364s, saving guesses...
2017-08-08 00:11:05,829 - qanta.guesser.abstract - INFO - Saving fold guessdev
type(fold): <class 'str'>
fold: guessdev
type(guess_df.fold): <class 'pandas.core.series.Series'>
guess_df.fold: Series([], Name: fold, dtype: float64)
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/ops.py:798: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
result = getattr(x, name)(y)
ERROR: [pid 60357] Worker Worker(salt=080904910, workers=1, host=Shravan, username=ogg, pid=60357) failed GenerateGuesses(guesser_module=qanta.guesser.elasticsearch, guesser_class=ElasticSearchGuesser, dependency_module=qanta.pipeline.guesser, dependency_class=EmptyTask, n_guesses=50, fold=guessdev)
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/luigi/worker.py", line 191, in run
new_deps = self._run_get_new_deps()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/luigi/worker.py", line 129, in _run_get_new_deps
task_gen = self.task.run()
File "/Users/ogg/Programming/Test Project/qb_old/qanta/pipeline/guesser/__init__.py", line 100, in run
guesser_class.save_guesses(guess_df, guesser_directory, [self.fold])
File "/Users/ogg/Programming/Test Project/qb_old/qanta/guesser/abstract.py", line 234, in save_guesses
fold_df = guess_df[guess_df.fold == fold]
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/ops.py", line 861, in wrapper
res = na_op(values, other)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/ops.py", line 800, in na_op
raise TypeError("invalid type comparison")
TypeError: invalid type comparison
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task GenerateGuesses_EmptyTask_qanta_pipeline_g_guessdev_6c194102b1 has status FAILED
DEBUG: Asking scheduler for work...
DEBUG: Done
DEBUG: There are no more tasks to run at this time
DEBUG: There are 3 pending tasks possibly being run by other workers
DEBUG: There are 3 pending tasks unique to this worker
DEBUG: There are 3 pending tasks last scheduled by this worker
INFO: Worker Worker(salt=080904910, workers=1, host=Shravan, username=ogg, pid=60357) was stopped. Shutting down Keep-Alive thread
INFO:
===== Luigi Execution Summary =====
Scheduled 4 tasks of which:
* 1 present dependencies were encountered:
- 1 TrainGuesser(guesser_module=qanta.guesser.elasticsearch, guesser_class=ElasticSearchGuesser, dependency_module=qanta.pipeline.guesser, dependency_class=EmptyTask)
* 1 failed:
- 1 GenerateGuesses(...)
* 2 were left pending, among these:
* 2 had failed dependencies:
- 1 AllSingleGuesserReports()
- 1 GuesserReport(guesser_module=qanta.guesser.elasticsearch, guesser_class=ElasticSearchGuesser, dependency_module=qanta.pipeline.guesser, dependency_class=EmptyTask)
This progress looks :( because there were failed tasks
===== Luigi Execution Summary =====
So now find where the guesses are passed from Spark to pandas and check whether they are empty there by inserting another print statement. Figuring out why that dataframe/series is empty might point to how to fix the issue.
Right, that was my train of thought as well, but I previously had Spark setup issues, so I posted the full trace to ask if there was anything odd about it. In other words, I think it could be my Spark configuration, but I'm not sure what the specific issue could be.
I verified manually that qanta's create_spark_context() method runs properly, but I'm unsure how to further verify other Spark-related functionality.
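One way to go a step beyond creating the context is to push a trivial job through it and check the result. The sketch below is only an illustration: spark_smoke_test is a hypothetical helper, and it assumes the object passed in behaves like a pyspark SparkContext (parallelize, map, and collect are standard RDD calls).

```python
def spark_smoke_test(sc, n=100):
    """Run a tiny map job through a SparkContext-like object and verify the result.

    Hypothetical helper; `sc` is assumed to expose parallelize(), and the
    returned RDD is assumed to expose map() and collect().
    """
    data = list(range(n))
    result = sc.parallelize(data).map(lambda x: x * 2).collect()
    # If the workers are misconfigured, the job usually raises or returns
    # something other than the doubled input.
    return sorted(result) == [x * 2 for x in data]
```

With qanta this might be called as spark_smoke_test(create_spark_context()), assuming create_spark_context can be called without arguments as in the manual check described above. A True result suggests that basic job execution works and the problem lies elsewhere.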
So the guesses are created from the guesser here: https://github.com/Pinafore/qb/blob/master/qanta/guesser/abstract.py#L180
If you look at that, it calls the abstract method AbstractGuesser.guess, which means you need to look at the implementation for that specific guesser. So the first step is to open a Python terminal and create an instance of the guesser, which you can do with something like guesser = ElasticSearchGuesser.load('output/guesser/.....rest of path here'). Then pass it questions to guess on and verify that it works as expected.
If that seems to work, then I would run the code block below (you need to change self to reference an instance of QuizBowlDataset and take care of the extra indent).
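# Replace self.qb_dataset() with your own QuizBowlDataset instance.
dataset = self.qb_dataset()
questions_by_fold = dataset.questions_by_fold()
max_n_guesses = 200
word_skip = -1  # value reported in the GenerateGuesses log line above
q_folds = []
q_qnums = []
q_sentences = []
q_tokens = []
question_texts = []
fold = 'guessdev'
questions = questions_by_fold[fold]
for q in questions:
    # Collect the text of every partial question in the fold.
    for sent, token, text_list in q.partials(word_skip=word_skip):
        text = ' '.join(text_list)
        question_texts.append(text)
# Guess on all collected texts at once and inspect the raw output.
print(guesser.guess(question_texts, max_n_guesses))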
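Printing the raw output for a whole fold can be hard to read, so a small helper that reports which inputs came back with no guesses may be easier to work with. This is only a sketch: find_empty_guesses is a hypothetical name, and it assumes guesser.guess returns one list of guesses per input text.

```python
def find_empty_guesses(question_texts, guesses):
    """Return the indices of questions for which no guesses were returned.

    Hypothetical helper; assumes `guesses` holds one list per entry in
    `question_texts`, in the same order.
    """
    if len(question_texts) != len(guesses):
        raise ValueError(
            "expected one guess list per question, got %d lists for %d questions"
            % (len(guesses), len(question_texts))
        )
    return [i for i, question_guesses in enumerate(guesses) if len(question_guesses) == 0]
```

For example, empty = find_empty_guesses(question_texts, guesser.guess(question_texts, max_n_guesses)) followed by print(len(empty), len(question_texts)) shows whether every question is failing or only some of them.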
I loaded the guesser and guessed on a few dev questions. They all returned empty guesses. Here's my output for one of them:
>>> guesser = ElasticSearchGuesser.load('output/guesser/qanta.guesser.elasticsearch.ElasticSearchGuesser')
>>> guesser.guess(["His early work is in a rather weak Neoclassical style, as seen in 1768's Agrippina Landing at Brundisium with the Ashes of Germanicus."], 10)
2017-08-09 20:42:51,146 - qanta.spark - INFO - Requested 15 cores when the machine only has 4 cores, reducing number of cores to 4
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/08/09 20:42:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/08/09 20:42:54 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
[[]]
So the next step is to look at https://github.com/Pinafore/qb/blob/master/qanta/guesser/elasticsearch.py#L115 to figure out whether it's a Spark thing or an Elasticsearch thing that is giving you no guesses.
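To check the Elasticsearch side on its own, one option is to ask the server how many documents each index holds, since an index with zero documents would return no guesses regardless of Spark. The sketch below uses only the standard library and the _cat/indices endpoint with format=json. The function names are hypothetical, and the default URL http://localhost:9200 is an assumption about the local setup.

```python
import json
from urllib.request import urlopen


def fetch_index_stats(base_url="http://localhost:9200"):
    """Fetch per-index stats from Elasticsearch as a list of dicts.

    The base URL is an assumption; adjust it to wherever the server runs.
    """
    with urlopen(base_url + "/_cat/indices?format=json", timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


def empty_indices(rows):
    """Return the names of indices that report zero documents.

    `rows` is the parsed JSON from _cat/indices, where "docs.count" is a
    string (or null for indices that cannot report a count).
    """
    return [row["index"] for row in rows if int(row.get("docs.count") or 0) == 0]
```

Running empty_indices(fetch_index_stats()) against the live server lists any indices with no documents. If the guesser's index shows up there, or is missing from the stats entirely, the indexing step is the more likely culprit than Spark.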
When generating the guesser report for elastic search, I get this error:
Before the error, here are some additional debug statements from the log: