Elastic search save_guesses

ssanjiv commented 7 years ago

When generating the guesser report for elastic search, I get this error:

File "/Users/ogg/Programming/Test Project/qb_old/qanta/guesser/abstract.py", line 230, in save_guesses
    fold_df = guess_df[guess_df.fold == fold]
TypeError: invalid type comparison

Before the error, here are some additional debug statements from the log:

2017-08-04 21:22:20,787 - qanta.guesser.abstract - INFO - Creating guess dataframe from guesses...
2017-08-04 21:22:20,938 - qanta.pipeline.guesser - INFO - Guessing on guessdev fold took 20.354990005493164s, saving guesses...
2017-08-04 21:22:20,938 - qanta.guesser.abstract - INFO - Saving fold guessdev

EntilZha commented 7 years ago

Can you insert a print statement before that line to print both the value and the type of guess_df.fold and fold?

ssanjiv commented 7 years ago

type(fold): <class 'str'>
fold: guessdev
type(guess_df.fold): <class 'pandas.core.series.Series'>
guess_df.fold: Series([], Name: fold, dtype: float64)

EntilZha commented 7 years ago

It looks like the fold column is empty, so you need to trace back and figure out why, since that is unusual.
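
A likely reason the comparison raises instead of just returning nothing: the column contains no strings, so pandas reports it as float64 (as in the output above), and the pandas version in this traceback refuses to compare that with a str. A minimal sketch of a guard that would fail with a clearer message, assuming it is called in place of the failing line; select_fold is a hypothetical helper, not part of qanta:

```python
import pandas as pd


def select_fold(guess_df: pd.DataFrame, fold: str) -> pd.DataFrame:
    """Return the rows for one fold, failing loudly if the dataframe has no rows."""
    if guess_df.empty:
        raise ValueError(f"guess dataframe is empty; nothing to save for fold {fold!r}")
    return guess_df[guess_df["fold"] == fold]


# An empty column with a numeric dtype, like the one printed in the log.
empty_df = pd.DataFrame({"fold": pd.Series([], dtype="float64")})
print(empty_df["fold"].dtype)
```

This only makes the failure easier to read. The real question is still why no guesses reached the dataframe.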

ssanjiv commented 7 years ago

According to my previous logs, the guess dataframe did generate successfully, though. I'll look into this further and let you know if I can find any anomalies.

ssanjiv commented 7 years ago

Here's the full stack trace. Is there anything pertinent in it?

DEBUG: Checking if AllSingleGuesserReports() is complete
DEBUG: Checking if GuesserReport(guesser_module=qanta.guesser.elasticsearch, guesser_class=ElasticSearchGuesser, dependency_module=qanta.pipeline.guesser, dependency_class=EmptyTask) is complete
INFO: Informed scheduler that task   AllSingleGuesserReports__99914b932b   has status   PENDING
DEBUG: Checking if GenerateGuesses(guesser_module=qanta.guesser.elasticsearch, guesser_class=ElasticSearchGuesser, dependency_module=qanta.pipeline.guesser, dependency_class=EmptyTask, n_guesses=50, fold=guessdev) is complete
INFO: Informed scheduler that task   GuesserReport_EmptyTask_qanta_pipeline_g_ElasticSearchGue_c4f6c539c0   has status   PENDING
DEBUG: Checking if TrainGuesser(guesser_module=qanta.guesser.elasticsearch, guesser_class=ElasticSearchGuesser, dependency_module=qanta.pipeline.guesser, dependency_class=EmptyTask) is complete
INFO: Informed scheduler that task   GenerateGuesses_EmptyTask_qanta_pipeline_g_guessdev_6c194102b1   has status   PENDING
INFO: Informed scheduler that task   TrainGuesser_EmptyTask_qanta_pipeline_g_ElasticSearchGue_c4f6c539c0   has status   DONE
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 3
INFO: [pid 60357] Worker Worker(salt=080904910, workers=1, host=Shravan, username=ogg, pid=60357) running   GenerateGuesses(guesser_module=qanta.guesser.elasticsearch, guesser_class=ElasticSearchGuesser, dependency_module=qanta.pipeline.guesser, dependency_class=EmptyTask, n_guesses=50, fold=guessdev)
2017-08-08 00:10:47,126 - qanta.pipeline.guesser - INFO - Generating and saving guesses for guessdev fold with word_skip=-1...
2017-08-08 00:10:51,051 - qanta.spark - INFO - Requested 15 cores when the machine only has 4 cores, reducing number of cores to 4
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/08/08 00:10:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/08/08 00:10:53 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
2017-08-08 00:11:05,691 - qanta.guesser.abstract - INFO - Creating guess dataframe from guesses...
2017-08-08 00:11:05,829 - qanta.pipeline.guesser - INFO - Guessing on guessdev fold took 18.702142000198364s, saving guesses...
2017-08-08 00:11:05,829 - qanta.guesser.abstract - INFO - Saving fold guessdev
type(fold): <class 'str'>
fold: guessdev
type(guess_df.fold): <class 'pandas.core.series.Series'>
guess_df.fold: Series([], Name: fold, dtype: float64)
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/ops.py:798: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  result = getattr(x, name)(y)
ERROR: [pid 60357] Worker Worker(salt=080904910, workers=1, host=Shravan, username=ogg, pid=60357) failed    GenerateGuesses(guesser_module=qanta.guesser.elasticsearch, guesser_class=ElasticSearchGuesser, dependency_module=qanta.pipeline.guesser, dependency_class=EmptyTask, n_guesses=50, fold=guessdev)
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/luigi/worker.py", line 191, in run
    new_deps = self._run_get_new_deps()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/luigi/worker.py", line 129, in _run_get_new_deps
    task_gen = self.task.run()
  File "/Users/ogg/Programming/Test Project/qb_old/qanta/pipeline/guesser/__init__.py", line 100, in run
    guesser_class.save_guesses(guess_df, guesser_directory, [self.fold])
  File "/Users/ogg/Programming/Test Project/qb_old/qanta/guesser/abstract.py", line 234, in save_guesses
    fold_df = guess_df[guess_df.fold == fold]
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/ops.py", line 861, in wrapper
    res = na_op(values, other)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/ops.py", line 800, in na_op
    raise TypeError("invalid type comparison")
TypeError: invalid type comparison
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task   GenerateGuesses_EmptyTask_qanta_pipeline_g_guessdev_6c194102b1   has status   FAILED
DEBUG: Asking scheduler for work...
DEBUG: Done
DEBUG: There are no more tasks to run at this time
DEBUG: There are 3 pending tasks possibly being run by other workers
DEBUG: There are 3 pending tasks unique to this worker
DEBUG: There are 3 pending tasks last scheduled by this worker
INFO: Worker Worker(salt=080904910, workers=1, host=Shravan, username=ogg, pid=60357) was stopped. Shutting down Keep-Alive thread
INFO:
===== Luigi Execution Summary =====

Scheduled 4 tasks of which:
* 1 present dependencies were encountered:
    - 1 TrainGuesser(guesser_module=qanta.guesser.elasticsearch, guesser_class=ElasticSearchGuesser, dependency_module=qanta.pipeline.guesser, dependency_class=EmptyTask)
* 1 failed:
    - 1 GenerateGuesses(...)
* 2 were left pending, among these:
    * 2 had failed dependencies:
        - 1 AllSingleGuesserReports()
        - 1 GuesserReport(guesser_module=qanta.guesser.elasticsearch, guesser_class=ElasticSearchGuesser, dependency_module=qanta.pipeline.guesser, dependency_class=EmptyTask)

This progress looks :( because there were failed tasks

===== Luigi Execution Summary =====

EntilZha commented 7 years ago

So now find where the guesses are passed from Spark to pandas and check whether they are already empty there by inserting another print statement. Figuring out why that dataframe/series is empty might point to how to fix the issue.
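
A small sketch of what that check could look like, assuming the guesser returns one list of guesses per question; summarize_guesses is a hypothetical helper and the sample data is made up:

```python
from typing import Sequence, Tuple


def summarize_guesses(guesses: Sequence[Sequence]) -> Tuple[int, int]:
    """Return (number of questions, number of questions with no guesses)."""
    n_empty = sum(1 for per_question in guesses if len(per_question) == 0)
    return len(guesses), n_empty


# Made-up example: two questions, the second one has no guesses.
total, empty = summarize_guesses([[("some_page", 1.5)], []])
print(f"{empty} of {total} questions have no guesses")
```

If every question comes back empty, the problem is upstream of pandas and the dataframe code is not to blame.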

ssanjiv commented 7 years ago

Right, that was my train of thought as well, but I previously had Spark setup issues, so I posted the full trace to ask if there was anything odd about it. In other words, I think it could be my Spark configuration, but I'm not sure what the specific issue could be.

I verified manually that qanta's create_spark_context() method runs properly, but I'm unsure how to verify other Spark-related functionality.

EntilZha commented 7 years ago

So the guesses are created from the guesser here: https://github.com/Pinafore/qb/blob/master/qanta/guesser/abstract.py#L180

If you look at that, it calls the abstract method AbstractGuesser.guess, which means you need to look at the implementation for that specific guesser. So the first step is to create an instance of the guesser in a Python terminal, which you can do with something like guesser = ElasticSearchGuesser.load('output/guesser/.....rest of path here'). Then pass it questions to guess on and verify that it works as expected.

If that seems to work, then I would run the code block below (you need to change self to reference an instance of QuizBowlDataset and take care of the extra indent).

        # Assumes `guesser` was loaded as described above.
        dataset = self.qb_dataset()
        questions_by_fold = dataset.questions_by_fold()
        max_n_guesses = 200
        word_skip = -1  # same value the pipeline logged for this run

        question_texts = []
        fold = 'guessdev'
        questions = questions_by_fold[fold]
        for q in questions:
            # Collect the text of every partial question in the fold.
            for sent, token, text_list in q.partials(word_skip=word_skip):
                text = ' '.join(text_list)
                question_texts.append(text)
        print(guesser.guess(question_texts, max_n_guesses))

ssanjiv commented 7 years ago

I loaded the guesser and guessed on a few dev questions. They all returned empty guesses. Here's my output for one of them:

>>> guesser = ElasticSearchGuesser.load('output/guesser/qanta.guesser.elasticsearch.ElasticSearchGuesser')
>>> guesser.guess(["His early work is in a rather weak Neoclassical style, as seen in 1768's Agrippina Landing at Brundisium with the Ashes of Germanicus."], 10)
2017-08-09 20:42:51,146 - qanta.spark - INFO - Requested 15 cores when the machine only has 4 cores, reducing number of cores to 4
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/08/09 20:42:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/08/09 20:42:54 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
[[]]

EntilZha commented 7 years ago

So the next step is to look at https://github.com/Pinafore/qb/blob/master/qanta/guesser/elasticsearch.py#L115 to figure out whether it's a Spark thing or an Elasticsearch thing that is giving you no guesses.
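
On the Elasticsearch side, one thing worth ruling out is an index with no documents in it, which would give empty results for every query. A rough sketch using only the standard library and Elasticsearch's _cat/indices REST endpoint; the localhost:9200 address is an assumption (the default port), and the helper names are hypothetical rather than anything in qanta:

```python
import json
from typing import Dict, List
from urllib.request import urlopen


def fetch_index_rows(base_url: str = "http://localhost:9200") -> List[Dict]:
    """Ask Elasticsearch for one JSON row per index (requires a running server)."""
    with urlopen(f"{base_url}/_cat/indices?format=json") as response:
        return json.loads(response.read().decode("utf-8"))


def empty_indices(rows: List[Dict]) -> List[str]:
    """Return the names of indices that report zero documents."""
    # docs.count arrives as a string; a missing or null value is treated as zero here.
    return [row["index"] for row in rows if int(row.get("docs.count") or 0) == 0]


# Offline example with made-up rows; with a live server use empty_indices(fetch_index_rows()).
sample_rows = [
    {"index": "example_a", "docs.count": "1200"},
    {"index": "example_b", "docs.count": "0"},
]
print(empty_indices(sample_rows))
```

If the index the guesser queries shows up as empty or is missing entirely, the problem is probably in how the index was built during training rather than in Spark.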

Elastic search save_guesses #65