cdqa-suite / cdQA

⛔ [NOT MAINTAINED] An End-To-End Closed Domain Question Answering System.
https://cdqa-suite.github.io/cdQA-website/
Apache License 2.0

Unable to run the fit_retriever in cdQA #326

Closed: SmritiSatyan closed this issue 4 years ago

SmritiSatyan commented 4 years ago

I have a CSV file with two columns, 'title' and 'paragraphs'. When I run cdqa_pipeline.fit_retriever(df), I get the error ValueError: zero-dimensional arrays cannot be concatenated, and the traceback points to this line: ).assign(**{lst_col: np.concatenate(df[lst_col].values)})[df.columns].

In the cdqa_sklearn.py file, I changed 'np.concatenate' to 'np.array' on that line, but then I get a different error: ValueError: Length of values does not match length of index. When I cross-checked, the index and the columns of my dataframe have the same length, and the dataframe has no empty columns or NaNs. Below is the stack trace:

`Traceback (most recent call last):

  File "<ipython-input-359-3424ca0c1bfa>", line 1, in <module>
    cdqa_pipeline.fit_retriever(df)

  File "C:\Users\smriti\AppData\Roaming\Python\Python37\site-packages\cdqa\pipeline\cdqa_sklearn.py", line 114, in fit_retriever
    self.metadata = self._expand_paragraphs(df)

  File "C:\Users\smriti\AppData\Roaming\Python\Python37\site-packages\cdqa\pipeline\cdqa_sklearn.py", line 237, in _expand_paragraphs
    ).assign(**{lst_col: np.array(df[lst_col].values)})[df.columns]

  File "C:\Users\smriti\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\frame.py", line 3649, in assign
    data[k] = com.apply_if_callable(v, data)

  File "C:\Users\smriti\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\frame.py", line 3467, in __setitem__
    self._set_item(key, value)

  File "C:\Users\smriti\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\frame.py", line 3544, in _set_item
    value = self._sanitize_column(key, value)

  File "C:\Users\smriti\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\frame.py", line 3729, in _sanitize_column
    value = sanitize_index(value, self.index, copy=False)

  File "C:\Users\smriti\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\internals\construction.py", line 612, in sanitize_index
    raise ValueError("Length of values does not match length of index")

ValueError: Length of values does not match length of index`

This happens when I call cdqa_pipeline.fit_retriever(df).

I read the CSV into a pandas dataframe with the following line: df = pd.read_csv('path to csv file'). Any help on this would be much appreciated.

andrelmfarias commented 4 years ago

Is your paragraphs column filled with lists of paragraphs for each row (like the structure indicated in the readme)?

If it's not, it might be what is causing the problem.

Also, be sure to load the dataframe using the literal_eval function (from the ast module) as the converter for the paragraphs column:

df = pd.read_csv('your-file.csv', converters={'paragraphs': literal_eval})
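
To illustrate why the converter matters, here is a minimal sketch with a made-up two-row dataset (the titles and paragraphs are hypothetical, not taken from the reporter's file). It assumes the CSV was produced by writing a dataframe whose 'paragraphs' cells are Python lists. Read back without a converter, each cell is a plain string, and np.concatenate raises a ValueError on it. Read back with literal_eval, each cell is a list again and the paragraphs can be flattened.

```python
import io
from ast import literal_eval

import numpy as np
import pandas as pd

# Hypothetical dataset: one title per row, a list of paragraph strings per row.
original = pd.DataFrame(
    {
        "title": ["Doc A", "Doc B"],
        "paragraphs": [
            ["First paragraph.", "Second paragraph."],
            ["Only paragraph."],
        ],
    }
)

# Writing to CSV stores each list as its text representation.
csv_text = original.to_csv(index=False)

# Without a converter, every 'paragraphs' cell comes back as a string.
raw = pd.read_csv(io.StringIO(csv_text))
first_raw = raw["paragraphs"].iloc[0]

# Concatenating strings fails: NumPy treats each string as a 0-d array.
try:
    np.concatenate(raw["paragraphs"].to_numpy())
    concat_error = None
except ValueError as exc:
    concat_error = str(exc)

# With literal_eval as converter, each cell is parsed back into a list,
# so the per-row lists can be flattened into one array of paragraphs.
parsed = pd.read_csv(
    io.StringIO(csv_text), converters={"paragraphs": literal_eval}
)
flat = np.concatenate(parsed["paragraphs"].to_numpy())

print(type(first_raw).__name__, "|", concat_error)
print(flat.tolist())
```

If the first printed type is str for your own file as well, the column was loaded as text and needs a converter before it is passed to fit_retriever.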

SmritiSatyan commented 4 years ago

Hello sir, thank you for the reply. Yes, each cell in my paragraphs column is a list that contains multiple paragraphs. When I try to load the dataframe with the literal_eval function, I get a "malformed string" error.

SmritiSatyan commented 4 years ago

Update: The literal_eval function does not work on my data, so I have decided to use the 'eval' function instead. The line of code that reads the CSV file into a dataframe now looks like this: df = pd.read_csv('your-file.csv', converters={'paragraphs': eval})
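
A note on this workaround: eval executes whatever expression a cell contains, so it is risky on CSV files from untrusted sources. A "malformed string" error from literal_eval usually means some cell is not a valid Python literal, for example plain text without brackets or a bare nan. One possible alternative, sketched below, is a tolerant converter built on literal_eval. The helper name parse_paragraphs and its fallback behaviour (treating an unparseable cell as a single paragraph) are assumptions made for illustration and are not part of cdQA.

```python
from ast import literal_eval


def parse_paragraphs(cell):
    """Hypothetical converter: turn one CSV cell into a list of paragraphs."""
    try:
        value = literal_eval(cell)
    except (ValueError, SyntaxError):
        # Not a valid Python literal: keep the whole cell as one paragraph.
        return [cell]
    # A parsed list or tuple becomes a list of strings.
    if isinstance(value, (list, tuple)):
        return [str(paragraph) for paragraph in value]
    # Any other literal (number, single string, ...) becomes one paragraph.
    return [str(value)]


print(parse_paragraphs("['First paragraph.', 'Second paragraph.']"))
print(parse_paragraphs("Plain text, no brackets"))
```

It could then be passed to pandas in place of eval, as in pd.read_csv('your-file.csv', converters={'paragraphs': parse_paragraphs}). Whether the single-paragraph fallback is appropriate depends on how the file was produced.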