cdqa-suite / cdQA

⛔ [NOT MAINTAINED] An End-To-End Closed Domain Question Answering System.
https://cdqa-suite.github.io/cdQA-website/
Apache License 2.0

Unable to read the csv file using pandas after generating the csv manually #312

Closed SmritiSatyan closed 4 years ago

SmritiSatyan commented 4 years ago

Hello, I see that the dataframe in the format (title, paragraphs) can be generated either with the converters or manually. I scraped the inspectapedia data and stored it in a text file, then manually extracted the data into the [title, paragraph] columns of a CSV file. However, I am unable to read the CSV file and get the error below:

import pandas as pd
from ast import literal_eval
from cdqa.pipeline import QAPipeline

df = pd.read_csv('path-to-csv', converters={'paragraphs': literal_eval})

SyntaxError: invalid syntax

When I removed the literal_eval converter, I got the error below instead:

import pandas as pd
from ast import literal_eval
from cdqa.pipeline import QAPipeline

df = pd.read_csv('path to csv file')

cdqa_pipeline = QAPipeline(reader='distilbert_qa.joblib') # use 'distilbert_qa.joblib' for DistilBERT instead of BERT
cdqa_pipeline.fit_retriever(df=df)

ValueError: zero-dimensional arrays cannot be concatenated

Any help with this would be appreciated.
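A minimal sketch that reproduces both messages without cdQA, under stated assumptions: the CSV content below is made up, and it assumes the paragraphs cell holds plain text rather than a Python list literal. It also assumes the ValueError comes from NumPy concatenating string cells; that link to cdQA's internals is not confirmed in this thread.

```python
import io
from ast import literal_eval

import numpy as np
import pandas as pd

# Hypothetical hand-made CSV: the paragraphs cell is plain prose, not "['...', '...']".
raw_csv = (
    "title,paragraphs\n"
    '"Roof leaks","Check the flashing first. Then inspect the shingles."\n'
)

# Without a converter, pandas reads every paragraphs cell as a plain string.
df = pd.read_csv(io.StringIO(raw_csv))
cell = df.loc[0, "paragraphs"]

# Case 1: literal_eval only accepts Python literals, so plain prose fails to parse.
try:
    literal_eval(cell)
    parse_error = None
except SyntaxError as exc:
    parse_error = type(exc).__name__

# Case 2: concatenating strings (instead of lists of paragraphs) fails in NumPy,
# because each string is treated as a zero-dimensional array.
try:
    np.concatenate(df["paragraphs"].tolist())
    concat_error = None
except ValueError as exc:
    concat_error = str(exc)

print(type(cell).__name__, parse_error, concat_error)
```

If this matches what is in the file, the paragraphs column needs to hold a list literal per row (for the literal_eval converter) or be converted to lists after loading.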

aqsa27 commented 4 years ago

Facing the same error

SmritiSatyan commented 4 years ago

I was unable to get this to work, but I found a workaround. Instead of manually creating the CSV file, I stored my data in separate PDFs and ran the pdf_converter function on the directory that contains them. This generated a dataframe in which every row corresponds to a single PDF.

andrelmfarias commented 4 years ago

@SmritiSatyan ,

What did the structure of your dataframe look like when you removed the literal_eval function? Did it keep the list structure for the paragraphs column?

It should look like this:

paragraphs
[Paragraph 1 of Article, ... , Paragraph N of Article]
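For illustration, here is one way a CSV with that list structure could be produced and read back. This is a sketch with made-up article data, not the project's documented procedure; it relies on pandas writing each list as its Python repr, which literal_eval can then parse.

```python
import io
from ast import literal_eval

import pandas as pd

# Hypothetical articles: each paragraphs value is a list of paragraph strings.
articles = pd.DataFrame(
    {
        "title": ["Roof leaks", "Basement moisture"],
        "paragraphs": [
            ["Check the flashing first.", "Then inspect the shingles."],
            ["Look for efflorescence on the walls."],
        ],
    }
)

# Write to an in-memory buffer; a real file path works the same way.
buffer = io.StringIO()
articles.to_csv(buffer, index=False)
buffer.seek(0)

# The converter turns each stored list literal back into a real Python list.
restored = pd.read_csv(buffer, converters={"paragraphs": literal_eval})

print(type(restored.loc[0, "paragraphs"]).__name__)
```

A CSV edited by hand needs the same cell format: a quoted cell containing a bracketed, comma-separated list of quoted strings.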

SmritiSatyan commented 4 years ago

@andrelmfarias I prepared the CSV manually, and when I read it with df = pd.read_csv('path to csv') it produced a dataframe with 2 columns: title and paragraph. The structure of the 'paragraphs' column looked exactly like what you described. Yes, the list structure for paragraphs was maintained.
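One thing worth checking here: a cell read without a converter can print like a list while still being a string. The helper below is a hypothetical sketch (to_paragraph_list is not part of cdQA or pandas) that accepts either a list literal or plain text and always returns a list of paragraph strings. Splitting plain text on blank lines is an assumption about how the paragraphs are separated.

```python
from ast import literal_eval


def to_paragraph_list(cell):
    """Hypothetical helper: turn one CSV cell into a list of paragraph strings."""
    text = str(cell).strip()
    try:
        value = literal_eval(text)
    except (SyntaxError, ValueError):
        # Not a Python literal, so treat the cell as plain text.
        value = None
    if isinstance(value, (list, tuple)):
        return [str(paragraph) for paragraph in value]
    # Fallback: split plain text on blank lines and drop empty pieces.
    return [part.strip() for part in text.split("\n\n") if part.strip()]


from_literal = to_paragraph_list("['First paragraph.', 'Second paragraph.']")
from_text = to_paragraph_list("First paragraph.\n\nSecond paragraph.")
print(from_literal == from_text)
```

It could be passed to pandas as converters={'paragraphs': to_paragraph_list} in place of literal_eval, and df['paragraphs'].map(type) shows whether the cells ended up as lists or strings.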