cdqa-suite / cdQA

⛔ [NOT MAINTAINED] An End-To-End Closed Domain Question Answering System.
https://cdqa-suite.github.io/cdQA-website/
Apache License 2.0

MemoryError workaround #357

Open nortz8 opened 4 years ago

nortz8 commented 4 years ago

Please consider changing the _expand_paragraphs function in cdqa_sklearn.py to accommodate larger datasets. Repeatedly modifying the dataframe needs a lot of memory when the data is large, so it would be better to collect the rows as a list of dicts first and build the dataframe once at the end.

Below is the modification I did so I would not get a MemoryError:

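    @staticmethod
    def _expand_paragraphs(df):
        # Collect one dict per paragraph, then build the DataFrame once at the end.
        data = []
        for n in range(len(df)):
            title = df.iloc[n, 0]       # first column: document title
            paragraphs = df.iloc[n, 1]  # second column: list of paragraph strings
            for paragraph in paragraphs:
                data.append({'title': title, 'content': paragraph})
        return pd.DataFrame(data)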
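
For anyone who wants to try the idea outside the class, here is a minimal standalone sketch of the same list-of-dicts approach. It is only an illustration under a few assumptions: the function name `expand_paragraphs` and the `list_col` parameter are made up for this example and are not part of cdQA, and the input is assumed to have a list-valued column called `paragraphs`. Unlike the method above, it looks columns up by name instead of position and keeps any extra columns.

```python
import pandas as pd


def expand_paragraphs(df, list_col="paragraphs"):
    """Return one row per paragraph, copying the other columns of `df`.

    `list_col` (hypothetical parameter) names the column holding lists of strings.
    """
    other_cols = [col for col in df.columns if col != list_col]
    rows = []
    for idx in range(len(df)):
        # Values shared by every paragraph of this document (e.g. the title).
        base = {col: df[col].iloc[idx] for col in other_cols}
        for paragraph in df[list_col].iloc[idx]:
            rows.append({**base, "content": paragraph})
    # Build the DataFrame once, from the accumulated list of dicts.
    return pd.DataFrame(rows, columns=other_cols + ["content"])


if __name__ == "__main__":
    docs = pd.DataFrame(
        {"title": ["A", "B"], "paragraphs": [["a1", "a2"], ["b1"]]}
    )
    print(expand_paragraphs(docs))
```

On pandas 0.25 or later, `df.explode("paragraphs")` followed by a column rename may be another option worth trying; I have not compared its memory use against the loop.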

adjouama commented 4 years ago

Very good point, +1 @nortz8. However, your workaround did not work for me. I ended up with the following error: ValueError: empty vocabulary; perhaps the documents only contain stop words

Any idea why?