Issues in the dataset - Githubissues

avinashsai commented 3 years ago

When I tried to load the train.csv, I observed these errors:

Rows 58466, 2355, 37523, and 67237 have the same text in the 'prompt' and 'utterance' columns. The real utterance text is in the wrong column.
Because of these misaligned fields, pandas always throws an error: these rows have a different number of columns from the rest.

I haven't checked valid.csv and test.csv. Please fix these in the files.

Thank you

EricMichaelSmith commented 3 years ago

Hi @avinashsai - sorry for the delay! The 'prompt' and 'utterance' columns can be the same if the Speaker simply gives the prompt as the first utterance of the conversation. Can you give me the pandas command that is failing when you try to load these files? I can try to reproduce it.

avinashsai commented 3 years ago

`import pandas as pd

data = pd.read_csv('train.csv')

Traceback (most recent call last):
  File "", line 1, in
  File "/home/user/anaconda2/envs/env/lib/python3.7/site-packages/pandas/io/parsers.py", line 686, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/user/anaconda2/envs/env/lib/python3.7/site-packages/pandas/io/parsers.py", line 458, in _read
    data = parser.read(nrows)
  File "/home/user/anaconda2/envs/env/lib/python3.7/site-packages/pandas/io/parsers.py", line 1186, in read
    ret = self._engine.read(nrows)
  File "/home/user/anaconda2/envs/env/lib/python3.7/site-packages/pandas/io/parsers.py", line 2145, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 826, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 841, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 897, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 884, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2021, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 8 fields in line 2355, saw 10`
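
For anyone trying to pin down which rows trip the parser, here is a minimal sketch, assuming the file is plain UTF-8 CSV. It uses Python's standard csv module, which does not require every row to have the same number of fields, to list rows whose width differs from the header. The name find_ragged_rows is made up for illustration, and the line numbers it reports may not match the ones pandas prints if a quoted field spans several lines.

```python
import csv


def find_ragged_rows(path, encoding="utf-8"):
    """Return (expected_width, [(line_number, field_count), ...]).

    Only rows whose field count differs from the header are listed.
    """
    ragged = []
    with open(path, newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        header = next(reader)
        expected = len(header)
        for row in reader:
            if len(row) != expected:
                # line_num counts the physical lines consumed so far
                ragged.append((reader.line_num, len(row)))
    return expected, ragged


if __name__ == "__main__":
    expected, ragged = find_ragged_rows("train.csv")
    print(f"header has {expected} fields; {len(ragged)} rows differ")
    for line_number, count in ragged[:20]:
        print(f"line {line_number}: {count} fields")
```

Inspecting the reported lines by hand should show whether the extra fields are stray commas, unescaped quotes, or something else.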

EricMichaelSmith commented 3 years ago

I've just looked into this - the way this repo loads that file is by reading it in as a text file and then processing it line by line, here: https://github.com/facebookresearch/EmpatheticDialogues/blob/master/empchat/datasets/empchat.py#L84 I'd try that instead of loading it as a pandas DataFrame directly.
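
The line-by-line approach can be sketched roughly as follows. This is not the code from empchat.py, just an illustration that assumes fields are separated by plain commas and that only the first eight fields (the ones named in the header) matter. read_rows and n_fields are hypothetical names, and silently dropping extra fields is a simplification chosen here, not something the repo is confirmed to do.

```python
def read_rows(path, n_fields=8, encoding="utf-8"):
    """Read a comma-separated file as text, one dict per data line.

    Assumption: extra fields beyond n_fields are dropped and short
    rows are padded with empty strings.
    """
    rows = []
    with open(path, encoding=encoding) as f:
        header = f.readline().rstrip("\n").split(",")[:n_fields]
        for line in f:
            parts = line.rstrip("\n").split(",")
            # pad rows that are too short so every dict has all keys
            parts += [""] * (n_fields - len(parts))
            rows.append(dict(zip(header, parts[:n_fields])))
    return rows


if __name__ == "__main__":
    rows = read_rows("train.csv")
    print(len(rows), "rows loaded")
```

If a DataFrame is still wanted, the resulting list of dicts can be passed to pd.DataFrame(rows) afterwards.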

wilmeragsgh commented 3 years ago

I worked around this by escaping the double quotes with sed -i 's/"/\\"/g' train.csv and then reading the file with: df = pd.read_csv("train.csv", sep=",", encoding='utf-8', engine="python", escapechar="\\")

amrta-coder commented 1 year ago

Although the issue has already been closed, I would like to propose another solution. If this solution has any hidden problems, please let me know.

df = pd.read_csv("./train.csv", usecols=['conv_id', 'utterance_idx', 'context', 'prompt', 'speaker_idx', 'utterance', 'selfeval', 'tags'])

However, if you look into train.csv, you will find the following problem: for hit:832_conv:1665, the utterance column contains text that should have been on the following lines.

[Screenshot: Snip20230407_9]

facebookresearch / EmpatheticDialogues

Issues in the dataset #40