Closed FelixNeutatz closed 5 years ago
Hi,
Kind regards, Mohammad
Ok, thank you :)
In my opinion, it would be better to always call the first data row the 0th row, no matter whether there is a header or not.
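To make the proposed convention concrete, here is a minimal sketch of mapping a 0-based file row index to a 0-based data row index. The helper name `to_data_row_index` and its `has_header` parameter are hypothetical and not part of this repository.

```python
def to_data_row_index(file_row, has_header=True):
    """Map a 0-based file row index to a 0-based data row index.

    Hypothetical helper for illustration only. Returns None when
    file_row points at the header row itself.
    """
    if not has_header:
        # Without a header, file rows and data rows coincide.
        return file_row
    if file_row == 0:
        # Row 0 is the header, so it has no data row index.
        return None
    # With a header, the first data row is file row 1.
    return file_row - 1


first_with_header = to_data_row_index(1, has_header=True)
first_without_header = to_data_row_index(0, has_header=False)
```

Under this convention both calls yield the same index for the first data row, whether or not the file has a header.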
I just saw that you assume that the data has a header:
for row, column in detected_cells_list:
    i = int(row)
    j = int(column)
    v = None
    if (i, j) not in cell_visited_flag and i > 0:
        cell_visited_flag[(i, j)] = 1
        return_list.append([i, j, v])
Because you only add a cell if i > 0. Is this correct?
I think it is an implicit standard in data cleaning tools that the first row of a dataset is row 0, regardless of whether its content is a header or data. Some tools, such as dboost, may output row 0 as an error. However, we assume that the dataset always has a header, so we are not interested in detecting attribute names as errors. That is why the condition i > 0 exists.
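As a rough illustration of this convention, the quoted loop can be wrapped in a standalone function. This is a sketch only: the function name `filter_detected_cells` and the sample input are assumptions made for the example, not the project's actual API.

```python
def filter_detected_cells(detected_cells_list):
    """Keep each detected (row, column) cell once, skipping the header row.

    Row 0 is assumed to be the header, so only cells with row index > 0
    are kept. The function name is hypothetical.
    """
    cell_visited_flag = {}
    return_list = []
    for row, column in detected_cells_list:
        i = int(row)
        j = int(column)
        v = None  # value slot left empty, as in the quoted snippet
        if (i, j) not in cell_visited_flag and i > 0:
            cell_visited_flag[(i, j)] = 1
            return_list.append([i, j, v])
    return return_list


# Made-up input: one header cell, one duplicate, two distinct data cells.
detected = [("0", "1"), ("1", "0"), ("1", "0"), ("3", "2")]
filtered = filter_detected_cells(detected)
print(filtered)  # [[1, 0, None], [3, 2, None]]
```

The cell in row 0 is dropped because it belongs to the header, and the duplicate (1, 0) is reported only once.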
Note that we make the following assumptions about the input dataset:
Great answer :) It would be great if we could create a wiki page for this repository. There, we could put these assumptions alongside other tutorials. That would be super helpful.
Sure. Remember that this project is going to be extended into our data cleaning framework, which will contain all of our projects as different data cleaning services. Therefore, we all need to follow the same standards and use the same APIs for the basic tasks. Of course, we need to create a wiki and also discuss the different aspects that we need to consider. The current project is just a beginning.
The documentation is updated.
Hi,
I have two questions that I couldn't answer by reading the documentation: 1) Can I specify an output path where the result will be written? (Or where can I find the log file?) 2) Do the row_id and column_id of the result start at 0 or at 1?
Thank you for your help.
Best regards, Felix