Questions

FelixNeutatz commented 6 years ago

Hi,

I have two questions that I couldn't answer by reading the documentation:
1) Can I specify an output path to which the result will be written? (Or where can I find the log file?)
2) Do the row_id and column_id of the result start at 0 or at 1?

Thank you for your help.

Best regards, Felix

m-mahdavi commented 6 years ago

Hi,

1) The results are not written anywhere. When you call the method run_data_cleaning_job(run_input), it returns the list of detected cells, so you can write the results wherever you wish.
2) They start at 0. We always consider the first row and the first column to be 0. However, note that the first row of each dataset is usually the header, so the first tuple of data would be row 1.

Kind regards, Mohammad
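
Since the returned list has to be persisted by the caller, a minimal sketch of doing so is shown here. It assumes each entry of the returned list starts with the row index and the column index, as the [i, j, v] entries quoted later in this thread suggest. The function name write_detected_cells and the output column names are hypothetical and not part of the project.

```python
import csv


def write_detected_cells(detected_cells, output_path):
    """Write the (row, column) indices of detected cells to a CSV file.

    Assumption: every entry is a sequence whose first two items are the
    row index and the column index. Any further items are ignored.
    """
    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f)
        # Hypothetical column names, chosen to match the wording of the question.
        writer.writerow(["row_id", "column_id"])
        for cell in detected_cells:
            writer.writerow([int(cell[0]), int(cell[1])])
```

With this helper, the list returned by run_data_cleaning_job(run_input) could be passed as detected_cells together with any path, for example "detected_cells.csv".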

FelixNeutatz commented 6 years ago

Ok, thank you :)

In my opinion, it would be better to always call the first data row the 0th row, no matter whether there is a header or not.

FelixNeutatz commented 6 years ago

I just saw that you assume that the data has a header:

for row, column in detected_cells_list:
    i = int(row)
    j = int(column)
    v = None
    if (i, j) not in cell_visited_flag and i > 0:
        cell_visited_flag[(i, j)] = 1
        return_list.append([i, j, v])

I conclude this because a cell is only added if i > 0. Is this correct?
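
To make the effect of the quoted check easier to try out, here is a self-contained sketch of the same logic. The wrapper function filter_detected_cells is hypothetical, and a set replaces the dictionary used for cell_visited_flag; the behavior is otherwise meant to mirror the snippet above: duplicates are dropped, and so is every cell in row 0.

```python
def filter_detected_cells(detected_cells_list):
    """Deduplicate detected cells and drop cells located in row 0.

    Row 0 is skipped on the assumption that it holds the header.
    """
    cell_visited_flag = set()  # (row, column) pairs that were already added
    return_list = []
    for row, column in detected_cells_list:
        i = int(row)
        j = int(column)
        v = None  # placeholder value, as in the quoted snippet
        if (i, j) not in cell_visited_flag and i > 0:
            cell_visited_flag.add((i, j))
            return_list.append([i, j, v])
    return return_list
```

For a dataset without a header, this check would silently discard any error detected in the first data row.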

m-mahdavi commented 6 years ago

I think it is an implicit standard in data cleaning tools that the first row of a dataset is row 0, regardless of whether it contains the header or data. Some tools, such as dboost, may report row 0 as an error. However, we assume that a dataset always has a header, so we are not interested in detecting attribute names as errors. That is why the condition i > 0 exists.

Note that we make the following assumptions about the input dataset:

The dataset is a relational table in comma-delimited CSV format.
The first line of the dataset is the header, and the remaining lines form the data matrix.
The field names in the header must be all lowercase and must not contain space characters.
The header is row 0, and the first row of the data matrix is tuple 1.
The data in the dataset should be in English.
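
Some of these assumptions can be checked before a dataset is handed to the tool. The following is only a sketch using Python's standard csv module. The helpers read_table and header_problems are hypothetical names and not part of the project. Because the header is kept as row 0, a reported cell (i, j) indexes directly into the parsed table.

```python
import csv
import io


def read_table(csv_text):
    """Parse comma-delimited CSV text into a list of rows; row 0 is the header."""
    return list(csv.reader(io.StringIO(csv_text)))


def header_problems(header):
    """Return a message for each field name that violates the header assumptions."""
    problems = []
    for name in header:
        if any(ch.isspace() for ch in name):
            problems.append(f"{name!r} contains whitespace")
        if name != name.lower():
            problems.append(f"{name!r} is not all lowercase")
    return problems


# Hypothetical example data, only to illustrate the indexing convention.
table = read_table("name,city\nalice,berlin\nbob,paris\n")
header = table[0]         # row 0: the header
first_tuple = table[1]    # row 1: the first tuple of the data matrix
```

A cell reported as (2, 1) would then refer to table[2][1], that is, the second data tuple and the second column.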

FelixNeutatz commented 6 years ago

Great answer :) It would be great if we could create a wiki page for this repository. There we could put these assumptions, among other tutorials. That would be super helpful.

m-mahdavi commented 6 years ago

Sure. Remember that this project is going to be extended into our data cleaning framework, which will contain all of our projects as different data cleaning services. Therefore, we all need to follow the same standards and use the same APIs for the basic tasks. Of course, we need to create a wiki and also discuss the different aspects that we need to consider. The current project is just a beginning.

m-mahdavi commented 5 years ago

The documentation has been updated.

BigDaMa / abstraction-layer

Questions #2