BigDaMa / abstraction-layer

Apache License 2.0

Questions #2

Closed · FelixNeutatz closed this 5 years ago

FelixNeutatz commented 6 years ago

Hi,

I have two questions that I couldn't answer by reading the documentation:

1. Can I specify an output path where the result will be written? (Or where can I find the log file?)
2. Do the row_id and column_id of the result start at 0 or at 1?

Thank you for your help.

Best regards, Felix

m-mahdavi commented 6 years ago

Hi,

  1. The results are not written anywhere. When you call the method run_data_cleaning_job(run_input), it returns the list of detected cells, so you can write the results wherever you wish (see the sketch below).
  2. They start at 0. We always consider the first row and column to be 0. However, notice that the first row of each dataset is usually the header, so the first data tuple would be row 1.
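For example, persisting the returned cells yourself could look like this (a minimal sketch; the CSV output, file name, and run_input setup are illustrative, not part of the documented API):

import csv

# run_data_cleaning_job(run_input) returns the list of detected cells;
# writing them out is left to the caller. The output format here is
# just an example.
detected_cells = run_data_cleaning_job(run_input)

with open("detected_cells.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["row_id", "column_id", "value"])
    for cell in detected_cells:
        # row 0 is the header, so the first data tuple appears as row 1
        writer.writerow(cell)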

Kind regards, Mohammad

FelixNeutatz commented 6 years ago

Ok, thank you :)

In my opinion, it would be better to always call the first data row the 0th row, regardless of whether there is a header or not.
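For example, a thin wrapper on the caller's side could translate between the two conventions (a hypothetical helper, not part of the framework):

def to_data_row_index(reported_row_id):
    # The framework counts the header as row 0, so the first data
    # tuple is reported as row 1; shift it so data rows start at 0.
    return reported_row_id - 1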

FelixNeutatz commented 6 years ago

I just saw that you assume that the data has a header:

for row, column in detected_cells_list:
    i = int(row)
    j = int(column)
    v = None
    # record each cell only once, and skip the header row (i == 0)
    if (i, j) not in cell_visited_flag and i > 0:
        cell_visited_flag[(i, j)] = 1
        return_list.append([i, j, v])

You only add a cell if i > 0. Is this correct?

m-mahdavi commented 6 years ago

I think it is an implicit standard in data cleaning tools that the first row of the dataset is row 0, regardless of whether its content is a header or data. Some tools, such as dboost, may report row 0 as an error. However, we assume that the dataset always has a header, so we are not interested in detecting the attribute names as errors. That is why the condition i > 0 exists.
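For example, mapping a reported cell back onto a dataset loaded with pandas would look like this (a sketch under the header assumption above; the file name and indices are illustrative):

import pandas as pd

# The framework counts the header as row 0, while pandas consumes the
# header line, so a reported row_id corresponds to DataFrame position
# row_id - 1.
df = pd.read_csv("dataset.csv")
row_id, column_id = 5, 2  # e.g., a cell reported by a detection tool
value = df.iat[row_id - 1, column_id]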

Note that we make the following assumptions about the input dataset:

FelixNeutatz commented 6 years ago

Great answer :) It would be great if we could create a wiki page for this repository, where we could put these assumptions along with other tutorials. That would be super helpful.

m-mahdavi commented 6 years ago

Sure. Remember that this project is going to be extended into our data cleaning framework, which will contain all of our projects as different data cleaning services. Therefore, we all need to follow the same standards and use the same APIs for the basic tasks. Of course, we need to create a wiki and also discuss the different aspects we need to consider. The current project is just a beginning.

m-mahdavi commented 5 years ago

The documentation has been updated.