achouhan93 / Data-Wrecker

Repository for Data-Wrecker-Framework Project

Paneesh: Analysis of Data Profiling tools and Data Preparation Tools #4

Closed achouhan93 closed 5 years ago

panishvp commented 5 years ago

Data Profiling features:

  1. Date formats
  2. String Length
  3. Mathematical calculations
  4. Null/empty checks
  5. Country codes
  6. Zip codes/postal codes
  7. Address formats
  8. Gender data formats (male/M)
  9. Duplication of data
  10. Timestamp
  11. Dealing with the different formats of input files
  12. Regex for certain data entries
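A few of the features listed above (null/empty checks, string length, duplication, regex checks) can be sketched with the Python standard library alone. This is only an illustration under my own assumptions: the function name `profile_column`, the keys of the returned dictionary, and the five-digit zip code pattern are hypothetical and are not taken from any of the tools analysed here.

```python
import re
from collections import Counter

def profile_column(values, pattern=None):
    """Compute a few basic profiling statistics for one column of raw strings."""
    # Treat None and whitespace-only strings as empty entries.
    non_empty = [v for v in values if v is not None and v.strip() != ""]
    lengths = [len(v) for v in non_empty]
    counts = Counter(non_empty)
    profile = {
        "total": len(values),
        "empty": len(values) - len(non_empty),
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
        # Number of entries that repeat a value already seen in the column.
        "duplicates": sum(c - 1 for c in counts.values()),
    }
    if pattern is not None:
        regex = re.compile(pattern)
        # Collect the non-empty entries that do not fully match the pattern.
        profile["pattern_mismatches"] = [
            v for v in non_empty if regex.fullmatch(v) is None
        ]
    return profile

# Hypothetical sample column of zip codes.
zip_codes = ["39106", "39104", "", "3910A", "39106", None]
report = profile_column(zip_codes, pattern=r"\d{5}")
print(report)
```

Date formats, country codes and address formats would need their own rules or reference data on top of this.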

Talend Tool Analysis

Talend is a tool used for data cleaning. It detects the data type of each column and reports which entries in that column are valid and which are invalid.

In the figure above, green indicates that the data entries are valid and yellow indicates that the entries are invalid.
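The valid/invalid split could be approximated roughly as follows. This is a sketch of one possible approach, not Talend's actual algorithm: I assume the column type is simply the most common type among the non-empty entries, and that every entry of a different type (including empty ones) is invalid. The names `detect_type` and `validate_column` and the single date format `%Y-%m-%d` are hypothetical.

```python
from collections import Counter
from datetime import datetime

def detect_type(value):
    """Return a coarse type label for one raw string entry."""
    text = value.strip()
    if text == "":
        return "empty"
    try:
        int(text)
        return "integer"
    except ValueError:
        pass
    try:
        float(text)
        return "float"
    except ValueError:
        pass
    try:
        # Assumption: only ISO-style dates are recognised in this sketch.
        datetime.strptime(text, "%Y-%m-%d")
        return "date"
    except ValueError:
        pass
    return "string"

def validate_column(values):
    """Guess the column type by majority vote and list the entries that deviate."""
    labels = [detect_type(v) for v in values]
    counted = Counter(label for label in labels if label != "empty")
    column_type = counted.most_common(1)[0][0] if counted else "empty"
    invalid = [v for v, label in zip(values, labels) if label != column_type]
    return column_type, invalid

# Hypothetical sample column.
ages = ["34", "27", "n/a", "41", ""]
column_type, invalid = validate_column(ages)
print(column_type, invalid)
```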

We can also observe that the tool not only detects the data type of the entries but also analyses which category the data belongs to.

This implies that the algorithm first reads a particular column and classifies its data type, and then, based on that data type, predicts what kind of data the column contains.

For example, for a column containing countries, the detected data type is String, and the string values are then classified as countries based on the libraries that the tool has.
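The second step (String column to category) might look like the sketch below. It assumes the column has already been detected as String and that the "libraries" are plain sets of known values. The reference sets, the function name `classify_semantic`, and the 0.8 threshold are my own hypothetical choices and say nothing about how Talend implements this internally.

```python
# Hypothetical reference data; a real tool would ship far larger dictionaries.
REFERENCE_SETS = {
    "country": {"germany", "india", "france", "brazil", "japan"},
    "gender": {"male", "female", "m", "f"},
}

def classify_semantic(values, threshold=0.8):
    """Return the category whose reference set covers at least `threshold`
    of the non-empty entries, or None if no category reaches that share."""
    cleaned = [v.strip().lower() for v in values if v.strip()]
    if not cleaned:
        return None
    best_category, best_share = None, 0.0
    for category, known in REFERENCE_SETS.items():
        share = sum(1 for v in cleaned if v in known) / len(cleaned)
        if share > best_share:
            best_category, best_share = category, share
    return best_category if best_share >= threshold else None

# Hypothetical sample column: five known countries and one unknown entry.
countries = ["Germany", "India", "France", "Brazil", "Japan", "Atlantis"]
print(classify_semantic(countries))
```

Using a threshold instead of requiring every entry to match lets the column still be recognised as a country column while the unknown entry can be flagged as invalid.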