MPMG-DCC-UFMG / M04

Mineração de Grafos como Suporte para a Construção de Trilhas
GNU General Public License v3.0

Generalize the module #36

Closed LeandroGripp closed 2 years ago

LeandroGripp commented 2 years ago

Data cleaning and filtering

This module is responsible for the data cleaning and filtering stage of the pipeline being developed by the M04 team of the PCA - MPMG program during 2021.

It executes the following tasks:

  1. Fetches the raw data from the configured Hadoop filesystem
  2. Cleans the data, excluding null, repeated and invalid values
  3. Filters the data by date or by maximum bidding value (other filters may be added in the future); steps 2 and 3 are sketched after this list
  4. Outputs the treated files to the defined output directory.
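
As an illustration of steps 2 and 3, here is a minimal pandas sketch of the cleaning logic. The `clean` helper and the notion of "invalid" used here are hypothetical, not the module's actual code; `vlr_licitacao` is taken from the column list shown later in the config example:

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning step: drop null, repeated and invalid rows."""
    df = df.dropna()            # exclude rows containing null values
    df = df.drop_duplicates()   # exclude repeated rows
    # "invalid" is illustrative here: e.g. keep only positive bidding values
    df = df[df["vlr_licitacao"] > 0]
    return df
```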

Requirements for running

Initial setup

When first installing the module, you must point its environment variables to the correct installation directories on the given machine. The following variables must be set:

HADOOP_HOME: Hadoop installation directory
HADOOP_CONF_DIR: Hadoop config directory
HDP_VERSION: Hadoop version
HADOOP_USER_NAME: Hadoop username
YARN_CONF_DIR: Yarn configuration directory
M04_PATH: Path to the root of the 2021 M04 pipeline
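
Before the first run, it may help to confirm that all of these are set. A minimal check, not part of the module itself:

```python
import os

REQUIRED_VARS = [
    "HADOOP_HOME", "HADOOP_CONF_DIR", "HDP_VERSION",
    "HADOOP_USER_NAME", "YARN_CONF_DIR", "M04_PATH",
]

missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
```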

Usage

Once properly configured with the aforementioned environment variables, the module behaves as described initially: it fetches the data from Hadoop, processes it and writes the result. The base Hadoop URL, the input and output filenames, and the output directory are specified in the config.json file, along with the desired filters; if a given filter is not to be applied, simply leave it as an empty string. The module is run by calling the main shell script, clean-filter-data.sh, with the config file as its only argument (e.g. `./clean-filter-data.sh config.json`). An annotated example follows; the // comments are explanatory only and must be removed from a real config.json, since JSON does not allow comments.

// config.json 
{
  "baseHadoopURL": "hdfs:hadoop-base-url-here",
  "outputDirectory": "path-to-output-directory-here",
  "startDate": "20/09/2020", // if not necessary, leave as an empty string. The format should be DD/MM/YYYY
  "endDate": "20/09/2021", // if not necessary, leave as an empty string. The format should be DD/MM/YYYY
  "maxValue": "10000000", // if not necessary, leave as an empty string
  "files": [ 
  // array that lists all of the files to be fetched and processed
  // each of the entries should have the following format
    {
      "fileName": "read-database-licitacao.csv", // Filename as is in the Hadoop Filesystem
      "outputFileName": "read-database-licitacao.csv", // Desired output filename, once treated
      "columns": ["id_licitacao", "nome_orgao_show", "sigla_uf", "nome_modalidade_show", "num_modalidade", "ano_referencia", "mes_referencia", "dsc_objeto_licitacao_show", "vlr_licitacao"], 
      // Array of columns to be kept. If all columns are to be kept, leave as an empty array
      "extraTreatments": ["maxValue", "startDate", "endDate"] 
      // Array of extra treatments that can be applied to the data. If no treatment is to be applied, leave it as an empty array
    }
  ]
}
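
One way such a file can be read, treating empty strings as disabled filters, is sketched below. `load_config` is a hypothetical helper, not the module's actual code:

```python
import json

def load_config(path: str) -> dict:
    """Illustrative loader: empty-string filters are treated as disabled."""
    with open(path, encoding="utf-8") as f:
        config = json.load(f)
    for key in ("startDate", "endDate", "maxValue"):
        if config.get(key) == "":
            config[key] = None  # empty string means: do not apply this filter
    return config
```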

Currently, the available "extraTreatments" are "startDate", "endDate" and "maxValue", matching the top-level filter fields of the configuration file.
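
A minimal sketch of how these treatments might be applied with pandas, assuming the data is keyed by the ano_referencia / mes_referencia columns shown above; the function and its logic are illustrative, not the module's actual implementation:

```python
import pandas as pd

def apply_treatments(df: pd.DataFrame, config: dict, treatments: list) -> pd.DataFrame:
    """Illustrative sketch of the extra treatments."""
    if "maxValue" in treatments and config.get("maxValue"):
        df = df[df["vlr_licitacao"] <= float(config["maxValue"])]
    if "startDate" in treatments and config.get("startDate"):
        _, month, year = (int(part) for part in config["startDate"].split("/"))
        # Assumes rows are keyed by reference year/month
        # (ano_referencia / mes_referencia); the day is ignored
        df = df[(df["ano_referencia"] > year)
                | ((df["ano_referencia"] == year) & (df["mes_referencia"] >= month))]
    # "endDate" would be handled symmetrically
    return df
```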

LeandroGripp commented 2 years ago

The module ended up split into 4 Python files and 1 main shell script. The shell script is the one that should be called, passing as its only argument the filepath of the configuration file to be used. This script, in turn, calls 2 of the Python scripts.

The output of these treatments is placed in the output directory, under the desired names, as specified in the configuration file.