Closed by LeandroGripp 2 years ago
The module ended up split into 4 Python files and 1 main shell script. The shell script is the entry point and takes a single argument: the path to the configuration file to use. It calls 2 Python scripts:
data-fetching.py, which fetches the files from Hadoop according to the configuration file and places them in a temporary directory.
data-processing.py, which, for each file to be processed, instantiates the DataCleaning class (defined in DataCleaning.py) and the DataFiltering class (defined in DataFiltering.py) and runs the appropriate treatments for that file, again according to the configuration file.
The output of these treatments is written to the output directory under the desired names, as specified in the configuration file.
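To make the per-file flow concrete, here is a minimal sketch of what a processing loop of this kind could look like. It is not the code in data-processing.py: the two classes are stand-ins for those defined in DataCleaning.py and DataFiltering.py, their `run` methods are hypothetical, the files are assumed to be CSV, and the config keys `files`, `input` and `output` are assumptions made for illustration.

```python
import csv
from pathlib import Path


class DataCleaning:
    """Stand-in for the class defined in DataCleaning.py (interface assumed)."""

    def __init__(self, rows):
        self.rows = rows

    def run(self):
        # Hypothetical cleaning step: trim surrounding whitespace in string fields.
        return [
            {key: value.strip() if isinstance(value, str) else value
             for key, value in row.items()}
            for row in self.rows
        ]


class DataFiltering:
    """Stand-in for the class defined in DataFiltering.py (interface assumed)."""

    def __init__(self, rows, treatments):
        self.rows = rows
        self.treatments = treatments

    def run(self):
        # Only "maxValue" is sketched here; an empty string disables it.
        max_value = self.treatments.get("maxValue", "")
        if max_value == "":
            return self.rows
        limit = float(max_value)
        # The "value" column name is an assumption.
        return [row for row in self.rows if float(row["value"]) <= limit]


def process_files(config, temp_dir, output_dir):
    """Clean and filter every configured file, then write it under its output name."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for entry in config["files"]:  # "files", "input" and "output" are assumed keys
        source_path = Path(temp_dir) / entry["input"]
        with open(source_path, newline="", encoding="utf-8") as source:
            reader = csv.DictReader(source)
            fieldnames = reader.fieldnames or []
            rows = list(reader)
        rows = DataCleaning(rows).run()
        rows = DataFiltering(rows, entry.get("extraTreatments", {})).run()
        target = output_dir / entry["output"]
        with open(target, "w", newline="", encoding="utf-8") as sink:
            writer = csv.DictWriter(sink, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        written.append(target)
    return written
```

Keeping the cleaning and filtering steps in separate classes, as the module does, lets each file receive only the treatments its configuration entry asks for.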
Data cleaning and filtering
This module handles the data cleaning and filtering stage of the pipeline developed by the M04 team of the PCA - MPMG program in 2021.
It executes the following tasks:
Requirements for running
Initial setup
When first installing the module, its environment variables must point to the correct installation directories on the target machine. The following variables must be set:
Usage
Once the environment variables above are configured, the module behaves as described initially: it fetches the data from Hadoop, processes it and writes the output. The base Hadoop URL, the input and output filenames, the output directory and the desired filters are all specified in the config.json file. To skip a given filter, leave its value as an empty string. The module is run by calling the main shell script, clean-filter-data.sh, passing the config file as its only argument.
Currently, the "extraTreatments" available are:
"maxValue": drops any bidding that exceeds the stipulated maximum value.
"startDate": drops any bidding that happens before the stipulated start date.
"endDate": drops any bidding that happens after the stipulated end date.
"validate1CNPJ": drops any bidding participation from a CNPJ that is deemed invalid.
"validate2CNPJs": drops any bond between CNPJs in which at least one is deemed invalid.
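The treatments listed above can be sketched as follows. This is an illustration under stated assumptions, not the module's DataFiltering code: the record fields `value`, `date` and `cnpj` are hypothetical, dates in the config are assumed to be ISO-formatted (`YYYY-MM-DD`), any non-empty string is assumed to enable a validation treatment, and "deemed invalid" is interpreted here as failing the standard mod-11 check digits of a numeric 14-digit CNPJ, which may differ from the rule the module actually applies.

```python
from datetime import date


def cnpj_is_valid(cnpj):
    """Check a numeric CNPJ against its two mod-11 check digits."""
    digits = [int(ch) for ch in cnpj if "0" <= ch <= "9"]
    # Reject wrong lengths and same-digit sequences such as 00000000000000.
    if len(digits) != 14 or len(set(digits)) == 1:
        return False

    def check_digit(base):
        # Weights run 2..9 from right to left, restarting at 2 after 9.
        weights = [6, 5, 4, 3, 2, 9, 8, 7, 6, 5, 4, 3, 2][-len(base):]
        remainder = sum(d * w for d, w in zip(base, weights)) % 11
        return 0 if remainder < 2 else 11 - remainder

    return (check_digit(digits[:12]) == digits[12]
            and check_digit(digits[:13]) == digits[13])


def apply_extra_treatments(biddings, treatments):
    """Drop biddings that fail any enabled treatment; "" disables a treatment."""
    kept = list(biddings)
    if treatments.get("maxValue", "") != "":
        limit = float(treatments["maxValue"])
        kept = [b for b in kept if b["value"] <= limit]
    if treatments.get("startDate", "") != "":
        start = date.fromisoformat(treatments["startDate"])  # ISO format assumed
        kept = [b for b in kept if b["date"] >= start]
    if treatments.get("endDate", "") != "":
        end = date.fromisoformat(treatments["endDate"])  # ISO format assumed
        kept = [b for b in kept if b["date"] <= end]
    if treatments.get("validate1CNPJ", "") != "":
        kept = [b for b in kept if cnpj_is_valid(b["cnpj"])]
    return kept


def drop_invalid_bonds(bonds):
    """Sketch of "validate2CNPJs": keep only pairs in which both CNPJs are valid."""
    return [(a, b) for a, b in bonds if cnpj_is_valid(a) and cnpj_is_valid(b)]
```

For example, calling `apply_extra_treatments(biddings, {"maxValue": "500", "startDate": "", "endDate": "", "validate1CNPJ": ""})` applies only the maximum-value filter, since the other treatments are left as empty strings.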