Issue on `mock_data_core`

CORE-forge / coresoi

R package for CORE set of indicators

https://core-forge.github.io/coresoi/

MIT License


Issue on `mock_data_core` #9

Open simonedelsa opened 1 year ago

simonedelsa commented 1 year ago

I've just started looking at the new version of the code for computing the indicators (I started with indicator #1). It's good!!! Moreover, in light of the last meeting, the code in general needs to be complemented with suitable tests.

To test the resulting code on 'real data', the provided guide suggests using 'mock_data_core', but it contains rows with the same cig (contract ID), and this definitely affects the final results. This can be fixed in two ways:

i. by having the code remove duplicated rows according to the variables needed for computing the indicator;
ii. by building the test data so that one row = one contract, using columns with nested objects when a 1-n relationship arises (e.g., for winner companies, modifications) --> preferred by me, but I'm afraid it would be too heavy, wouldn't it?
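The two options can be sketched on a toy table. This is only an illustration under assumptions: `cig` is the column named above, while `amount`, `winner`, and the values are made up and are not taken from `mock_data_core`.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: only `cig` is a real column name from the thread
mock <- tibble(
  cig    = c("C1", "C1", "C2", "C2", "C3"),
  amount = c(100, 100, 250, 250, 80),
  winner = c("Alpha", "Alpha", "Beta", "Gamma", "Delta")
)

# Option i: keep one row per distinct combination of the variables
# the indicator actually needs (here, assumed to be cig and amount)
option_i <- mock |>
  distinct(cig, amount)

# Option ii: one row per contract, with the 1-n relationship
# (contract -> winners) stored in a list-column of nested tibbles
option_ii <- mock |>
  distinct() |>
  nest(winners = winner)
```

Option i has to be repeated inside each indicator with its own set of variables, whereas option ii fixes the data once but makes each row carry a nested object.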

giuliogcantone commented 1 year ago

Correct me if I am wrong: is the problem of this kind?

library(dplyr)

tibble(
  contract = c("A", "A"),  # the same contract appears on two rows
  field_1 = c(1, NA),
  field_2 = c(NA, 2)
) -> A

If this is the issue, we just need a function to compact the mock data, plus the data management process. The first thing that comes to mind:

A |>
  group_by(contract) |>
  summarise(across(everything(), first_not_na))  # first_not_na is a placeholder, not defined here

(right now I do not remember how to express first_not_na but you got it).

I am sure there is an even more concise way to do this in dplyr.
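One possible way to write the placeholder, as a minimal sketch: `first_not_na` is a hypothetical helper (it is not a dplyr function), defined here with base R subsetting so that it returns `NA` when every value in the group is missing.

```r
library(dplyr)

# Hypothetical helper: first non-missing value of a vector,
# or NA if all values are missing
first_not_na <- function(x) x[!is.na(x)][1]

A <- tibble(
  contract = c("A", "A"),
  field_1  = c(1, NA),
  field_2  = c(NA, 2)
)

# Collapse to one row per contract, taking the first
# non-missing value of every other column
compacted <- A |>
  group_by(contract) |>
  summarise(across(everything(), first_not_na))
```

Note that this combines values coming from different rows of the same contract, so it is only appropriate when the rows are partial views of one record rather than genuinely different records.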

NiccoloSalvini commented 1 year ago

@simonedelsa is definitely right about that: each cig needs to be unique in order to generate the indicators. The duplicates arise because, in the pipeline I wrote, joining the ANAC tables on cf_stazione_appaltante (upper tables cluster: CENTRI DI COSTO, MODALITA REALIZZAZIONE, TIPO SCELTA CONTRAENTE) and on id_aggiudicazione (left tables cluster: FONTI FINANZIAMENTO, QUADRO ECONOMICO, etc.) sometimes returns more than one result for the same cig. Since I also deselected some of the columns (they are not of interest to us), you do not notice the difference, and the rows end up looking exactly the same. Either we do something like what @giuliogcantone suggests, or we modify the pipeline so that we retain only the first result. We might also want to exclude the upper and left tables, so that we avoid rows duplicated by design.
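The mechanism described above can be reproduced on a small example. The tables below are hypothetical stand-ins, not the actual ANAC tables or the real pipeline: only `cig` and `id_aggiudicazione` are names from this thread, while `amount` and `funding_source` are invented for illustration.

```r
library(dplyr)

# Hypothetical stand-ins for the contract table and a "left cluster" table
contracts <- tibble(
  cig               = c("C1", "C2"),
  id_aggiudicazione = c(10, 20),
  amount            = c(100, 250)
)
funding <- tibble(
  id_aggiudicazione = c(10, 10, 20),
  funding_source    = c("EU", "National", "Regional")
)

# The join yields two rows for C1, because it has two funding sources
joined <- contracts |>
  left_join(funding, by = "id_aggiudicazione")

# Deselecting the column that differs leaves rows that look identical
looks_duplicated <- joined |>
  select(-funding_source)

# One possible fix: retain only the first result for each cig
one_per_cig <- joined |>
  distinct(cig, .keep_all = TRUE)
```

Keeping the first result is arbitrary with respect to the dropped information (here the second funding source is lost), which is one argument for excluding those tables from the join altogether.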