Investigate performance and alternatives to clean one-off issues in upstream data

Bisaloo commented 7 months ago

Relevant function is here:

https://github.com/epiverse-trace/sivirep/blob/e91eea2f25facf540be4f1f92c4dd68af5a70853/R/cleaning_data.R#L413-L438

Can we avoid eval(parse()) and use a design that would let users plug in their own data file of issues (Excel or other)?
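
To make the question concrete, here is a minimal sketch of one possible design, not the current sivirep implementation. It assumes issues can be described as rows of (column, operator, value) in a table, which could come from a user-supplied CSV or Excel file. The function name `remove_flagged_rows`, the column names `edad` and `semana`, and the thresholds are all hypothetical. Operators are looked up in a fixed whitelist, so no string is ever parsed or evaluated as code.

```r
# Hypothetical issue table. In practice it could be read from a user file,
# for example with utils::read.csv().
issues <- data.frame(
  column   = c("edad", "semana"),
  operator = c(">", ">"),
  value    = c(120, 53),
  stringsAsFactors = FALSE
)

# Whitelist of supported comparison operators, mapped to the real functions.
operators <- list(
  "==" = `==`, "!=" = `!=`, ">" = `>`, ">=" = `>=`, "<" = `<`, "<=" = `<=`
)

# Drop every row of `data` that matches at least one issue.
remove_flagged_rows <- function(data, issues) {
  flagged <- rep(FALSE, nrow(data))
  for (i in seq_len(nrow(issues))) {
    op <- operators[[issues$operator[i]]]
    if (is.null(op)) {
      stop("Unsupported operator: ", issues$operator[i])
    }
    if (!issues$column[i] %in% names(data)) {
      stop("Unknown column: ", issues$column[i])
    }
    hit <- op(data[[issues$column[i]]], issues$value[i])
    # Missing values are never flagged.
    flagged <- flagged | (!is.na(hit) & hit)
  }
  data[!flagged, , drop = FALSE]
}

# Made-up example data: rows 2 and 3 each match one issue.
cases <- data.frame(edad = c(25, 130, 40), semana = c(10, 12, 60))
clean <- remove_flagged_rows(cases, issues)
```

Because each comparison is applied to a whole column at once, the cost grows with the number of issues rather than with the number of rows times the number of issues.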

Bisaloo commented 7 months ago

Do we know for sure if datasets are stable / frozen once they are uploaded to SIVIGILA, @GeraldineGomez? We could store a list of fingerprints for each dataset in sivirep to ensure this is always the case.

If so, the simplest option may be to hardcode the specific row numbers we want to exclude for each event.
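
A rough sketch of how the fingerprint check and the hardcoded row numbers could fit together, under the assumption that a dataset is fingerprinted by hashing a CSV serialisation of it. The helpers `fingerprint` and `exclude_rows` are hypothetical names, and a real implementation would more likely hash the downloaded file itself, since a re-serialised CSV can differ between platforms.

```r
# Fingerprint a data frame by hashing a CSV serialisation of it.
fingerprint <- function(data) {
  path <- tempfile(fileext = ".csv")
  on.exit(unlink(path))
  utils::write.csv(data, path, row.names = FALSE)
  unname(tools::md5sum(path))
}

# Drop the hardcoded rows only if the dataset still matches the fingerprint
# recorded when the row numbers were chosen.
exclude_rows <- function(data, expected_fingerprint, rows) {
  if (!identical(fingerprint(data), expected_fingerprint)) {
    warning("Dataset changed upstream; hardcoded row numbers were not applied.")
    return(data)
  }
  data[setdiff(seq_len(nrow(data)), rows), , drop = FALSE]
}

# Made-up example data: row 2 is the one-off issue to exclude.
cases <- data.frame(id = 1:4, edad = c(25, 130, 40, 33))
known <- fingerprint(cases)
clean <- exclude_rows(cases, known, rows = 2L)
```

If the upstream file is later modified, the fingerprints no longer match and the data is returned untouched with a warning, so stale row numbers never silently remove the wrong cases.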

What do you think?

GeraldineGomez commented 7 months ago

Hi @Bisaloo,

The datasets aren't frozen; their structure has been updated in some cases. Last year, three new columns were added, and the structure depends on the event itself and the year. I've attempted to create unified files with the rules, validations, and exceptions that sivirep needs to consider when cleaning the data. They act as a map and integrate the conditions from the NIH data dictionary.

In these files, I prioritized the columns that sivirep uses to generate the analysis and included the key columns related to the 'Codification of Events in the SIVIGILA document' to simplify the validations. Currently, I'm not taking the year into account as a variable, but it's important because the codification differs in some years, particularly 2012 and 2016.

I'm not sure that hardcoding these conditions is the best option. We would need to add N conditions for each year and keep them up to date as they grow or as NIH makes changes.

Perhaps an option is to generate the conditions from those files and improve the performance of the expression/condition evaluation?
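
As a sketch of what generating conditions from such files might look like, the example below assumes each rule gives a valid range for one column, scoped to an event code and a span of years. The layout of the `rules` table, the function `apply_rules`, the event code, and the ranges are all made up for illustration and are not taken from the actual rule files. Rules are applied as vectorised comparisons, so nothing is parsed or evaluated from text.

```r
# Hypothetical rules: valid range of a column for an event code and year span.
rules <- data.frame(
  cod_eve   = c(210, 210),
  year_from = c(2007, 2016),
  year_to   = c(2015, 2100),
  column    = c("edad", "edad"),
  min       = c(0, 0),
  max       = c(120, 110),
  stringsAsFactors = FALSE
)

# Keep only the rows of `data` that satisfy every rule active for the given
# event and year.
apply_rules <- function(data, rules, event, year) {
  active <- rules[
    rules$cod_eve == event & rules$year_from <= year & rules$year_to >= year,
    ,
    drop = FALSE
  ]
  keep <- rep(TRUE, nrow(data))
  for (i in seq_len(nrow(active))) {
    stopifnot(active$column[i] %in% names(data))
    values <- data[[active$column[i]]]
    # Missing values are kept; only values outside the range are dropped.
    ok <- is.na(values) | (values >= active$min[i] & values <= active$max[i])
    keep <- keep & ok
  }
  data[keep, , drop = FALSE]
}

# Made-up example data: the same cases cleaned under two different years.
cases <- data.frame(edad = c(25, 115, 130))
clean_2012 <- apply_rules(cases, rules, event = 210, year = 2012)
clean_2018 <- apply_rules(cases, rules, event = 210, year = 2018)
```

Scoping each rule to a year span means a codification change in a given year becomes a new row in the file rather than a new branch in the code.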

epiverse-trace / sivirep #106