TomCodd / NutritionTools

Tools for the Organisation, Matching, Calculation, and Summarisation of Nutrition Data
https://tomcodd.github.io/NutritionTools/

NutritionTools V2 thoughts and ideas #2

Open TomCodd opened 10 months ago

TomCodd commented 10 months ago

Updates to existing Functions

General

Fuzzy Matcher

New Functions

Automatic Renamer

Header condenser

Irregularity Checker

Empty Remover

Unit Checker

Grouping_Finder

Subfunctions

Column Input Checker

TomCodd commented 10 months ago

@LuciaSegovia sorry, forgot to tag you! not urgent at all, just so you can have a look if/when you want :)

LuciaSegovia commented 10 months ago

Thank you @TomCodd ! This is super helpful!

Adding Function

These functions are already created and just need to be documented, or need minor adjustments:

New functions

TomCodd commented 10 months ago

Hi @LuciaSegovia ! :)

Thanks for this, great ideas! :)

So 3 are basic calculators, and one (the B6 standardised) sounds like it would be a similar structure to the CARTBEQ standardiser? https://tomcodd.github.io/NutritionTools/reference/CARTBEQ_standardised.html

The implausible values checker sounds like the unit checker I suggested above - I was worried about making it too specific (i.e. to food group) because that's something we haven't standardised (...yet), so it's going to be difficult to rename, and difficult to assume that the groups a certain FCT uses are similar enough to the food groups we have already processed to generate the rules (e.g. fibre > 0). It should be easy to create a function which is completely internal: it would build a bell curve from the items within the food groups the FCT being studied outlines, and look for outliers within each group. It just wouldn't be able to apply external rules or compare the values to external datasets.
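The internal, group-relative check described above could be sketched roughly as follows. This is only an illustration, not package code - the function name, column arguments, and the z-score threshold are all assumptions:

```r
# Sketch of an internal outlier checker: flag items more than `threshold`
# standard deviations from the mean of their own food group. Hypothetical
# names throughout; no external rules or datasets are used.
Group_Outlier_Checker <- function(df, group_col, value_col, threshold = 3) {
  flagged <- do.call(rbind, lapply(split(df, df[[group_col]]), function(g) {
    vals <- g[[value_col]]
    mu <- mean(vals, na.rm = TRUE)
    sigma <- sd(vals, na.rm = TRUE)
    if (is.na(sigma) || sigma == 0) return(g[0, ]) # too few items to judge
    g[abs(vals - mu) / sigma > threshold & !is.na(vals), ]
  }))
  rownames(flagged) <- NULL
  flagged
}
```

As noted, this can only find values that are unusual *relative to the FCT itself*, which may be weak when a group has few items.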

LuciaSegovia commented 10 months ago

Hi @TomCodd!

So the three calculators are already developed and just need to be added to the package. Then, the "standardiser" is a loop that combines food components that can be reported under different Tagnames; we have used it for vitamin B6, fat, and thiamine, and it will probably be used for fibre and folate in the future as well.

Re: implausible values. By implausible, I really mean impossible, like alcohol in broccoli, or carbohydrates in raw fish.

I think standardisation of the food groups is something we should do anyway. But even if we don't standardise the groups, we could create the function with a standard food group list; the user could then input the column and a mapping of their food group names to the standard groups the function uses. I am not sure I am explaining myself well, but happy to discuss in the near future :)
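That user-supplied mapping idea could look something like this. Everything here (the function name, the rule table layout, the Tagnames and group names) is illustrative, not an agreed design:

```r
# Sketch: the user maps their own FCT food groups onto a standard list, and
# a rule table of impossible values (e.g. alcohol in vegetables) is applied
# against the standardised groups. All names are hypothetical examples.
check_implausible <- function(df, group_col, group_map, rules) {
  std_group <- group_map[df[[group_col]]]   # user FCT group -> standard group
  hits <- lapply(seq_len(nrow(rules)), function(i) {
    r <- rules[i, ]
    bad <- std_group == r$std_group &
           !is.na(df[[r$component]]) & df[[r$component]] > r$max_value
    df[which(bad), ]
  })
  do.call(rbind, hits)
}
```

The rules themselves (one row per standard group and component) would only need to be written once, since they target the standard list rather than any individual FCT's groups.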

Then, the internal bell curve (though I have my concerns about whether we have enough data to create any curve) would be good for outlier detection.

TomCodd commented 10 months ago

Hi @LuciaSegovia !

OK - I just need to apply the data checks that the current calculators have; should be easy enough :)

I agree that creating a standardised list of food groups is a good idea, but I don't see how we can enforce it - there are too many different systems out there. Kenya has 15 food groups, for example; UK21 has 124, unless you use the primary groups, in which case it has 14. US19 has 25.

Even if they have a similar number of groups/primary groups, some of the items are split over them. Take an egg, for example: in the US19 dataset that's in the 'Dairy and Egg products' food group, but in KE18 it's in 'Meats, Poultry and Eggs', which has no overlap with the 'Milk and dairy products' group that US19 has eggs merged with. In the UK it's different again, with 'Eggs' having its own primary group.

This is the problem I couldn't think my way around when I was considering the possible new functions - there's too much overlap and too many irregular decisions to be able to come up with these rules, even with the user inputting the food group names themselves. I'd be more than happy to discuss it if you think there is a solution I'm missing, or if you have any other ideas though :)

I suppose one possible way of doing it would be to get the user to create these rules manually? They could select the food group from their FCT's list, and then the rule they would expect that food group to follow, and any items that don't follow it are output. But that then essentially becomes a glorified filter, so we might be better off showing users how to use a filter instead...
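For comparison, the "glorified filter" version of one such rule is a one-liner in base R (column and group names here are made up for illustration):

```r
# A user-written rule as a plain filter: flag vegetable items whose fibre
# is reported as 0. `fct_table`, `food_group` and `FIBTG` are examples.
fct_table <- data.frame(food_group = c("Vegetables", "Vegetables", "Fish"),
                        FIBTG = c(0, 2.5, 0))
suspect <- subset(fct_table, food_group == "Vegetables" & FIBTG == 0)
```

Which is the point being made above: if the rules are fully manual, documenting this pattern may serve users better than wrapping it in a function.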

I'm happy to discuss this too :)

LuciaSegovia commented 10 months ago

Indeed @TomCodd ! I was thinking that the users would be the ones sorting the food groups/items themselves! For our own use I have some ideas, but I think they're out of scope for this issue!

I have a much-needed update for the summariser!! It's to relax the weight checker. I have implemented it in my own function, but I think it should be in the main one, because due to rounding it is very common that the weights sum to between 0.99 and 1.03.

So the current code looks like this:

```r
if(remaining_total > 0.03){
  message(paste0("Error - weighting values for item ID ",
                 unique(sorted_table[[group_ID_col]]),
                 " are greater than 1. Weighting cannot be completed."))
  stop()
}

weightings_column[is.na(weightings_column)] <- remaining_total/number_of_NA
sorted_table[[input_weighting_column]] <- as.numeric(weightings_column)
weighting_total <- sum(sorted_table[[input_weighting_column]])

if(weighting_total < 0.990 | weighting_total > 1.03 ){
  message(paste0("Error - weighting values for item ID ",
                 unique(sorted_table[[group_ID_col]]),
                 " do not total 1. Weighting cannot be completed."))
  stop()
}
```

I have adapted to:

```r
if(remaining_total > 0.03){
  message(paste0("Error - weighting values for item ID ",
                 unique(sorted_table[[group_ID_col]]),
                 " are greater than 1. Weighting cannot be completed."))
  stop()
}

weightings_column[is.na(weightings_column)] <- remaining_total/number_of_NA
sorted_table[[input_weighting_column]] <- as.numeric(weightings_column)
weighting_total <- sum(sorted_table[[input_weighting_column]])

if(weighting_total < 0.990 | weighting_total > 1.03 ){
  message(paste0("Error - weighting values for item ID ",
                 unique(sorted_table[[group_ID_col]]),
                 " do not total 1. Weighting cannot be completed."))
  stop()
}
```

TomCodd commented 10 months ago

Hi @LuciaSegovia ! Ahh, ok! Sorry, I misunderstood :)

Hmmm, ok! Just to check: remaining_total is only relevant if the weightings are set by the user - and currently in that situation there is no rounding applied (although I should add that, to be honest - to the values the function calculates to fill in any gaps in the existing weightings evenly). What's it doing wrong? And did the changes you made fix the problem? I'm a little perplexed, and interested in what's going on! :)

LuciaSegovia commented 9 months ago

Hi @TomCodd!

Sorry! By rounding I meant the rounding that is done when generating the weights (manually), so it happens outside the group summariser function. But I think it is important to fix this, because we are using manual weights almost every time we use the summariser, to be honest.

I think I changed it when we were setting the weights manually for certain foods as we prepared the NCTs for the FAO. More recently, I used it to generate the NCT for Tanzania for TFNC, where I used some data from a different survey to allocate weights to the TZ food matches. You can check the weight allocation here, and you can see the summariser being applied for that piece of work here.

The problem is that, as it stands, the summariser does not run when the total of the weights is not exactly 1 or 0. It stops altogether, even if the sum of the weights is 0.99. Hence, for it to work I had to make the total weight checker a bit more flexible (i.e. between 0.99 and 1.03), so the summariser can run and produce the table.

Does it make sense?

Maybe you can find a more elegant way to solve the issue, but for now it's working 😅

Thanks!

TomCodd commented 9 months ago

Hi @LuciaSegovia !

Ahh, I see now! Sorry, I was assuming it was all happening inside the function, not that external rounding might have this effect :)

I'll have a think about a more elegant solution. I suspect it's going to be: if external weighting is used AND every item in a group has a weighting value, ignore the rounding. But if there are weighting values for some items and others are left blank (the Summariser splits the remainder across those equally), and the total of the ones that do have weighting values is already >1, it'll still flag - which I think is fair enough, as hitting this issue in that situation means the user has clearly not done the weightings correctly.

I'll also include a "forgiveness" value as an optional input - this will be a value that the weighting total can exceed/come short of 1 by (in this case 0.03, for an accepted range of 0.97-1.03).

Do you think that would be a sensible solution? :)

LuciaSegovia commented 9 months ago

Hi @TomCodd!

As long as the forgiveness value is there I'm happy with it 🤣 Do you know when you will be updating the function?

Thanks!

TomCodd commented 9 months ago

Hi @LuciaSegovia !

This should now be live, in version 1.0.1 :) Please can you let me know how it goes? :)

LuciaSegovia commented 9 months ago

Hi @TomCodd!

Just checked and it's not working. I got this message:

```
Error - weighting values for item ID 1001 are greater than 1 + weighting_leniency. Weighting cannot be completed.
Error in Group_Summariser(nct_duplicates, "item_id", input_weighting_column = "Wt",
```

And the total weight of item ID 1001 is 1.0001.

Thanks!

TomCodd commented 9 months ago

Hi @LuciaSegovia !

What input did you use? What was weighting_leniency set to? It's a new input for the function, but it defaults to 0 :)

LuciaSegovia commented 9 months ago

Oh Sorry @TomCodd! I didn't know that it was an option. Let me change it and try again.

Ok, I tried with weighting_leniency = 0.01 and I got this message now:

```
Error - weighting values for item ID 1001 do not total 1. Weighting cannot be completed.
Error - weighting values for item ID SUMMARY ROW - 1001 do not total 1. Weighting cannot be completed.
Error - weighting values for item ID do not total 1. Weighting cannot be completed.
Error in Group_Summariser(nct_duplicates, "item_id", input_weighting_column = "Wt", :
```

TomCodd commented 9 months ago

Hi @LuciaSegovia !

Sorry, that's definitely a problem - can you send me the dataset you were trying it with (i.e. save it as an R data file just before it is input into the Summariser), and the Group Summariser command you were running? I'll see if I can sort it :)

TomCodd commented 1 month ago

FM update idea/column renamer idea:

ID and food name columns will be the ones with the most unique rows - and should contain exactly the same number of unique rows. If both are present, we can search for words which always appear in FCTs/surveys - 'Raw', certain food items, etc. (need to check which appear most often). The column containing those will be the food name.
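That heuristic could be sketched along these lines - purely illustrative, with a made-up keyword list and function name:

```r
# Sketch: candidate ID/name columns are those with one unique value per row;
# among candidates, the food name column is the one containing common FCT
# vocabulary. The keyword list here is a guess and would need research.
find_id_and_name_cols <- function(df, keywords = c("raw", "boiled", "dried")) {
  n_unique <- vapply(df, function(col) length(unique(col)), integer(1))
  candidates <- names(df)[n_unique == nrow(df)]
  pattern <- paste(keywords, collapse = "|")
  hits <- vapply(candidates, function(nm) {
    sum(grepl(pattern, tolower(as.character(df[[nm]]))))
  }, integer(1))
  name_col <- candidates[which.max(hits)]
  list(id_col = setdiff(candidates, name_col)[1], name_col = name_col)
}
```

A real version would need guards for FCTs with duplicate IDs or with nutrient columns that happen to be all-unique.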

In the column renamer, we can then use a lookup table for the remaining columns, potentially a fuzzy match/stringdist with the user confirming the choices through the terminal.
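A rough sketch of that lookup step, using base R's `adist()` edit distance in place of the stringdist package (the standard names and threshold are invented for illustration):

```r
# Suggest a standard name for each remaining header by edit distance;
# headers with no close match get NA (and would be put to the user via
# the terminal in the real renamer). Standard names are examples only.
suggest_renames <- function(headers, standard_names, max_dist = 3) {
  vapply(headers, function(h) {
    d <- adist(tolower(h), tolower(standard_names))
    if (min(d) <= max_dist) standard_names[which.min(d)] else NA_character_
  }, character(1))
}
```

In the interactive version, each suggestion would be offered to the user to accept or override rather than applied silently.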

The FM can use this to negate the need for the user to strip down the dataset manually before using the FM.

Can also detect the FCT column, if present. If not, can create an FCT column, by finding terms, entries, or ID structures which only appear in a single FCT.

Add the ability to differentiate between matches in a hierarchical order based on the FCT, in case of multimatch. The user would need an FCT column, and the stringdist leniency would need to be found. Or just a priority FCT, so the top match from that FCT is always shown (or 2 matches, user preference).
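The priority-FCT variant might be resolved along these lines - a sketch only, with assumed column names (`fct`, `score`) and the convention that a lower stringdist score means a closer match:

```r
# Sketch: for each input item with multiple candidate matches, keep matches
# from the priority FCT if any exist, then take the closest by score.
resolve_multimatch <- function(matches, item_col, priority_fct) {
  do.call(rbind, lapply(split(matches, matches[[item_col]]), function(m) {
    preferred <- m[m$fct == priority_fct, ]
    if (nrow(preferred) > 0) m <- preferred
    m[which.min(m$score), ]   # lower stringdist score = closer match
  }))
}
```

The fully hierarchical version would take an ordered vector of FCTs instead of a single priority, falling through to the next FCT when the preferred one has no match within the leniency.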

Or, an overhaul - do the user interaction through the console. No shiny table, so it should be much faster, with fewer dependencies.

@LuciaSegovia Any thoughts? Just some ideas that keep rattling around my head as I'm shovelling 😂