TomCodd / NutritionTools

Tools for the Organisation, Matching, Calculation, and Summarisation of Nutrition Data
https://tomcodd.github.io/NutritionTools/

NutritionTools V2 thoughts and ideas #2

Open TomCodd opened 10 months ago

TomCodd commented 10 months ago

Updates to existing Functions

General

Fuzzy Matcher

New Functions

Automatic Renamer

Header condenser

Irregularity Checker

Empty Remover

Unit Checker

Grouping_Finder

Subfunctions

Column Input Checker

TomCodd commented 10 months ago

@LuciaSegovia sorry, forgot to tag you! not urgent at all, just so you can have a look if/when you want :)

LuciaSegovia commented 10 months ago

Thank you @TomCodd ! This is super helpful!

Adding Function

These functions are already created and just need to be documented, or need minor adjustments:

New functions

TomCodd commented 10 months ago

Hi @LuciaSegovia ! :)

Thanks for this, great ideas! :)

So 3 are basic calculators, and one (the B6 standardised) sounds like it would be a similar structure to the CARTBEQ standardiser? https://tomcodd.github.io/NutritionTools/reference/CARTBEQ_standardised.html

The implausible values checker sounds like the unit checker I suggested above - I was worried about making it too specific (i.e. to food group) because that's something we haven't standardised (...yet), so it's going to be difficult to rename, and difficult to assume that the groups a certain FCT uses are similar enough to the food groups we have already processed to generate the rules (e.g. fibre > 0). It should be easy to create a function which is completely internal: it would build a bell curve from the items within the food groups the FCT being studied outlines, and look for outliers within each group. It just wouldn't be able to apply external rules or compare the values to external datasets.
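The internal, group-relative check described above could be sketched roughly as follows. This is only an illustration, not package code - the function name, column arguments, and the z-score threshold are all assumptions:

```r
# Sketch of an internal outlier checker: flag items more than `threshold`
# standard deviations from the mean of their own food group. Hypothetical
# names throughout; no external rules or datasets are used.
Group_Outlier_Checker <- function(df, group_col, value_col, threshold = 3) {
  flagged <- do.call(rbind, lapply(split(df, df[[group_col]]), function(g) {
    vals <- g[[value_col]]
    mu <- mean(vals, na.rm = TRUE)
    sigma <- sd(vals, na.rm = TRUE)
    if (is.na(sigma) || sigma == 0) return(g[0, ]) # too few items to judge
    g[abs(vals - mu) / sigma > threshold & !is.na(vals), ]
  }))
  rownames(flagged) <- NULL
  flagged
}
```

As noted, this can only find values that are unusual *relative to the FCT itself*, which may be weak when a group has few items.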

LuciaSegovia commented 10 months ago

Hi @TomCodd!

So the three calculators are already developed and just need to be added to the package. Then, the "standardiser" is a loop that combines food components that can be reported under different Tagnames; we have used it for vitamin B6, fat, and thiamine, and it will probably be used for fibre and folate in the future as well.

Re: implausible values. By implausible, I really mean impossible, like alcohol in broccoli, or carbohydrates in raw fish.

I think standardisation of the food groups is something we should do anyway. But even if we don't standardise the groups, we could create the function with a standard food group list; the user could then input the column and a mapping of their food group names to the standard groups the function uses. I am not sure I am explaining myself well, but happy to discuss in the near future :)
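That user-supplied mapping idea could look something like this. Everything here (the function name, the rule table layout, the Tagnames and group names) is illustrative, not an agreed design:

```r
# Sketch: the user maps their own FCT food groups onto a standard list, and
# a rule table of impossible values (e.g. alcohol in vegetables) is applied
# against the standardised groups. All names are hypothetical examples.
check_implausible <- function(df, group_col, group_map, rules) {
  std_group <- group_map[df[[group_col]]]   # user FCT group -> standard group
  hits <- lapply(seq_len(nrow(rules)), function(i) {
    r <- rules[i, ]
    bad <- std_group == r$std_group &
           !is.na(df[[r$component]]) & df[[r$component]] > r$max_value
    df[which(bad), ]
  })
  do.call(rbind, hits)
}
```

The rules themselves (one row per standard group and component) would only need to be written once, since they target the standard list rather than any individual FCT's groups.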

Then, the internal bell curve (though I have my concerns about whether we have enough data to create any curve) would be good for outlier detection.

TomCodd commented 10 months ago

Hi @LuciaSegovia !

OK - I just need to apply the data checks that the current calculators have; should be easy enough :)

I agree that creating a standardised list of food groups is a good idea, but I don't see how we can enforce it - there are too many different systems out there. Kenya has 15 food groups, for example; UK21 has 124, unless you use the primary groups, in which case it has 14. US19 has 25.

Even if they have a similar number of groups/primary groups, some of the items are split over them. Take an egg, for example: in the US19 dataset that's in the 'Dairy and Egg products' food group, but in KE18 it's in 'Meats, Poultry and Eggs', which has no overlap with the 'Milk and dairy products' group that US19 has eggs merged with. In the UK it's different again, with 'Eggs' having its own primary group.

This is the problem I couldn't think my way around when I was considering the possible new functions - there's too much overlap and too many irregular decisions to be able to come up with these rules, even with the user inputting the food group names themselves. I'd be more than happy to discuss it if you think there is a solution I'm missing, or if you have any other ideas though :)

I suppose one possible way of doing it would be to get the user to create these rules manually? They could select the food group from their FCT's list, and then the rule they would expect that food group to follow, and any items that don't follow it are output. But that then essentially becomes a glorified filter, so we might be better off showing users how to use a filter instead...
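For comparison, the "glorified filter" version of one such rule is a one-liner in base R (column and group names here are made up for illustration):

```r
# A user-written rule as a plain filter: flag vegetable items whose fibre
# is reported as 0. `fct_table`, `food_group` and `FIBTG` are examples.
fct_table <- data.frame(food_group = c("Vegetables", "Vegetables", "Fish"),
                        FIBTG = c(0, 2.5, 0))
suspect <- subset(fct_table, food_group == "Vegetables" & FIBTG == 0)
```

Which is the point being made above: if the rules are fully manual, documenting this pattern may serve users better than wrapping it in a function.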

I'm happy to discuss this too :)

LuciaSegovia commented 10 months ago

Indeed @TomCodd ! I was thinking that the users would be the ones sorting the food groups/items themselves! For our own use I have some ideas, but I think they're out of scope for this issue!

I have a much-needed update for the summariser!! It's to relax the weight checker. I have implemented it in my own function, but I think it should be in the main one, because due to rounding it is very common that the weights sum to between 0.99 and 1.03.

So the current code looks like this:

```r
if(remaining_total > 0.03){
  message(paste0("Error - weighting values for item ID ",
                 unique(sorted_table[[group_ID_col]]),
                 " are greater than 1. Weighting cannot be completed."))
  stop()
}

weightings_column[is.na(weightings_column)] <- remaining_total/number_of_NA
sorted_table[[input_weighting_column]] <- as.numeric(weightings_column)
weighting_total <- sum(sorted_table[[input_weighting_column]])

if(weighting_total < 0.990 | weighting_total > 1.03 ){
  message(paste0("Error - weighting values for item ID ",
                 unique(sorted_table[[group_ID_col]]),
                 " do not total 1. Weighting cannot be completed."))
  stop()
}
```

I have adapted to:

```r
if(remaining_total > 0.03){
  message(paste0("Error - weighting values for item ID ",
                 unique(sorted_table[[group_ID_col]]),
                 " are greater than 1. Weighting cannot be completed."))
  stop()
}

weightings_column[is.na(weightings_column)] <- remaining_total/number_of_NA
sorted_table[[input_weighting_column]] <- as.numeric(weightings_column)
weighting_total <- sum(sorted_table[[input_weighting_column]])

if(weighting_total < 0.990 | weighting_total > 1.03 ){
  message(paste0("Error - weighting values for item ID ",
                 unique(sorted_table[[group_ID_col]]),
                 " do not total 1. Weighting cannot be completed."))
  stop()
}
```

TomCodd commented 10 months ago

Hi @LuciaSegovia ! Ahh, ok! Sorry, I misunderstood :)

Hmmm, ok! Just to check: remaining_total is only relevant if the weightings are set by the user - and currently in that situation there is no rounding applied (although I should add that, to be honest - to the values the function calculates to fill in any gaps in the existing weightings evenly). What's it doing wrong? And did the changes you made fix the problem? I'm a little perplexed, and interested in what's going on! :)

LuciaSegovia commented 9 months ago

Hi @TomCodd!

Sorry! By rounding I meant the rounding that is done when generating the weights (manually), so it happens outside the group summariser function. But I think it is important to fix this, because we are using manual weights almost every time we use the summariser, to be honest.

I think I changed it when we were setting the weights manually for certain foods as we prepared the NCTs for the FAO. More recently, I used it to generate the NCT for Tanzania for TFNC, where I used some data from a different survey to allocate weights to the TZ food matches. You can check the weight allocation here, and you can see the summariser being applied for that piece of work here.

The problem is that, as it stands, the summariser does not run when the total of the weights is not exactly 1 or 0. It stops altogether, even if the sum of the weights is 0.99. Hence, for it to work I had to make the total weight checker a bit more flexible (i.e. between 0.99 and 1.03), so the summariser can run and produce the table.

Does it make sense?

Maybe you can find a more elegant way to solve the issue, but for now it's working 😅

Thanks!

TomCodd commented 9 months ago

Hi @LuciaSegovia !

Ahh, I see now! Sorry, I was assuming it was all happening inside the function, not that external rounding might have this effect :)

I'll have a think about a more elegant solution. I suspect it's going to be: if external weighting is used AND every item in a group has a weighting value, ignore the rounding. But if there are weighting values for some items and others are left blank (the Summariser splits the remainder across those equally), and the total of the ones that do have weighting values is already >1, it'll still flag - which I think is fair enough, as hitting this issue in that situation means the user has clearly not done the weightings correctly.

I'll also include a "forgiveness" value as an optional input - this will be a value that the weighting total can exceed/come short of 1 by (in this case 0.03, for an accepted range of 0.97-1.03).

Do you think that would be a sensible solution? :)

LuciaSegovia commented 9 months ago

Hi @TomCodd!

As long as the forgiveness value is there I'm happy with it 🤣 Do you know when you will be updating the function?

Thanks!

TomCodd commented 9 months ago

Hi @LuciaSegovia !

This should now be live, in version 1.0.1 :) Please can you let me know how it goes? :)

LuciaSegovia commented 9 months ago

Hi @TomCodd!

Just checked and it's not working. I got this message:

```
Error - weighting values for item ID 1001 are greater than 1 + weighting_leniency. Weighting cannot be completed.
Error in Group_Summariser(nct_duplicates, "item_id", input_weighting_column = "Wt",
```

And the total weight of item ID 1001 is 1.0001.

Thanks!

TomCodd commented 9 months ago

Hi @LuciaSegovia !

What input did you use? What was weighting_leniency set to? It's a new input for the function, but it defaults to 0 :)

LuciaSegovia commented 9 months ago

Oh Sorry @TomCodd! I didn't know that it was an option. Let me change it and try again.

Ok, I tried with weighting_leniency = 0.01 and I got this message now:

```
Error - weighting values for item ID 1001 do not total 1. Weighting cannot be completed.
Error - weighting values for item ID SUMMARY ROW - 1001 do not total 1. Weighting cannot be completed.
Error - weighting values for item ID do not total 1. Weighting cannot be completed.
Error in Group_Summariser(nct_duplicates, "item_id", input_weighting_column = "Wt", :
```

TomCodd commented 9 months ago

Hi @LuciaSegovia !

Sorry, that's definitely a problem - can you send me the dataset you were trying it with (i.e. save it as an R data file just before it is input into the Summariser), and the Group Summariser command you were running? I'll see if I can sort it :)

TomCodd commented 1 month ago

FM update idea/column renamer idea:

ID and food name columns will be the ones with the most unique rows - and should contain exactly the same number of unique rows. If both are present, we can search for words which always appear in FCTs/surveys - 'Raw', certain food items, etc. (need to check which appear most often). The column containing those will be the food name.
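That heuristic could be sketched along these lines - purely illustrative, with a made-up keyword list and function name:

```r
# Sketch: candidate ID/name columns are those with one unique value per row;
# among candidates, the food name column is the one containing common FCT
# vocabulary. The keyword list here is a guess and would need research.
find_id_and_name_cols <- function(df, keywords = c("raw", "boiled", "dried")) {
  n_unique <- vapply(df, function(col) length(unique(col)), integer(1))
  candidates <- names(df)[n_unique == nrow(df)]
  pattern <- paste(keywords, collapse = "|")
  hits <- vapply(candidates, function(nm) {
    sum(grepl(pattern, tolower(as.character(df[[nm]]))))
  }, integer(1))
  name_col <- candidates[which.max(hits)]
  list(id_col = setdiff(candidates, name_col)[1], name_col = name_col)
}
```

A real version would need guards for FCTs with duplicate IDs or with nutrient columns that happen to be all-unique.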

In the column renamer, we can then use a lookup table for the remaining columns, potentially a fuzzy match/stringdist with the user confirming the choices through the terminal.
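A rough sketch of that lookup step, using base R's `adist()` edit distance in place of the stringdist package (the standard names and threshold are invented for illustration):

```r
# Suggest a standard name for each remaining header by edit distance;
# headers with no close match get NA (and would be put to the user via
# the terminal in the real renamer). Standard names are examples only.
suggest_renames <- function(headers, standard_names, max_dist = 3) {
  vapply(headers, function(h) {
    d <- adist(tolower(h), tolower(standard_names))
    if (min(d) <= max_dist) standard_names[which.min(d)] else NA_character_
  }, character(1))
}
```

In the interactive version, each suggestion would be offered to the user to accept or override rather than applied silently.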

The FM can use this to negate the need for the user to strip down the dataset manually before using the FM.

Can also detect the FCT column, if present. If not, can create an FCT column, by finding terms, entries, or ID structures which only appear in a single FCT.

Add the ability to differentiate between matches in a hierarchical order based on the FCT, in case of multimatch. The user would need an FCT column, and the stringdist leniency would need to be found. Or just a priority FCT, so the top match from that FCT is always shown (or 2 matches, user preference).
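The priority-FCT variant might be resolved along these lines - a sketch only, with assumed column names (`fct`, `score`) and the convention that a lower stringdist score means a closer match:

```r
# Sketch: for each input item with multiple candidate matches, keep matches
# from the priority FCT if any exist, then take the closest by score.
resolve_multimatch <- function(matches, item_col, priority_fct) {
  do.call(rbind, lapply(split(matches, matches[[item_col]]), function(m) {
    preferred <- m[m$fct == priority_fct, ]
    if (nrow(preferred) > 0) m <- preferred
    m[which.min(m$score), ]   # lower stringdist score = closer match
  }))
}
```

The fully hierarchical version would take an ordered vector of FCTs instead of a single priority, falling through to the next FCT when the preferred one has no match within the leniency.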

Or, an overhaul - do the user interaction through the console. No shiny table, so it should be much faster, with fewer dependencies.

@LuciaSegovia Any thoughts? Just some ideas that keep rattling around my head as I'm shovelling 😂