aim-rsf / browseMetadata

An R package to help a researcher browse metadata for health datasets and categorise variables based on research domains
https://aim-rsf.github.io/browseMetadata/
GNU General Public License v3.0

Re-work the csv output logs & validate user inputs #104

Closed · RayStick closed this 3 weeks ago

RayStick commented 1 month ago

❗ Sorry, this is a big PR ... not best practice ...

Closes #102

Closes #91

Closes #97

Closes #82


Rainiefantasy commented 3 weeks ago

> Package version number has now been included in the csv output logs. This could be important to track which version number created the csv outputs, particularly if the compare_csv_outputs command is being used to compare outputs created from different package versions.

Is this the browseMetadata column? I can see v1.0.1, so that's great 😸!

> When doing this, I realised it made sense to split the csv output log into two files, not one - one file is about the session itself (LOG) and the other csv file contains the actual categorisations made by the user (OUTPUT). Previously there was a lot of repetition in the OUTPUT csv file. They can be linked via the timestamp column.

That's fair enough - the timestamp is in the filename as well, so it makes sense to use it as an identifier πŸ‘ Can see both files, looks good!
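
For anyone reading along, here is a minimal R sketch (not code from the package) of how the two files can be linked on the shared timestamp column; the file names follow the examples later in this thread, and the exact column name is an assumption:

```r
# Read the session log and the categorisations made in that session
log_df    <- read.csv("LOG_NationalCommunityChildHealthDatabase(NCCHD)_BLOOD_TEST_2024-07-05_16-07-38.599493.csv")
output_df <- read.csv("OUTPUT_NationalCommunityChildHealthDatabase(NCCHD)_BLOOD_TEST_2024-07-05_16-07-38.599493.csv")

# Attach the session-level info (e.g. package version) to every
# categorisation row via the shared timestamp column
merged_df <- merge(output_df, log_df, by = "timestamp")
```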

> Because there are now two csv outputs, the input arguments for compare_csv_outputs had to be adjusted. See the README, which reflects these changes to the code.

When trying to run the compare function:

```r
compare_csv_outputs(
  'LOG_NationalCommunityChildHealthDatabase(NCCHD)_BLOOD_TEST_2024-07-05_16-07-38.599493.csv',
  'OUTPUT_NationalCommunityChildHealthDatabase(NCCHD)_BLOOD_TEST_2024-07-05_16-07-38.599493.csv',
  'LOG_NationalCommunityChildHealthDatabase(NCCHD)_BLOOD_TEST_2024-07-08_12-03-30.429336.csv',
  'OUTPUT_NationalCommunityChildHealthDatabase(NCCHD)_BLOOD_TEST_2024-07-08_12-03-30.429336.csv',
  'Documents/GIT/browse-SAIL/data-raw/national_community_child_health_database_(ncchd)_20240405T130125.json'
)
```

I get a prompt for a mismatch and to 'β„Ή Provide concensus decision for this DataElement:', but when I enter a number (one of the two domain classifications) it errors with:

```
Error in user_categorisation(selectTable_df$Label[datavar], selectTable_df$Description[datavar], :
  argument "domain_code_max" is missing, with no default
```

Not sure if I'm doing something wrong here? Other than that, I'm not sure why you need all four files as opposed to just having the two classification files as you are comparing the output - was there a reason for this?

Rainiefantasy commented 3 weeks ago

> See line 143 of domain_mapping, which allows the user to add an optional free-text note about this table. They always have the option to skip a table and not process it as well.

This looks good to me! ✨
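
As a side note, here is a tiny sketch of the kind of optional note prompt being described (hypothetical, not the code at line 143 itself):

```r
# Offer an optional free-text note about the table; an empty answer skips it
table_note <- readline(prompt = "Optional note about this table (press Enter to skip): ")
if (nzchar(table_note)) {
  message("Note recorded: ", table_note)
}
```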

> From working on compare_csv_outputs, I don't see an issue with two output files having the same initials, and therefore think this issue can be closed - let me know if you spot anything.

Sounds good! I agree - means the function can be used as a sanity check for a user as well :)

> This commit https://github.com/aim-rsf/browseMetadata/commit/0dc8c36eb39748bfe943dc3a8320b1a36c3b7b45 made changes to the user_categorisation function to validate and standardise user inputs. Smaller changes were made to domain_mapping, as there is now a 4th argument for user_categorisation.

Only looked briefly at the code here, but it looks OK at a glance πŸ‘
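
For readers skimming the commit, here is a rough sketch of the validation pattern it describes (a simplified stand-in, not the package's actual user_categorisation code), where domain_code_max is the new 4th argument:

```r
# Re-prompt until the user enters a whole number between 1 and domain_code_max
prompt_domain_code <- function(domain_code_max) {
  repeat {
    raw <- trimws(readline(prompt = paste0("Enter domain code [1-", domain_code_max, "]: ")))
    if (grepl("^[0-9]+$", raw)) {
      code <- as.integer(raw)
      if (code >= 1 && code <= domain_code_max) {
        return(code)
      }
    }
    message("Please enter a number between 1 and ", domain_code_max)
  }
}
```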

> DESCRIPTION file version number brought up to date
> Fixed a bug in compare_csv_outputs and validated that all the checks work

I seem to still have issues with that function; I can double-check I've got the latest version.

Rainiefantasy commented 3 weeks ago

Another minor piece of feedback - I may have suggested it'd be good to have a choice to amend classifications in case of a mistake, but the more I'm using it, the more I'm wondering if it's necessary after every classification. Personally, I think having just one prompt at the end to check your classifications, and amend them if you're not happy, is enough. See what you think though - this is just my experience of using it 😸

RayStick commented 3 weeks ago

> Another minor piece of feedback - I may have suggested it'd be good to have a choice to amend classifications in case of a mistake, but the more I'm using it, the more I'm wondering if it's necessary after every classification. Personally, I think having just one prompt at the end to check your classifications, and amend them if you're not happy, is enough. See what you think though - this is just my experience of using it 😸

I wondered if it was necessary as well, but actually both Dan and Nida suggested this feature would be useful. I think when I am in 'test mode' I find the number of prompts given to me as the user annoying XD, but that's because I am doing it quickly. Going through it properly with a real use-case, I can see the extra functionality to correct things can be good, so I'll leave this functionality in for now.
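
For comparison, here is a hypothetical sketch of the 'one prompt at the end' alternative discussed above (not what the package does - it currently offers an amend prompt after each classification):

```r
# Show all categorisations once, then offer a single chance to amend
review_categorisations <- function(categorisations_df) {
  print(categorisations_df)
  answer <- readline(prompt = "Enter a row number to amend, or press Enter to accept all: ")
  if (nzchar(answer)) as.integer(answer) else NULL
}
```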

RayStick commented 3 weeks ago

@Rainiefantasy your errors with the compare_csv_outputs.R function were due to me not updating the call to the user_categorisation function. See line 134 of https://github.com/aim-rsf/browseMetadata/pull/104/commits/9cad2f6a43126662ec250356478b5d25bbdfb2db where I have now added 'max(Code$Code)' as an input argument, which solved the error you got.
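
For the thread, a hedged sketch of the shape of that fix (the exact call is at line 134 of the linked commit; the middle arguments are elided here because they are not shown in this thread, and 'Code' is a stand-in for the package's internal table of valid domain codes):

```r
# Stand-in for the package's internal table of valid domain codes
Code <- data.frame(Code = seq_len(8))

# The fix passes the largest valid code through as the new domain_code_max
# argument, e.g.:
#   user_categorisation(selectTable_df$Label[datavar],
#                       selectTable_df$Description[datavar],
#                       ...,               # other arguments unchanged
#                       max(Code$Code))    # previously missing -> the error
max(Code$Code)  # evaluates to 8 here - the upper bound handed to the prompt
```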

To answer your question 'I'm not sure why you need all four files as opposed to just having the two classification files as you are comparing the output - was there a reason for this?'

Finally, I changed the function name to compare_sessions

What do you think? Can you run it again with the same inputs please and see if it no longer errors? Thanks!

Rainiefantasy commented 3 weeks ago

> I wondered if it was necessary as well, but actually both Dan and Nida suggested this feature would be useful. I think when I am in 'test mode' I find the number of prompts given to me as the user annoying XD, but that's because I am doing it quickly. Going through it properly with a real use-case, I can see the extra functionality to correct things can be good, so I'll leave this functionality in for now.

Just spotted this - that's no problem, it's mostly a preference thing so I guess it doesn't matter πŸ˜† But I agree, it's probably because of the testing that you notice it more!

RayStick commented 3 weeks ago

Thanks for reviewing this long PR @Rainiefantasy - appreciate it!