aim-rsf / browseMetadata

An R package to help a researcher browse metadata for health datasets and categorise variables based on research domains
https://aim-rsf.github.io/browseMetadata/
GNU General Public License v3.0

Re-work the csv output logs & validate user inputs #104

Closed · RayStick closed this 3 weeks ago

RayStick commented 1 month ago

❗ Sorry, this is a big PR ... not best practice ...

Closes #102

Closes #91

Closes #97

Closes #82


Rainiefantasy commented 3 weeks ago

> Package version number has now been included in the csv output logs. This could be important to track which version number created the csv outputs, particularly if the compare_csv_outputs command is being used to compare outputs created from different package versions.

Is this the browseMetadata column? I can see v1.0.1, so that's great 😸!

> When doing this, I realised it made sense to split the csv output log into two files, not one - one file is about the session itself (LOG) and the other csv file contains the actual categorisations made by the user (OUTPUT). Previously there was a lot of repetition in the OUTPUT csv file. They can be linked via the timestamp column.

That's fair enough - the timestamp is in the filename as well, so it makes sense to use it as an identifier πŸ‘ Can see both files, looks good!
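
For anyone reading along, here is a minimal R sketch (not code from the package) of how the two files can be linked on the shared timestamp column; the file names follow the examples later in this thread, and the exact column name is an assumption:

```r
# Read the session log and the categorisations made in that session
log_df    <- read.csv("LOG_NationalCommunityChildHealthDatabase(NCCHD)_BLOOD_TEST_2024-07-05_16-07-38.599493.csv")
output_df <- read.csv("OUTPUT_NationalCommunityChildHealthDatabase(NCCHD)_BLOOD_TEST_2024-07-05_16-07-38.599493.csv")

# Attach the session-level info (e.g. package version) to every
# categorisation row via the shared timestamp column
merged_df <- merge(output_df, log_df, by = "timestamp")
```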

> Because there are now two csv outputs, the input arguments for compare_csv_outputs had to be adjusted. See the README, which reflects these changes to the code.

When trying to run the compare function:

```r
compare_csv_outputs(
  'LOG_NationalCommunityChildHealthDatabase(NCCHD)_BLOOD_TEST_2024-07-05_16-07-38.599493.csv',
  'OUTPUT_NationalCommunityChildHealthDatabase(NCCHD)_BLOOD_TEST_2024-07-05_16-07-38.599493.csv',
  'LOG_NationalCommunityChildHealthDatabase(NCCHD)_BLOOD_TEST_2024-07-08_12-03-30.429336.csv',
  'OUTPUT_NationalCommunityChildHealthDatabase(NCCHD)_BLOOD_TEST_2024-07-08_12-03-30.429336.csv',
  'Documents/GIT/browse-SAIL/data-raw/national_community_child_health_database_(ncchd)_20240405T130125.json'
)
```

I get a prompt for a mismatch and to 'β„Ή Provide concensus decision for this DataElement:', but when I enter a number (one of the two domain classifications) it errors with:

```
Error in user_categorisation(selectTable_df$Label[datavar], selectTable_df$Description[datavar], :
  argument "domain_code_max" is missing, with no default
```

Not sure if I'm doing something wrong here? Other than that, I'm not sure why you need all four files as opposed to just having the two classification files as you are comparing the output - was there a reason for this?

Rainiefantasy commented 3 weeks ago

> See line 143 of domain_mapping, which allows the user to add an optional free-text note about this table. They always have the option to skip a table and not process it as well.

This looks good to me! ✨
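
As a side note, here is a tiny sketch of the kind of optional note prompt being described (hypothetical, not the code at line 143 itself):

```r
# Offer an optional free-text note about the table; an empty answer skips it
table_note <- readline(prompt = "Optional note about this table (press Enter to skip): ")
if (nzchar(table_note)) {
  message("Note recorded: ", table_note)
}
```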

> From working on compare_csv_outputs, I don't see an issue with two output files having the same initials, and therefore think this issue can be closed - let me know if you spot anything.

Sounds good! I agree - means the function can be used as a sanity check for a user as well :)

> This commit https://github.com/aim-rsf/browseMetadata/commit/0dc8c36eb39748bfe943dc3a8320b1a36c3b7b45 made changes to the user_categorisation function to validate and standardise user inputs. Smaller changes were made to domain_mapping, as there is now a 4th argument for user_categorisation.

Only looked briefly at the code here, but it looks OK at a glance πŸ‘
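
For readers skimming the commit, here is a rough sketch of the validation pattern it describes (a simplified stand-in, not the package's actual user_categorisation code), where domain_code_max is the new 4th argument:

```r
# Re-prompt until the user enters a whole number between 1 and domain_code_max
prompt_domain_code <- function(domain_code_max) {
  repeat {
    raw <- trimws(readline(prompt = paste0("Enter domain code [1-", domain_code_max, "]: ")))
    if (grepl("^[0-9]+$", raw)) {
      code <- as.integer(raw)
      if (code >= 1 && code <= domain_code_max) {
        return(code)
      }
    }
    message("Please enter a number between 1 and ", domain_code_max)
  }
}
```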

> DESCRIPTION file version number brought up to date
> Fixed a bug in compare_csv_outputs and validated that all the checks work

I seem to still have issues with that function; I can double-check I've got the latest version.

Rainiefantasy commented 3 weeks ago

Another minor piece of feedback - I may have suggested it'd be good to have a choice to amend classifications in case of a mistake, but the more I'm using it, the more I'm wondering if it's necessary after every classification. Personally, I think having just one prompt at the end to check your classifications, and amend them if you're not happy, is enough. See what you think though - this is just my experience of using it 😸

RayStick commented 3 weeks ago

> Another minor piece of feedback - I may have suggested it'd be good to have a choice to amend classifications in case of a mistake, but the more I'm using it, the more I'm wondering if it's necessary after every classification. Personally, I think having just one prompt at the end to check your classifications, and amend them if you're not happy, is enough. See what you think though - this is just my experience of using it 😸

I wondered if it was necessary as well, but actually both Dan and Nida suggested this feature would be useful. I think when I am in 'test mode' I find the number of prompts given to me as the user annoying XD, but that's because I am doing it quickly. Going through it properly with a real use-case, I can see the extra functionality to correct things can be good, so I'll leave this functionality in for now.
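
For comparison, here is a hypothetical sketch of the 'one prompt at the end' alternative discussed above (not what the package does - it currently offers an amend prompt after each classification):

```r
# Show all categorisations once, then offer a single chance to amend
review_categorisations <- function(categorisations_df) {
  print(categorisations_df)
  answer <- readline(prompt = "Enter a row number to amend, or press Enter to accept all: ")
  if (nzchar(answer)) as.integer(answer) else NULL
}
```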

RayStick commented 3 weeks ago

@Rainiefantasy your errors with the compare_csv_outputs.R function were due to me not updating the call to the user_categorisation function. See line 134 of https://github.com/aim-rsf/browseMetadata/pull/104/commits/9cad2f6a43126662ec250356478b5d25bbdfb2db where I have now added 'max(Code$Code)' as an input argument, which solved the error you got.
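
For the thread, a hedged sketch of the shape of that fix (the exact call is at line 134 of the linked commit; the middle arguments are elided here because they are not shown in this thread, and 'Code' is a stand-in for the package's internal table of valid domain codes):

```r
# Stand-in for the package's internal table of valid domain codes
Code <- data.frame(Code = seq_len(8))

# The fix passes the largest valid code through as the new domain_code_max
# argument, e.g.:
#   user_categorisation(selectTable_df$Label[datavar],
#                       selectTable_df$Description[datavar],
#                       ...,               # other arguments unchanged
#                       max(Code$Code))    # previously missing -> the error
max(Code$Code)  # evaluates to 8 here - the upper bound handed to the prompt
```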

To answer your question 'I'm not sure why you need all four files as opposed to just having the two classification files as you are comparing the output - was there a reason for this?'

Finally, I changed the function name to compare_sessions

What do you think? Can you run it again with the same inputs please and see if it no longer errors? Thanks!

Rainiefantasy commented 3 weeks ago

> I wondered if it was necessary as well, but actually both Dan and Nida suggested this feature would be useful. I think when I am in 'test mode' I find the number of prompts given to me as the user annoying XD, but that's because I am doing it quickly. Going through it properly with a real use-case, I can see the extra functionality to correct things can be good, so I'll leave this functionality in for now.

Just spotted this - that's no problem, it's mostly a preference thing so I guess it doesn't matter πŸ˜† But I agree, it's probably because of the testing that you notice it more!

RayStick commented 3 weeks ago

Thanks for reviewing this long PR @Rainiefantasy - appreciate it!