ESHackathon / CiteSource

http://www.eshackathon.org/CiteSource/
GNU General Public License v3.0
16 stars 1 forks source link

Align argument name for citation tibble across functions #190

Closed LukasWallrich closed 1 week ago

LukasWallrich commented 1 month ago

Currently, in some functions, the argument taking the citations tibble is called unique_citations (e.g., compare_sources()) while is is simply named citations in others (e.g., citation_summary_table()). In order to have a clean user interface, we should probably align this.

Given that the entire package works on deduplicated citations, I would personally suggest to call it citations (and document that the argument refers to deduplicated citations) - but happy to call it unique_citations as well. Usually, it won't be named, so it does not make a huge difference - but should still be cleaned up before release as changes to argument names can break existing code if users name the arguments ...

@TNRiley (and anyone else) - which argument name would you prefer? Also, do we have other recurring arguments across functions with different names? I currently don't think so, but that may also be worth checking.

TNRiley commented 1 month ago

I put together a script to pull all the functions and arguments so that we can see them. We have a mix it seems between x, df, citations, unique_citations, data, unique_data, ....

I think cleaning things up is a good idea, but I'm not sure what the best practices are here.

Functions like record_counts and calculate_phase_counts take the output of the dedup_citations (which is normally given the element name "unique_citations") and compare it to the original citations data from read_citations (which is normally given the element name "citations". Obviously, the outputs could be named anything and it might be better to give these functions more flexibility, though I don't see these table functions and others being utilized outside of CiteSource.

For me, Citations and unique_citations give me an understanding of the data being fed into a function. I like to see the named data inputs as it helps me navigate and keep a clear understanding of the inner workings, but I understand that it limits the function by requiring the data to be named accordingly which could be problematic if the functions are used outside of CiteSource.

Folder Function Arguments
R/CiteSource.R .onLoad libname, pkgname
R/compare.R compare_sources unique_data, comp_type, include_references
R/compare.R count_unique unique_data, include_references
R/count.R calculate_phase_count unique_citations, citations, db_colname
R/count.R calculate_record_counts unique_citations, citations, n_unique, db_colname
R/count.R count_sources df, db_colname
R/count.R record_counts unique_citations, citations, db_colname
R/dedup.R add_cols raw_citations, cname
R/dedup.R dedup_citations raw_citations, manual, show_unknown_tags
R/dedup.R dedup_citations_add_manual unique_citations, additional_pairs
R/export.R export_bib citations, filename, include
R/export.R export_csv citations, filename, separate, trim_abstracts
R/export.R export_ris citations, filename, source_field, label_field, string_field
R/helpers.R ui_yeah x, yes, no, n_yes, n_no, shuffle, .envir
R/import.R read_citations files, cite_sources, cite_strings, cite_labels, metadata, verbose, tag_naming, only_key_fields
R/import_export_helpers.R as.bibliography x, ...
R/import_export_helpers.R as.data.frame.bibliography x, ...
R/import_export_helpers.R clean_authors x
R/import_export_helpers.R clean_colnames x
R/import_export_helpers.R clean_df data
R/import_export_helpers.R detect_delimiter x
R/import_export_helpers.R detect_lookup tags
R/import_export_helpers.R detect_parser x
R/import_export_helpers.R detect_year df
R/import_export_helpers.R guess_year x
R/import_export_helpers.R match_columns df
R/import_export_helpers.R merge_columns x, y
R/import_export_helpers.R parse_bibtex x
R/import_export_helpers.R parse_csv x
R/import_export_helpers.R parse_pubmed x
R/import_export_helpers.R parse_ris x, tag_naming
R/import_export_helpers.R parse_tsv x
R/import_export_helpers.R prep_ris z, delimiter, type
R/import_export_helpers.R proportion_delimited x, regex
R/import_export_helpers.R read_ref filename, tag_naming, return_df, verbose, select_fields
R/import_export_helpers.R remove_factors z
R/import_export_helpers.R rollingsum a, n
R/import_export_helpers.R synthesisr_read_refs filename, tag_naming, return_df, verbose, select_fields
R/import_export_helpers.R write_bib x
R/import_export_helpers.R write_refs x, format, tag_naming, file
R/import_export_helpers.R write_ris x, tag_naming
R/plots.R plot_contributions data, facets, bars, color, center, bar_order, facet_order, color_order, totals_in_legend
R/plots.R plot_source_overlap_heatmap data, cells, facets, plot_type, sort_sources, interactive
R/plots.R plot_source_overlap_upset data, groups, nsets, sets.x.label, mainbar.y.label, order.by, ...
R/reimport.R reimport_csv filename
R/reimport.R reimport_ris filename, source_field, label_field, string_field, tag_naming, verbose
R/runShiny.R runShiny app, offer_install
R/tables.R citation_summary_table citations, comparison_type, search_label, screening_label, top_n
R/tables.R generate_apa_citation authors, year
R/tables.R generate_apa_reference authors, year, title, source, volume, issue, doi, weblink, return_html, format_journal_case
R/tables.R precision_sensitivity_table data
R/tables.R record_counts_table data
R/tables.R record_level_table citations, include, include_empty, return, indicator_presence, indicator_absence
R/tables.R record_summary_table data>
LukasWallrich commented 1 month ago

This is super-helpful, thank you! Could you put that function into a gist? I need to do something similar for a different package, and this table is great.

Given that we do keep on using the full citations beyond the dedup stage, I like your suggestion to use different names for the original and deduped version. Should we maybe go for raw_citations and unique_citations so that the difference is very visible?

To be clear: My comment on the use of functions was not about named vs unnamed arguments, but about changes to argument names. People might want to include reproducible CiteSource analyses in their analysis code, and if they decided to call citation_summary_table(citations = my_citations) that code would no longer work when we change to unique_citations - so we should only change this once.

TNRiley commented 1 month ago

@LukasWallrich Here is that gist

I'm not 100% sure I understand the argument name part here. I think changes to names could be helpful, but I do have a ton of scripts for projects that would need to be changed. I agree that some of our function names are a bit odd. We should discuss this next time we tag up. I'm not sure if we're too far down the line to change function names at this point.

I've also found that within the unique_citations the deuplicate_id and record_ids & duplicate_id are confusing. deuplicate_id is really a unique identifier and the record_ids is just a list of the duplicate_id where overlap occurred.

I'm not experienced enough to know how much work would need to go into changing any of this at this point. The other option is just really good documentation?

LukasWallrich commented 1 month ago

I'd say it's not yet too late to change argument names (there, we could also keep the old names as deprecated aliases that continue to work for a while, but give messages asking users to change to the new version) - it might be too late to change function names but we can always add aliases in special cases (for instance, runShiny() did not fit our snake_case style, so that I added run_shiny() as an alternative.) We should also reconsider which functions need to be exported, and where we might want to add a meta-function that combines a few steps (particularly for some of the tables, where it might make sense to have a function that takes unique_citations and produces the table, and takes care of the intermediate summary steps internally.

duplicate_id and record_ids are ASySD fields, and it might be more difficult to change them - though I agree that duplicate_id could be better named, so that is something for @kaitlynhair to decide. Of course, we could change them afterwards within CiteSource, but should then do that in a way that is very different from ASySD to avoid confusion.

I think I made a minor change in my suggestion to ASySD that means that record_ids are concatenations of all the original record_id fields, while they got sometimes mixed with (former) duplicate_ids during manual deduplication in the previous code - so they would never be derived from duplicate_ids per se, and should not be explained as such ...

TNRiley commented 1 month ago

This makes sense to me now, thanks Lukas. I agree with you that it's not too late to change argument names. I've gone through the functions and created a list where the functions rely on the deduplicated data and what the arguments are named.

Here are the various argument names we use in functions where the _result of dedupciations are called. IF we change argument names these should all be unique_citations

compare_sources (unique_data) count_unique (unique_data) count_sources (df) Not exported - used in record_counts and calculate_record_counts and calls both unique_ciations and citations. record_counts (unique_citations) calculate_record_counts (unique_citations) calculate_phase_count (unique_citations) dedup_citations_add_manual (unique_citations) citation_summary_table (citations) this is definitely in need of change! record_level_table (citations) this is definitely in need of change! also see #197 on the addition of filtering arguments

export_bib (citations) this is definitely in need of change! export_ris (citations) this is definitely in need of change! esport_csv (citations) this is definitely in need of change!

TNRiley commented 1 month ago

Here are the various argument names we use in functions where the raw citation data is called (currently called citations) IF we change these argument names these should all be raw_citations

record_counts (citations) calculate_record_counts (citations) calculate_phase_count (citations) dedup_citations (raw_citations) add_cols (raw_citations) - not exported used in dedup_citations

TNRiley commented 1 month ago

Another item related to this. Both the reimport_csv and reimport_ris explicitly name the imported data "citations". IF we go ahead with the renaming, I think that these should be changed to name the reimported data unique_citations.

LukasWallrich commented 1 month ago

This is purely internal - we can rename it, but the user will never see that. In the example, we already assign the result to unique_citations.

TNRiley commented 1 week ago

I'm pretty sure I've got everything renamed. they are now raw_citations and unique_citations.