camaradesuk / ASySD

https://camaradesuk.github.io/ASySD/
GNU General Public License v3.0
12 stars 5 forks source link

Adjust min_id/max_id calculation to support non-numeric record_ids? #28

Closed LukasWallrich closed 1 month ago

LukasWallrich commented 1 year ago

I appreciate the support for non-numeric record IDs - yet currently, that leads to warnings and possibly subsequent errors here. Could that be sort(c(duplicate_id.x, duplicate_id.y))[1] and then below sort(c(duplicate_id.x, duplicate_id.y))[2]? Then it would work for numbers and strings?

tibble(
  title = c(rep(LETTERS[1:3], 2), LETTERS[7:9]), author = c(rep(LETTERS[1:3], 2), LETTERS[7:9]),
  journal = c(rep(LETTERS[1:3], 2), LETTERS[7:9]),
  year = c(rep(2010:2012, 2), 2017:2019), source = c(rep("aa", 5), rep("bb", 4)),
  record_id = letters[1:9], label = ""
) %>% ASySD::dedup_citations()
#> Warning: The following columns are missing: doi, pages, volume, number, abstract, isbn
#> Are you sure you want to proceed? 
#> 
#> 1: Yes
#> 2: No
#> formatting data...
#> identifying potential duplicates...
#> identified duplicates!
#> Joining with `by = join_by(record_id)`
#> flagging potential pairs for manual dedup...
#> Warning: There were 2 warnings in `mutate()`.
#> The first warning was:
#> ℹ In argument: `min_id = min(duplicate_id.x, duplicate_id.y)`.
#> Caused by warning in `min()`:
#> ! no non-missing arguments, returning NA
#> ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
#> Joining with `by = join_by(duplicate_id.x, duplicate_id.y)`
#> 9 citations loaded...
#> 3 duplicate citations removed...
#> 6 unique citations remaining!
#> $unique
#> # A tibble: 6 × 14
#>   duplicate_id record_ids title author journal year  source label   doi   pages
#>   <chr>        <chr>      <chr> <chr>  <chr>   <chr> <chr>  <chr>   <chr> <chr>
#> 1 a            a, d       A     A      A       2010  aa, aa unknow… ""    ""   
#> 2 b            b, e       B     B      B       2011  aa, aa unknow… ""    ""   
#> 3 c            c, f       C     C      C       2012  aa, bb unknow… ""    ""   
#> 4 g            g          G     G      G       2017  bb     unknown ""    ""   
#> 5 h            h          H     H      H       2018  bb     unknown ""    ""   
#> 6 i            i          I     I      I       2019  bb     unknown ""    ""   
#> # ℹ 4 more variables: volume <chr>, number <chr>, abstract <chr>, isbn <chr>
#> 
#> $manual_dedup
#> # A tibble: 0 × 41
#> # ℹ 41 variables: author1 <chr>, author2 <chr>, author <dbl>, title1 <chr>,
#> #   title2 <chr>, title <dbl>, abstract1 <lgl>, abstract2 <lgl>,
#> #   abstract <dbl>, year1 <int>, year2 <int>, year <dbl>, number1 <lgl>,
#> #   number2 <lgl>, number <dbl>, pages1 <lgl>, pages2 <lgl>, pages <dbl>,
#> #   volume1 <lgl>, volume2 <lgl>, volume <dbl>, journal1 <chr>,
#> #   journal2 <chr>, journal <dbl>, isbn <dbl>, isbn1 <lgl>, isbn2 <lgl>,
#> #   doi1 <lgl>, doi2 <lgl>, doi <dbl>, record_id1 <chr>, record_id2 <chr>, …
kaitlynhair commented 10 months ago

I can't replicate this issue - can you test this again and let me know if it still has warnings? I believe min/max should work for strings / letters as well as numeric values.

LukasWallrich commented 10 months ago

Odd ... if I run this code, I still get the warning (after installing the current version). However, you are right that it is not about min/max - and not about non-numeric IDs ... instead, it always arises with this test data:

> tibble(
+     title = c(rep(LETTERS[1:3], 2), LETTERS[7:9]), author = c(rep(LETTERS[1:3], 2), LETTERS[7:9]),
+     journal = c(rep(LETTERS[1:3], 2), LETTERS[7:9]),
+     year = c(rep(2010:2012, 2), 2017:2019), source = c(rep("aa", 5), rep("bb", 4)),
+     record_id = 1:9, label = ""
+ ) %>% ASySD::dedup_citations()
Warning: The following columns are missing: doi, pages, volume, number, abstract, isbn
Are you sure you want to proceed? 

1: Yes
2: No

Selection: 1
formatting data...
identifying potential duplicates...
identified duplicates!
Joining with `by = join_by(record_id)`
flagging potential pairs for manual dedup...
Joining with `by = join_by(duplicate_id.x, duplicate_id.y)`
9 citations loaded...
3 duplicate citations removed...
6 unique citations remaining!
$unique
# A tibble: 6 × 14
  duplicate_id record_ids title author journal year  source label            doi   pages volume number abstract isbn 
  <chr>        <chr>      <chr> <chr>  <chr>   <chr> <chr>  <chr>            <chr> <chr> <chr>  <chr>  <chr>    <chr>
1 1            1, 4       A     A      A       2010  aa, aa unknown, unknown ""    ""    ""     ""     ""       ""   
2 2            2, 5       B     B      B       2011  aa, aa unknown, unknown ""    ""    ""     ""     ""       ""   
3 3            3, 6       C     C      C       2012  aa, bb unknown, unknown ""    ""    ""     ""     ""       ""   
4 7            7          G     G      G       2017  bb     unknown          ""    ""    ""     ""     ""       ""   
5 8            8          H     H      H       2018  bb     unknown          ""    ""    ""     ""     ""       ""   
6 9            9          I     I      I       2019  bb     unknown          ""    ""    ""     ""     ""       ""   

$manual_dedup
# A tibble: 0 × 41
# ℹ 41 variables: author1 <chr>, author2 <chr>, author <dbl>, title1 <chr>, title2 <chr>, title <dbl>, abstract1 <lgl>, abstract2 <lgl>,
#   abstract <dbl>, year1 <int>, year2 <int>, year <dbl>, number1 <lgl>, number2 <lgl>, number <dbl>, pages1 <lgl>, pages2 <lgl>, pages <dbl>,
#   volume1 <lgl>, volume2 <lgl>, volume <dbl>, journal1 <chr>, journal2 <chr>, journal <dbl>, isbn <dbl>, isbn1 <lgl>, isbn2 <lgl>, doi1 <lgl>,
#   doi2 <lgl>, doi <dbl>, record_id1 <chr>, record_id2 <chr>, label1 <chr>, label2 <chr>, source1 <chr>, source2 <chr>, duplicate_id.x <chr>,
#   duplicate_id.y <chr>, match <lgl>, min_id <chr>, max_id <chr>

Warning message:
There were 2 warnings in `mutate()`.
The first warning was:
ℹ In argument: `min_id = min(duplicate_id.x, duplicate_id.y)`.
Caused by warning in `min()`:
! no non-missing arguments, returning NA
ℹ Run dplyr::last_dplyr_warnings() to see the 1 remaining warning. 

I cannot provide a proper reprex because of Error in menu(c("Yes", "No"), title = paste("Are you sure you want to proceed?")): menu() cannot be used non-interactively - could that be changed to allow for automated use? E.g., just by saying if (interactive()) menu() else warning()?

LukasWallrich commented 10 months ago

This happens because maybe_pairs is empty - so it is the same as

tibble(a = character(0)) %>% mutate(min_a = min(a))

So you should probably check whether there are any maybe_pairs and exit the function prematurely if there aren't?

kaitlynhair commented 2 months ago

I think this has now been resolved - it now returns without errors for me.

To allow automated use, the user_input argument set to 1 allows you to bypass the menu - perhaps not the most obvious feature so might need to rename this.

LukasWallrich commented 1 month ago

Just checked and this is now solved - thanks