EOGrady21 / vprr

Video Plankton Recorder Data Processing
https://eogrady21.github.io/vprr/

vpr_autoid_create - generalize to avoid confusion among category names #38

Closed kevinsorochan closed 11 months ago

kevinsorochan commented 1 year ago

Is your feature request related to a problem? Please describe.

Currently, vpr_autoid_create() will produce an error if one category name is contained in another, for example "other" and "other_copepod".

Describe the solution you'd like

Currently, the solution is to look for the pattern "other", which causes the problem. A more general solution would be preferred, perhaps selecting an index among the names rather than searching for a pattern.
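To illustrate the difference between pattern searching and selecting by exact name, here is a minimal sketch. It assumes the category names are available as a plain character vector; the `categories` vector and its values are hypothetical and not taken from vprr.

```r
# Hypothetical category names (assumption; real names come from the user's data).
categories <- c("other", "other_copepod", "Calanus")
taxa <- "other"

# Pattern search: "other" is also a substring of "other_copepod",
# so more than one index comes back.
pattern_hits <- grep(pattern = taxa, x = categories)

# Exact comparison: only the element equal to `taxa` is selected.
exact_hit <- which(categories == taxa)

print(pattern_hits)
print(exact_hit)
```

With exact comparison, the index is unambiguous no matter how similar the other names are.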

Describe alternatives you've considered

See above

kevinsorochan commented 1 year ago

The same problem occurs later in the function:

recl <- grep(reclassify, pattern = taxa)

kevinsorochan commented 1 year ago

After some consideration, I think the best way to deal with this problem could be to count characters. Only the shortest category name will be confused (e.g., "other" will be confused with "other_copepod", but not vice versa). So one solution is to find the element with the shortest character count and choose that one.

reclassify_taxa <- grep(reclassify, pattern = taxa, value = TRUE)

# If more than one element matches, keep the one with the fewest characters
if (length(reclassify_taxa) > 1) {

    confused_recl <- reclassify_taxa
    nchar_cr <- nchar(confused_recl)
    min_nchar <- min(nchar_cr)
    recl_idx <- which(nchar_cr == min_nchar)
    reclassify_taxa <- reclassify_taxa[recl_idx]

}

EOGrady21 commented 1 year ago

This hardcoding problem actually occurs within vpr_category(). I agree it should be updated.

A short-term solution would be to update how the regular expression pulls the taxa names in vpr_category(). Right now it just searches for the taxa name (m_tmp <- gregexpr(taxa_id, x)), but it could be made more specific by requiring the match to run from the start (^) to the end ($) of the string. This would remove the duplication issue you are experiencing.
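As a rough sketch of the anchoring idea, the example below compares an unanchored and an anchored pattern. It assumes the strings being searched are bare category names. If `x` is actually a file path, the anchors would need to be replaced by whatever delimiters surround the name in the path. The `categories` vector is hypothetical, and `taxa_id` is assumed to contain no regular expression metacharacters.

```r
# Hypothetical category names (assumption; not taken from vprr).
categories <- c("other", "other_copepod")
taxa_id <- "other"

# Unanchored: matches any string that contains "other".
unanchored <- grepl(taxa_id, categories)

# Anchored: the whole string must equal "other".
anchored <- grepl(paste0("^", taxa_id, "$"), categories)

print(unanchored)
print(anchored)
```

Only the anchored pattern separates "other" from "other_copepod".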

In the longer term, this vpr_category() function should be reworked to avoid the hardcoding.

kevinsorochan commented 1 year ago

The problem that I am specifying here does not have to do with the hardcoding in vpr_category() explicitly (issue #37). Perhaps the fix for this issue can be made in vpr_category() though.

EOGrady21 commented 11 months ago

This bug fix should be tested on a barebones dataset if possible. I will leave the issue open until then.

EOGrady21 commented 11 months ago

Updated and tested