DataONEorg / rdataone

R package for reading and writing data at DataONE data repositories
http://doi.org/10.5063/F1M61H5X

Duplicates in listObjects #296

Closed ThomasThelen closed 2 years ago

ThomasThelen commented 2 years ago

I may have run into a bug while searching for EML documents with listObjects. I'm noticing a large number (500+) of duplicates in the search results.

To Reproduce: Start by running...

cn <- CNode("PROD")
objects <- listObjects(cn)

Then, inspect the objects variable in the Environment tab. Next, open a few of the objects and you should see that some are repeated. Finally, scroll to the bottom and open the last object; in my case it matches the objects found in the previous step (so this object appears to be repeated at least 1,000 times).

They can be diffed by either resolving the objects (they should both end up in the same place) or by copying an object, pasting it in a text editor, and then copying a second object and using ctrl+f to confirm that they're the same (the first object's text should light up).
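
The manual comparison described above can also be done in the console, which avoids relying on how the Environment tab renders the list. This is only a minimal sketch: it assumes each entry is a plain list with an `identifier` field, and the mock `objects` list with its made-up identifiers stands in for the real result of `listObjects()`.

```r
# Mock stand-in for the result of listObjects(); the identifiers are invented.
objects <- list(
  list(identifier = "example-id-1"),
  list(identifier = "example-id-2"),
  list(identifier = "example-id-1")
)

first_entry <- objects[[1]]
last_entry <- objects[[length(objects)]]

# identical() is TRUE only when both entries have exactly the same content.
same_entry <- identical(first_entry, last_entry)
print(same_entry)
```

With real results, replacing the mock list with the output of `listObjects(cn)` should make the same comparison possible.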

Start Behavior: By offsetting the results, it's possible to find objects that differ from each other (though I still see at least 1,000 duplicates of each).

To Reproduce:

Start by running

cn <- CNode("PROD")
format <- "eml://ecoinformatics.org/eml-2.0.1"
objects <- listObjects(cn, formatId=format)

Take note of the first object's identifier.

Then run

cn <- CNode("PROD")
format <- "eml://ecoinformatics.org/eml-2.0.1"
objects <- listObjects(cn, formatId=format, start=1000)

Open one of the objects, and its identifier should be different from the first one.
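
To compare the two result pages without opening objects by hand, the identifiers from each call could be intersected. The sketch below is an assumption-laden illustration: `page_overlap` is a hypothetical helper, the two mock pages use invented identifiers, and each entry is assumed to expose an `identifier` field that can be coerced to a string.

```r
# Hypothetical helper: identifiers that appear in both result pages.
page_overlap <- function(page_a, page_b) {
  ids_a <- vapply(page_a, function(o) as.character(o$identifier), character(1))
  ids_b <- vapply(page_b, function(o) as.character(o$identifier), character(1))
  intersect(ids_a, ids_b)
}

# Mock pages standing in for listObjects(cn, formatId=format) and
# listObjects(cn, formatId=format, start=1000); identifiers are invented.
page_0 <- list(list(identifier = "id-1"), list(identifier = "id-2"))
page_1000 <- list(list(identifier = "id-1001"), list(identifier = "id-1002"))

shared <- page_overlap(page_0, page_1000)
print(length(shared))
```

If the offset is respected, the two pages should share no identifiers.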

Screenshot: The screenshot below shows that there are duplicates of doi:10.6085/AA/ICMDXX_XXXITV2XMSR01_20170101.50.1

Screen Shot 2022-04-28 at 6 05 07 PM

My original intent was to use the following to find particular versions of EML documents. In this case, the formatId is still respected, but the same EML document will be shown multiple times.

cn <- CNode("PROD")
format <- "eml://ecoinformatics.org/eml-2.0.1"
objects <- listObjects(cn, formatId=format)

amoeba commented 2 years ago

I can reproduce this but I don't think it's a bug in rdataone or the DataONE API. I think it's this bug in the RStudio Viewer Pane.

Can you try checking your results in another way and also checking the direct call to the API? Here's an example of how to do that:

cn <- CNode("PROD")
objs_withformat <- listObjects(cn, formatId="eml://ecoinformatics.org/eml-2.0.1")
ids <- lapply(objs_withformat, function(o) { tryCatch(o$identifier, error = function(e) { e }) })
any(duplicated(ids)) # Should return FALSE
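
If that check ever did return TRUE, it might help to see which identifiers are repeated rather than only whether any are. A rough sketch, assuming the same `identifier` field as above; `duplicated_ids` is a hypothetical helper and the mock list uses invented identifiers in place of real `listObjects()` output.

```r
# Hypothetical helper: return each identifier that occurs more than once.
duplicated_ids <- function(objects) {
  ids <- vapply(objects, function(o) as.character(o$identifier), character(1))
  unique(ids[duplicated(ids)])
}

# Mock result list; "a" is deliberately repeated.
mock_objects <- list(
  list(identifier = "a"),
  list(identifier = "b"),
  list(identifier = "a")
)

repeated <- duplicated_ids(mock_objects)
print(repeated)
```

An empty result would be consistent with the duplicates being a display issue rather than a problem in the returned data.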

ThomasThelen commented 2 years ago

Aha! I went through your testing step and it looks like it's the RStudio bug. Thanks! I'll let you decide the fate of this issue.