EDIorg / ECC

ECC = EML Congruence Checker

Possible new check: duplicate entity #10

Open mobb opened 6 years ago

mobb commented 6 years ago

I bumped into 2 datasets that appear identical; it seems that only the packageIds are different.
https://portal.edirepository.org/nis/metadataviewer?packageid=knb-lter-jrn.2100351001.45
https://portal.edirepository.org/nis/metadataviewer?packageid=knb-lter-jrn.210351001.47

Same entity, same md5 hash. Possibly, rev. 47 was intended to be a metadata-only update of rev. 45. The diff is below. I don't know if there is any way to trap this; possibly by checking the md5 hash regardless, and alerting the user that "hey, this entity is already associated with a packageId, do you really want to upload it again?" But maybe that is too much hand-holding. I contacted the site IM and let him know.

Here is the diff:

bash-3.2$ diff  jrn_21000351001_45.xml jrn_210351001_47.xml 
2c2
< <eml:eml packageId="knb-lter-jrn.2100351001.45"
---
> <eml:eml packageId="knb-lter-jrn.210351001.47"
21c21
<     <alternateIdentifier system="https://doi.org">doi:10.6073/pasta/436280872d7e64504a2fa704190afd12</alternateIdentifier>
---
>     <alternateIdentifier system="https://doi.org">doi:10.6073/pasta/b5eb273946c5ca712f841c61ede7eef4</alternateIdentifier>
49a50,53
>       <keyword>LTAR</keyword>
>       <keywordThesaurus>Research Networks</keywordThesaurus>
>     </keywordSet>
>     <keywordSet>
64c68
<       <keywordThesaurus>LTER VI Proposal Category</keywordThesaurus>
---
>       <keywordThesaurus>Research Area</keywordThesaurus>
81c85
<         <url function="information">https://lter.jornada.nmsu.edu/content/jornada-experimental-range-permanent-quadrat-chart-data-beginning-1915-plant-cover</url>
---
>         <url function="information">https://jornada.nmsu.edu/content/jornada-experimental-range-permanent-quadrat-chart-data-beginning-1915-plant-cover</url>
141c145
<     <pubPlace>Jornada Basin LTER</pubPlace>
---
>     <pubPlace>Jornada</pubPlace>
208c212
<             <url>https://pasta.lternet.edu/package/data/eml/knb-lter-jrn/2100351001/45/10c1bf759b5700581368e64387c2a347</url>
---
>             <url>https://pasta.lternet.edu/package/data/eml/knb-lter-jrn/210351001/47/10c1bf759b5700581368e64387c2a347</url>
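
The md5 check suggested above could look roughly like the following. This is only a sketch under stated assumptions: the `index` mapping (entity checksum to the packageIds that already hold that entity) and the function names are hypothetical and are not part of ECC or PASTA. It ignores hits within the same scope.identifier, since reusing an unchanged table across revisions is expected.

```python
import hashlib
from typing import Dict, List, Set


def md5_of_bytes(data: bytes) -> str:
    """Return the hex md5 digest of an entity's bytes."""
    return hashlib.md5(data).hexdigest()


def package_series(package_id: str) -> str:
    """Drop the revision: 'scope.identifier.revision' -> 'scope.identifier'."""
    scope, identifier, _revision = package_id.rsplit(".", 2)
    return f"{scope}.{identifier}"


def find_duplicate_entity(
    checksum: str, package_id: str, index: Dict[str, Set[str]]
) -> List[str]:
    """List packageIds in OTHER series that already hold this checksum.

    `index` is a hypothetical lookup (checksum -> packageIds); how it would
    be built from the repository is not specified in this thread.
    """
    own = package_series(package_id)
    return sorted(
        other for other in index.get(checksum, set())
        if package_series(other) != own
    )


# Illustrative data only; the entity bytes are made up.
checksum = md5_of_bytes(b"quadrat,year,cover\n")
index = {checksum: {"knb-lter-jrn.2100351001.45"}}
hits = find_duplicate_entity(checksum, "knb-lter-jrn.210351001.47", index)
if hits:
    print(f"This entity is already associated with: {', '.join(hits)}")
```

An upload to the same identifier (for example a hypothetical rev. 46 of 2100351001) would produce no message, which matches the metadata-only update case.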

cgries commented 6 years ago

Hmm, but isn't this happening frequently? That is why Duane implemented the option to reuse the same tables when the MD5 hasn't changed. Why are you even looking at the older version?

mobb commented 6 years ago

The reason I'm seeing both of these is that they turned up side-by-side in a search, but I'm betting that they shouldn't have. This situation would not have been caught by the md5 check, but really - it reflects a flaw in a local system (keeping track of packageids), so is probably out of scope.

mobb commented 6 years ago

This is a housekeeping problem for the site and not the only one of their datasets affected. But since they know about it, it's not a good use case for a check.

srearl commented 6 years ago

A housekeeping problem for the site for sure but I wonder if a check that would help sites/data-set-submitters avoid such issues might be worth considering given that, once submitted, it is no longer just a local problem but now a community problem. Data packages cannot be deleted so now there are these duplicates, which is not the end-of-the-world but also not a good situation.

mobb commented 6 years ago

There may be a check of some form, e.g., a warn that "this exact entity is associated with another EML record!"

So, OK - we can leave it! thanks @srearl

BTW, the solution (to the duplicate) is still technically called a "delete", although datasets are only "deleted-from-index" (they are not actually deleted, only archived).

The process needs to get written up -- it falls into the BPs-for-working-with-pasta, so I guess it's mine, and belongs here: https://github.com/EDIorg/dm-best-practices The site has to be the one to make the linkage between a deprecated dataset A and its replacement B. That is the main housekeeping task. We have a couple of these to do for SBC, and I am collecting a list of reasons why B is likely to be deprecating A.

mobb commented 6 years ago

Here is a legitimate reason for one table to show up in multiple datasets: the table is a species list, and the submitter has multiple datasets that use the same species table. There are alternatives (e.g., create a unique dataset for the species list table), but packaging them together can be more convenient. So a check that looks for duplicate entities should not return a warn, because we should not imply it's not a legitimate thing to do.

srearl commented 6 years ago

Yeah, but @mobb is that not a different issue? Having the same data entities in different data packages I do not think is a problem. In fact, I do that often and purposefully, for example when the same spatial data apply to data in different data packages. This issue here, I thought, was concerning duplicate data packages (not data entities across packages).

mobb commented 6 years ago

This issue describes a check for duplicate entities using entity checksums. Duplicate packages would be harder. And yes, the example that started this is actually a duplicate dataset.
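
One possible way to approach the harder duplicate-package case, as a rough sketch only: treat two packages as candidate duplicates when their complete sets of entity checksums are identical but their scope.identifier differs. The `packages` mapping and the function names here are assumptions made for illustration, not anything ECC or PASTA provides. Packages that merely share one table (such as a common species list) alongside other, different entities are not flagged.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def package_series(package_id: str) -> str:
    """Drop the revision: 'scope.identifier.revision' -> 'scope.identifier'."""
    scope, identifier, _revision = package_id.rsplit(".", 2)
    return f"{scope}.{identifier}"


def candidate_duplicate_packages(
    packages: Dict[str, Iterable[str]]
) -> List[List[str]]:
    """Group packageIds whose full set of entity checksums is identical.

    `packages` is a hypothetical mapping of packageId -> entity checksums.
    A group is reported only if it spans more than one scope.identifier,
    so revisions of a single dataset do not count as duplicates.
    """
    groups = defaultdict(set)
    for package_id, checksums in packages.items():
        groups[frozenset(checksums)].add(package_id)

    result = []
    for ids in groups.values():
        if len({package_series(p) for p in ids}) > 1:
            result.append(sorted(ids))
    return sorted(result)


# Illustrative checksums only ("aaa", "bbb", "ccc" are placeholders).
example = {
    "knb-lter-jrn.2100351001.45": ["aaa"],
    "knb-lter-jrn.210351001.47": ["aaa"],
    "edi.1.1": ["bbb", "ccc"],  # shares "bbb" with edi.2.1, but not the full set
    "edi.2.1": ["bbb"],
}
print(candidate_duplicate_packages(example))
```

Matching on checksums alone would still miss duplicates whose data were re-exported with trivial byte differences, so this would at best produce candidates for a person to review.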

srearl commented 6 years ago

I think the duplicate data set (not duplicate data entities in different packages) is worth considering. Maybe as a new issue but either way.