DCM unable to process new identifier structure

matthew-a-dunlap commented 6 years ago

Dataverse sends a dataset identifier to DCM for creation of the rsync script. With recent changes to the identifier, the value now being passed is something akin to "FK2/IPQXJP" instead of "IPQXJP" previously.

This causes an error in the DCM request queue: useradd: invalid user name 'FK2/IPQXJP'

It may be possible that the solution is on the DCM side, but I created an issue here in part because it blocks other Dataverse/DCM work (namely #4703)

Thanks to @pameyer for deeply diagnosing this issue as I ran across it.

connects to #4703

matthew-a-dunlap commented 6 years ago

I've moved this into development as I think the issue is small and needed for the other DCM work / knowledge sharing around DCM. I plan to start on it tomorrow (more or less continuing my DCM work in general)

pdurbin commented 6 years ago

My first thought is that perhaps we should send the database id rather than the identifier but I'll defer to @pameyer on this. There was a lot of confusion about the difference back in the day.

pameyer commented 6 years ago

The thinking for the identifier (aka - IPQXJP for a dataset w\ persistent identifier doi:10.5072/FK2/IPQXJP) was that it was unique to a given installation, could be used as a POSIX filesystem path, and didn't depend on Dataverse internals (which the database id could); I remember it causing some confusion before too (https://github.com/IQSS/dataverse/issues/3725#issuecomment-306604088).

I'm guessing that this was introduced with the changes for https://github.com/IQSS/dataverse/pull/4651 ; and https://github.com/IQSS/dataverse/pull/4728 didn't make me realize that the DCM APIs were also effected. Probably out of scope here, but the various incremental docker / DCM / IT improvements are getting closer to a point where we'll be able to find this type of thing sooner (along the lines of - https://github.com/IQSS/dataverse/issues/3915#issuecomment-318511298 and https://github.com/sbgrid/data-capture-module/blob/master/jenkins/Jenkinsfile).

matthew-a-dunlap commented 6 years ago

I have pushed a fix for this to https://github.com/sbgrid/data-capture-module/tree/fix_dvid_shoulder

I went with the approach of fixing this in DCM because two datasets can have the same doi after the shoulder but different shoulders. Even if practically that won't happen at this point we should probably update DCM to do a check on the full doi.

This fix could use some cleanup but is good enough to move forward with #4703 as we discuss its merits

pameyer commented 6 years ago

Unblocking things is good; but we should probably discuss if this is the best approach.

pameyer commented 6 years ago

For future reference, @scolapasta helped clear up some of my understanding of PID / shoulder / identifier, and @matthew-a-dunlap and I talked things over. We'll go with this approach, with a little more added error handling (roughly - don't fail if there's no / in the identifier; but fail if there are any characters that aren't legal in POSIX paths / usernames.

matthew-a-dunlap commented 6 years ago

I found an additional issue. With the new identifier naming structure (e.g. adding the shoulder) the help text for running the script has the / in the name which will not work

screen shot 2018-06-28 at 6 03 28 pm

The script itself gets generated as: upload-FK2_DTMDNT.bash . We may want to touch both places and update the naming.

pdurbin commented 6 years ago

I'll defer to @pameyer but I assume we don't want "FK2/" as part of the name of the script.

matthew-a-dunlap commented 6 years ago

@pdurbin the script itself does not have the /, something along the way converts it to an _ . But yea we should decide instead of letting different parts of the system do it in different ways.

pameyer commented 6 years ago

Ideally we wouldn't have the FK2, but as long as there are no special characters in it there shouldn't be a problem (the purpose is to make it easier for depositors to identify which script goes with which dataset, in case they are depositing several at the same time).

matthew-a-dunlap commented 6 years ago

I found another problem as I've gotten farther into the DCM process. post_upload.bash uses the name of the directory to recreate the doi to send back to dataverse. Dataverse is now sending the shoulder which currently my code is stripping the / out of when creating the directory (in usg.py).

Any code to add that "/" back in seems like it would be hacky. Maybe better to just strip off the shoulder when we first get it and leave it hardcoded in post_upload.bash as "FK2"? We can even check at the beginning if its not "FK2" and blow up early. Maybe not perfect but not any worse than before

I don't think there are any other parts of the code that use this, my bad for not running the code end-to-end earlier.

pameyer commented 6 years ago

@matthew-a-dunlap good catch!

I agree that filtering and re-adding "/" seems kind of hacky. One alternative might be to add it as an API (in ur.py) parameter that gets stored in the request file (thing-to-recreate-doi-to-talk-to-dataverse?), and use that in post_upload. It would be a little duplicative (unique thing stored in two places in the request file); but that seems like a possibly reasonable tradeoff.

pdurbin commented 6 years ago

Another alternative would be to use the database id of the dataset instead of part of the DOI or Handle (the "identifier" part). The database id is unique.

matthew-a-dunlap commented 6 years ago

@pdurbin Now that we are trying to massage this more I can see that being a better option. Especially if down the road DOIs/Handles start having other weird characters in them. Any downsides you are aware of the DB ID? We don't reuse the same values, right?

pdurbin commented 6 years ago

@pameyer has explained his objection to database IDs to me multiple times but it hasn't penetrated my thick skull. 😞

scolapasta commented 6 years ago

We'd rather not use db id's. As @pameyer points out they are meant to be internal and while we do expose them for convenience sake, they shouldn't be used in contexts like this (unless we find there's no other satisfactory way, but it would then still be considered a hack).

Another reason is we don't guarantee (and don't want to guarantee) db ids to be persistent. While they usually will be, that won't be the case when datasets are migrated (and this could happen, for example in the case where someone wants to split off their datasets into a new installation.

I'll discuss further with @matthew-a-dunlap @pameyer and others who are interested.

pameyer commented 6 years ago

Short summary from discussion - plan is to use scrubbed "identifiers" for DCM upload id, add persistent identifier to DCM API / request file, and fail cleanly if there's an edge case where the scrubbed identifier would break the DCM.

matthew-a-dunlap commented 6 years ago

I cut a Dataverse branch that captures just the work for this. https://github.com/IQSS/dataverse/compare/4761-dcm-new-pid-structure?expand=1

Most of the work was on the dcm side, captured here: https://github.com/sbgrid/data-capture-module/compare/fix_dvid_shoulder?expand=1

pameyer commented 6 years ago

Closing since https://github.com/IQSS/dataverse/pull/4946 is merged

IQSS / dataverse

DCM unable to process new identifier structure #4761