Open jeanetteclark opened 4 years ago
If we really wanted to be thorough, we could also add a `url` with the attribute `download` pointing to the coordinating node URL for the object itself.
Bump. This should be a priority, @mbjones can follow up with other thoughts
The distribution URL should be the view URL if the dataset has a UUID, otherwise the DOI URL if the dataset has a DOI.
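A minimal sketch of that rule, in case it helps the discussion. The function name, the `viewBaseUrl` parameter, and the assumption that DOI identifiers carry a `doi:` prefix are all illustrative and not taken from the MetacatUI code base:

```javascript
// Sketch only: `distributionUrl` and `viewBaseUrl` are hypothetical names,
// and the "doi:" prefix test is an assumption about how DOI PIDs look.
function distributionUrl(pid, viewBaseUrl) {
  if (/^doi:/i.test(pid)) {
    // Datasets with a DOI point at doi.org.
    return "https://doi.org/" + pid.replace(/^doi:/i, "");
  }
  // Everything else points at the repository's view page for the identifier.
  return viewBaseUrl.replace(/\/+$/, "") + "/" + encodeURIComponent(pid);
}
```

For example, `distributionUrl("urn:uuid:1234", "https://example.org/view")` would give a view URL with the identifier URL-encoded.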
Update:
The soon-to-be-released `develop` branch has a number of updates that support this feature, including:
Further work on this feature continues in the `feature-1380-auto-add-dist-url` branch. See https://github.com/NCEAS/metacatui/commit/b80f8e05bc9666d29a6e320d39333d15b82d847b.
Remaining work to be done:
The second task will involve changes to the `MetadataView.publish` method. Instead of using the DataONE publish API, we will need to first generate the DOI, then update the EML with the new `<distribution>` URL, then save the new record. IDs can be generated using `baseURL/generate/`; see the R package for an example.
I am pushing this feature from the upcoming release to the next one for now.
Here is what happens in the Editor with the updates currently in the `feature-1380-auto-add-dist-url` branch:
- The behaviour is controlled by the `autoAddDistributionURL` option in the app config.
- When the dataset is saved, old distribution URLs are removed. A `<distribution>` element is considered an old distribution URL if ALL the following are true:
  - It has an `<online>` child element
  - The `<online>` has a child `<url>` element
  - The `<url>` element has a `function` attribute set to "information"
  - The `<url>` value contains the dataset's new PID, old PID, or seriesId, whether it is URL encoded or not.
- A new distribution URL is then added: `<distribution><online><url function="information">{DOI or VIEW URL}</url></online></distribution>`
flowchart TD
A[press SAVE button]
C(`autoAddDistributionURL`?)
D[Remove old distribution URLs]
F[Add new distribution URL]
G[EML is serialized as normal]
subgraph "For Each <distribution> Node"
Z[start checking node]
Y(Has < online> child?)
X(Has child < url> element?)
W(< url> has 'function=information'?)
V(< url> contains new PID, old PID, or seriesId?)
U[Remove it]
T[Keep it]
S[done checking node]
Z --> Y
Y -- Yes --> X
X -- Yes --> W
W -- Yes --> V
V -- Yes --> U
Y -- No --> T
X -- No --> T
W -- No --> T
V -- No --> T
S -. next node .-> Z
end
A --> C
C -- Yes --> D
D --> Z
F --> G
C -- No --> G
U --> S
T --> S
S --> F
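The flowchart above could be sketched roughly as follows. This is not the code in the branch: distributions are modelled here as plain objects instead of EML DOM nodes, and the function names and the `autoAdd` parameter are hypothetical stand-ins for the `autoAddDistributionURL` option.

```javascript
// Sketch only: each distribution is modelled as a plain object, e.g.
// { online: { url: { function: "information", value: "https://..." } } }
function isOldDistributionUrl(dist, ids) {
  const url = dist.online && dist.online.url;
  if (!url || url.function !== "information") return false;
  const value = url.value || "";
  // Match each identifier both as-is and URL-encoded.
  return ids
    .filter(Boolean)
    .some((id) => value.includes(id) || value.includes(encodeURIComponent(id)));
}

// ids = [new PID, old PID, seriesId]; missing ids may be null or undefined.
function updateDistributions(distributions, ids, newUrl, autoAdd) {
  if (!autoAdd) return distributions; // option off: serialize unchanged
  const kept = distributions.filter((d) => !isOldDistributionUrl(d, ids));
  kept.push({ online: { url: { function: "information", value: newUrl } } });
  return kept;
}
```

Keeping the "is this an old distribution URL" test in its own predicate mirrors the per-node subgraph in the diagram and makes the four conditions easy to test in isolation.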
I've been working on how the publish DOI button will need to work in order to keep the online distribution information up-to-date. The notes below show my preliminary ideas... Feedback is very welcome!
How the Publish With DOI button works now:
- A request is sent to the `/publish` endpoint. The request includes the PID of the current EML document.

Proposed new behaviour for the Publish With DOI button:
- A new DOI is requested from the `/generate` endpoint. No PID is included in the request; it's just a reserved DOI at this point.
- The `<distribution>` elements that give the online distribution URL of the dataset are updated with the new DOI URL.

sequenceDiagram
participant U as User (Browser)
participant S as DataONE (Server)
U->>U: Click Publish button
activate U
U->>S: Request new DOI via /generate/ endpoint
deactivate U
activate S
S-->>U: Return new reserved DOI
activate U
deactivate S
U->>S: Request EML doc
deactivate U
activate S
S-->>U: Send EML doc
activate U
deactivate S
U->>S: Request resource map
deactivate U
activate S
S-->>U: Send resource map
activate U
deactivate S
U->>U: Parse EML & resource map
U->>U: Update EML doc with new DOI
U->>U: Update resource map with DOI
U->>S: Save EML with new DOI
deactivate U
activate S
S-->>U: Success
activate U
deactivate S
U->>S: Save resource map new PID
deactivate U
activate S
S-->>U: Success
activate U
deactivate S
U->>U: Redirect to new view URL with the DOI
deactivate U
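To make the ordering in the sequence diagram concrete, here is a rough sketch of the client-side flow. The `client` object and every method on it are hypothetical placeholders for whatever models end up doing this work; none of these names come from MetacatUI or the DataONE API.

```javascript
// Sketch only: `client` is a hypothetical object injected by the caller.
async function publishWithDoi(client, oldPid) {
  // 1. Reserve a DOI; no PID is sent with this request.
  const doi = await client.generateDoi();

  // 2. Fetch the current EML document and resource map.
  const eml = await client.fetchEml(oldPid);
  const resourceMap = await client.fetchResourceMap(oldPid);

  // 3. Update both in memory so that they reference the new DOI.
  const newEml = client.updateEml(eml, oldPid, doi);
  const newMap = client.updateResourceMap(resourceMap, oldPid, doi);

  // 4. Save the EML first, then the resource map.
  await client.saveEml(oldPid, doi, newEml);
  await client.saveResourceMap(newMap);

  // 5. Hand back the view URL to redirect to.
  return client.viewUrl(doi);
}
```

Injecting the client keeps the sequence testable without a server, which may matter given how many round trips are involved.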
Overall looks great @robyngit. The other thing the `/publish` endpoint does is change access control to make the whole package, including all metadata/ORE/data files, publicly readable if they are not already. This is because we have a policy that data with a DOI are public. Can you add that to your list, and review the `/publish` implementation to be sure we're not missing something else?
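As a rough illustration of the access-control step, the sketch below adds a public read rule only when one is missing. The access policy is modelled as a plain array of `{ subject, permissions }` rules, and the function name and the use of `"public"` as the subject are assumptions made for illustration, not the actual system metadata handling.

```javascript
// Sketch only: `ensurePublicRead` is a hypothetical helper, and the policy
// shape ({ subject, permissions }) is a simplification of an access policy.
function ensurePublicRead(accessPolicy) {
  const isPublic = accessPolicy.some(
    (rule) => rule.subject === "public" && rule.permissions.includes("read")
  );
  // Return the policy untouched if it is already public, else a new copy
  // with a public read rule appended.
  return isPublic
    ? accessPolicy
    : [...accessPolicy, { subject: "public", permissions: ["read"] }];
}
```

The same helper would need to be applied to the metadata, the resource map, and every data file in the package to match the policy described above.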
The `publish` method is implemented by Metacat in `MNNodeService`. Given an identifier and a session, it takes the following steps:
Resolve SID to PID: Using the method `getPIDForSID`, it checks whether the original ID is actually a Series ID and resolves it to a PID if necessary.
Fetch Metadata: Retrieves the system metadata (and Science Metadata?) of the dataset using the `getSystemMetadata` method.
Mint New Identifier: Generates a new identifier (DOI) for the new version of the dataset using the `generateIdentifier` method.
Update Metadata: Modifies the new System Metadata to reference the new identifier and to mark the original identifier as obsoleted.
Make Metadata Public: If the original metadata isn't publicly accessible, the `makePublicIfNot` method ensures that the new metadata is made publicly readable.
Update or Edit Metadata: If the original dataset is a science metadata document (e.g., in EML format), it updates the metadata with the new identifier.
Object Update: Finally, it calls the update method to persist these changes.
Update Resource Map: (optionally?) Updates the resource map (ORE) that describes the relationships between the metadata and any accompanying data. It does so either by finding an existing resource map and updating it or by generating a new one if an existing one is not found. Specifically, it:
- Finds existing resource map: First, the code attempts to find the existing resource map based on a specific naming convention (`potentialOreIdentifier`). If it doesn't find it, it tries to get the newest resource map for the original identifier (`originalIdentifier`) from SOLR.
- Modifies resource map: The existing resource map is modified using `ResourceMapModifier` to replace the identifier of the original metadata with the new DOI.
- Prepares new resource map System Metadata: The System Metadata (`SystemMetadata`) of the existing resource map is copied, and some of its properties are updated, such as setting new identifiers, checksums, and size.
- Makes resource map public: The new resource map System Metadata is made publicly readable.
- Updates or creates resource map: Finally, the code updates the existing resource map with the modified one or, in some scenarios, creates a new resource map if one does not exist.
Return New Identifier: The method returns the new identifier (DOI) that was minted for the updated science metadata.
Notes:
- If an exception is caught, it gets wrapped in a `ServiceFailure` exception with an error code of 1030.

Questions:
@taojing2002 - Is this summary correct, and am I missing anything here?
@mbjones and everybody - We decided that MetacatUI should implement the above steps rather than using the `publish` endpoint, in order to support automatically adding the online distribution URL to EML docs. Given the complexity of the task, I'm wondering whether it might make more sense to update the Metacat implementation instead? The `publish` method already parses and updates the EML with the `editScienceMetadata` method. From what I can tell, all it would require is to extend the method to also update old distribution URLs. I imagine that this would be a useful function for Metacat to perform generally? Thoughts?
Great summary, thanks @robyngit. The reason I have been pushing you to reimplement in MetacatUI is that we've generally had the policy that Metacat never changes content -- it always just takes instructions from API clients on what to change. The `publish` method violated that principle, and it still leaves much to be desired. While it updates some fields within the EML, it does not properly update all of the areas of EML that should be updated. Plus, it doesn't support other metadata standards (like ISO), and so it doesn't conform to the metadata-standard-agnostic API we've had with Metacat. Also, if we release new versions of EML, the `publish` method would need to be updated to reflect those versions. I'd really like to keep the principle that changing metadata is a client responsibility, and validating metadata and storing it is a Metacat responsibility. But let's discuss -- expediency got us to where we are today, and sometimes we need to take the faster route.
Oh, and to add one more thing. Because MetacatUI already has the metadata parsed and methods for creating and publishing new versions, I feel like the client-side implementation of this would be modifying a few things on a well-worn trail:
Maybe I'm wrong about whether existing MetacatUI code already does all of this. So let's discuss if that is the case.
The summary looks great! (Sent by email in reply to Robyn's comment of Aug 22, 2023, 3:26 PM.)
Thank you for explaining that, @mbjones! I can see the rationale behind keeping the roles of storage/validation and metadata manipulation separate. I agree it makes sense to implement the new publish behaviour in MetacatUI.
Currently in MetacatUI, some of the code that handles data package management that we need for the publish behaviour is entangled in the `EML211EditorView`. There's also some model logic in the `MetadataView`. The new behaviour that we've outlined here belongs in a model, not in the `MetadataView`. To proceed, I would like to move code out of these two views and into either the existing `DataPackage` collection, or into a new model (`DataPackageManager`?). The new `publish` method could be added to this model and used by the `MetadataView`'s publish button.
This change is not strictly necessary for the new publish behaviour, but I think it would be a good idea for a few reasons:
- The `DataPackage` model itself is already quite large and complex (with parts that perhaps should be refactored into separate models).

@rushirajnenuji, does the hierarchical package table work involve changes to the `DataPackage` collection and `MetadataView`? If so, I think it would be best to proceed with this feature after the package table work is merged into `develop`.

Hi @robyngit,
Thank you for checking.
Regarding changes related to the hierarchical package table work, the `DataPackage` collection does not have major changes. It includes some additional methods for parsing and storing `atLocation` information and `nested` package info.
However, I believe there are quite a few changes with the `MetadataView`. A lot of functionality related to the Package Table has been refactored and/or moved to other views, such as `DataPackageView`, `DataItemView`, etc.
That makes sense @rushirajnenuji, thanks! I'll put this issue on hold for now, and continue later when this is merged in develop.
Some quick thoughts on lifecycle representation in the app for discussion...
stateDiagram-v2
[*] --> Draft: New
Draft --> Draft: Save
Draft --> [*]: Delete
review : In review
rev_request: Review requested
Draft --> rev_request: Review
rev_request --> review: StartReview
review --> Draft: RequestRevision
review --> Approved: Approve
Approved --> Published: Publish
Published --> Draft: Edit
Published --> [*]
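If it helps the discussion, the state diagram above can be written down as a small transition table. This is only a sketch of the diagram as drawn: the `transitions` table and `nextState` helper are hypothetical names, and `null` stands in for the end state.

```javascript
// Sketch only: states and actions are copied from the diagram;
// `null` represents the terminal [*] state.
const transitions = {
  Draft: { Save: "Draft", Review: "Review requested", Delete: null },
  "Review requested": { StartReview: "In review" },
  "In review": { RequestRevision: "Draft", Approve: "Approved" },
  Approved: { Publish: "Published" },
  Published: { Edit: "Draft" },
};

function nextState(state, action) {
  const options = transitions[state];
  if (!options || !Object.prototype.hasOwnProperty.call(options, action)) {
    throw new Error(`"${action}" is not allowed from "${state}"`);
  }
  return options[action];
}
```

A table like this makes it easy to see which actions the app would need to offer in each state, for example that Publish is only reachable from Approved.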
@mbjones, thanks for this diagram! I opened #2205 as a place to continue the discussion on the publishing workflow, and included your comment there as well. This way, we can dive deeper into the workflow discussion without losing track of the specifics here.
Let's keep this particular issue focused on: 1) Automatically adding the distribution URL & 2) Moving the functionality that's currently in the MNNodeService publish method to MetacatUI, in support of the first point.
Describe the feature you'd like
I would like for MetacatUI to automatically add a distribution URL to the `dataset` element.
Is your feature request related to a problem? Please describe.
This is related to a check in the metadig FAIR suite, which looks for whether a resource landing page is present. The check looks at this XPath: `/eml/dataset/distribution/online/url[@function="information"] |`.
Additional context
Here is an example:
The base URL should be the view service for whatever member node is being used, or if the dataset has a DOI, it should use doi.org.