Document Object is required before uploading a file whereas Noark 5 requires a file at all times

ivaylomitrev commented 1 year ago

        Project  NOARK 5 Tjenestegrensesnitt
       Category  Noark 5.5.0 TG versjon 1.0
       Severity  comment / objection
   Message type  omitted / needs clarification
 User reference  user@example.com
  Document part  Chapter #6 (Document upload of files, both small and large)

Description

As per the API specification (Chapter 6 document upload):

A dokumentobjekt is created before upload. If any of the fields «format», «mimeType», «filnavn», «sjekksum», «sjekksumAlgoritme» and «filstoerrelse» are filled in at creation, the server shall verify that the values in the given fields match once the complete file has been uploaded.
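The verification step described in that passage could look roughly like the sketch below. It only covers «filstoerrelse» and «sjekksum»; the function name, the dict representation, and the SHA-256 fallback when «sjekksumAlgoritme» is absent are assumptions made for illustration, not something the specification text above prescribes.

```python
import hashlib

def verify_upload(declared: dict, content: bytes) -> list:
    """Return the names of declared fields that do not match the uploaded file.

    Hypothetical helper: 'declared' holds the fields filled in when the
    dokumentobjekt was created, 'content' is the complete uploaded file.
    """
    mismatches = []

    size = declared.get("filstoerrelse")
    if size is not None and size != len(content):
        mismatches.append("filstoerrelse")

    checksum = declared.get("sjekksum")
    if checksum is not None:
        # Assumption: fall back to SHA-256 when no algorithm was declared.
        algorithm = declared.get("sjekksumAlgoritme", "SHA-256")
        # hashlib expects names such as "sha256", so normalise "SHA-256".
        digest = hashlib.new(algorithm.replace("-", "").lower(), content).hexdigest()
        if digest != checksum.lower():
            mismatches.append("sjekksum")

    return mismatches
```

An empty list means every declared value checked here agrees with the uploaded file; fields left out at creation are simply not checked.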

As per the Noark 5 standard (section 2.7 Dokumentbeskrivelse og dokumentobjekt), however:

Dokumentobjekt is the lowest metadata level in the archive structure. A dokumentobjekt shall refer to one and only one document file.

Additionally, arkivstruktur.xsd (as shipped with the Noark 5.5 standard) defines:

(which implies minOccurs=0)

The metadatakatalog also identifies the field as "obligatory".

As a result, the API specification requires that archive cores allow "empty" document objects which might, however, lead to data quality issues (as a document may never be linked to said document object). Such empty document objects might also have to be "worked around" in implementations of the API specification as they cannot be returned as results of queries (them not being Noark-compliant in this intermediary state).

Please let me know if I have misinterpreted the specification or the standard.

Requested change

Is it possible to allow documents to be uploaded prior to creating a document object? This way, the Noark 5 requirements will be met by the following flow:

1. Upload document
2. Create document object referencing the uploaded document
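A minimal sketch of this two-step flow, using a toy in-memory archive core. The class, the upload token, and the method names are hypothetical; nothing like this exists in the current API specification. The point is only that a dokumentobjekt can never exist without a file in this ordering.

```python
import uuid

class ArchiveCore:
    """Toy in-memory archive core; purely illustrative."""

    def __init__(self):
        self._pending_files = {}    # upload token -> bytes, not yet archived
        self.files = {}             # systemID -> bytes
        self.dokumentobjekter = {}  # systemID -> dokumentobjekt metadata

    def upload_file(self, content: bytes) -> str:
        """Step 1: store the file and return a (hypothetical) upload token."""
        token = str(uuid.uuid4())
        self._pending_files[token] = content
        return token

    def create_dokumentobjekt(self, token: str, metadata: dict) -> dict:
        """Step 2: create the dokumentobjekt; fails unless the file exists."""
        if token not in self._pending_files:
            raise KeyError("unknown upload token")
        content = self._pending_files.pop(token)
        obj = dict(metadata, systemID=str(uuid.uuid4()), filstoerrelse=len(content))
        self.files[obj["systemID"]] = content
        self.dokumentobjekter[obj["systemID"]] = obj
        return obj
```

Because the token is consumed on creation, each uploaded file ends up referenced by exactly one dokumentobjekt.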

petterreinholdtsen commented 1 year ago

[ivaylomitrev]

Is it possible to allow documents to be uploaded prior to creating a document object? This way, the Noark 5 requirements will be met by the following flow:

Upload document

Create document object referencing the uploaded document

Nope, there is no such mechanism described in the API specification. I personally believe uploading and creating a dokumentobjekt should be done in one step, as suggested in <URL: https://github.com/arkivverket/noark5-tjenestegrensesnitt-standard/issues/25 >, but it will have to wait. The change we got for the current version is in <URL: https://github.com/arkivverket/noark5-tjenestegrensesnitt-standard/commit/ba1e63e74beeb4e4496b8cf82298f2e4cd6406bb >.

-- Happy hacking Petter Reinholdtsen

ivaylomitrev commented 1 year ago

Thanks for the confirmation! Uploading and creating a document object in one step will satisfy the Noark 5 requirements (and, thus, ours). I might have to follow up on this one in the near future as it heavily affects our own implementation of the API.

By saying that it will have to wait, do you know if there's an ongoing process for fixing outstanding issues and releasing a new version of the API in the short term?

petterreinholdtsen commented 1 year ago

[ivaylomitrev]

By saying that it will have to wait, do you know if there's an ongoing process for fixing outstanding issues and releasing a new version of the API in the short term?

Not quite sure what your definition of short term is, but the editors are working on and off with the specification, and we have an editorial meeting planned in a few weeks. No idea if we will agree on wrapping up a new release any time soon, but I will at least argue that it is a good idea. :)

As always, patches and suggestions to improve the specification text are welcome. :)

-- Happy hacking Petter Reinholdtsen

ivaylomitrev commented 1 year ago

That might prove difficult with my (non-existent) Norwegian skills, but I will try where possible :)

petterreinholdtsen commented 1 year ago

[ivaylomitrev]

That might prove difficult with my (non-existent) Norwegian skills, but I will try where possible :)

I would be happy to set up a translation framework for the Noark 5 Tjenestegrensesnitt specification if someone wants to translate the text to English, like I have done for the Noark 5 standard text. Let me know if someone wants to translate it to English, and I can spend some time setting it up on <URL: https://hosted.weblate.org/ >. It is quite a lot of work to translate such texts, so I have not had the motivation to start myself.

Regarding providing patches, if you cannot draft texts, perhaps you can proofread proposals and provide insights to improve them? I'll try to find time to draft a proposal for a unified file upload and dokumentobjekt creation in the next few days, for a future edition of the specification. I have some ideas on how to do it in a backwards compatible way.

-- Happy hacking Petter Reinholdtsen

petterreinholdtsen commented 1 year ago

I guess it is time to start discussing my old idea for uploading. The idea is to upload the file as early as possible in the archiving process, and then update the metadata that the system has the option to derive from the uploaded file. The open questions are how high up in the hierarchy it should be possible to do the upload, how to differentiate autodetected metadata values from manually entered/edited/checked metadata values, and how to return the list of automatically created archive entities to the client to allow the client to present the entities for validation and updates.

For example, it could be possible to upload a new document file into a file (mappe), and create the entries for registrering, dokumentbeskrivelse and dokumentobjekt automatically based on the content of the uploaded file. The same could be done by uploading into an existing registrering or dokumentbeskrivelse. It is just a question of how much information we want to ensure is created before the file upload. Perhaps it should be optional in the specification how "high" in the hierarchy it should be possible to upload a new file? What about container files like ZIP and TAR.GZ files? Uploading such a container into Mappe or Registrering might create several dokumentbeskrivelse+dokumentobjekt entries, while doing it in Dokumentbeskrivelse might create only one dokumentobjekt entry.

Further, there is the question of what to return if several entities are created in one upload request. The result could simply be the dokumentobjekt entry created (if only one is created), which will contain parent links that can be used to update the other generated entries. It could also be a dokumentobjekt entry with the created parent entries in a '_embedded' block according to the JSON Hypertext Application Language specification. Finally, it could be a list of different object types formatted like a search result in a "results" attribute. All these options would be consistent with the current specification.
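To make the three options concrete, here is a rough sketch of the response shapes as Python dictionaries. The helper names, the example hrefs, and the dokumentbeskrivelse relation key (formed by analogy with the relation URLs used elsewhere in this thread) are assumptions for illustration only.

```python
# Assumed relation prefix, modelled on the rel URLs quoted in this thread.
REL_BASE = "https://rel.arkivverket.no/noark5/v5/api/arkivstruktur/"

def single_response(dokumentobjekt: dict) -> dict:
    """Option 1: return only the created dokumentobjekt (with parent links)."""
    return dict(dokumentobjekt)

def embedded_response(dokumentobjekt: dict, parents: dict) -> dict:
    """Option 2: HAL style, with created parents nested under '_embedded'."""
    return {**dokumentobjekt, "_embedded": dict(parents)}

def results_response(entities: list) -> dict:
    """Option 3: all created entities in a search-result style list."""
    return {"results": list(entities)}

# Hypothetical example data; systemID values and hrefs are made up.
dokumentbeskrivelse = {"systemID": "db-1", "tittel": "example"}
dokumentobjekt = {
    "systemID": "do-1",
    "_links": {
        "self": {"href": "/dokumentobjekt/do-1"},
        REL_BASE + "dokumentbeskrivelse/": {"href": "/dokumentbeskrivelse/db-1"},
    },
}

option1 = single_response(dokumentobjekt)
option2 = embedded_response(
    dokumentobjekt, {REL_BASE + "dokumentbeskrivelse/": dokumentbeskrivelse}
)
option3 = results_response([dokumentbeskrivelse, dokumentobjekt])
```

In option 1 the client has to follow the parent link to reach the generated dokumentbeskrivelse, while options 2 and 3 deliver it in the same response.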

Finally, there is the question of how to handle automatically detected values, which in many cases should be manually checked by someone before the archive entities are considered finalized. For some file formats titles, authors, dates and other metadata can be extracted from the file, but for others there is no such metadata available. There needs to be a generic way to handle attributes that do not yet have a sensible value, to ensure they can be tracked down and updated manually if needed. Perhaps a "magic value" should be used to indicate that the current value is automatically generated? Perhaps if the string starts with ASCII value 26 (Substitute), it can be seen as a marker that the archive client needs to update the value and remove the ASCII 26 character? <URL: https://en.wikipedia.org/wiki/Substitute_character >
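A small sketch of how the ASCII 26 marker idea might behave in practice. The function names are hypothetical and the convention itself is only a proposal in this thread, not part of any specification.

```python
SUB = "\x1a"  # ASCII 26, the Substitute control character

def mark_autodetected(value: str) -> str:
    """Prefix a value with SUB to flag it as automatically generated."""
    return value if value.startswith(SUB) else SUB + value

def is_autodetected(value: str) -> bool:
    """True if the value still carries the SUB marker."""
    return value.startswith(SUB)

def confirm(value: str) -> str:
    """Remove the marker once a person has checked the value."""
    return value[1:] if value.startswith(SUB) else value

def unconfirmed_fields(entity: dict) -> list:
    """List the string attributes of an entity that still need manual review."""
    return sorted(
        key for key, value in entity.items()
        if isinstance(value, str) and is_autodetected(value)
    )
```

A client could call unconfirmed_fields on each returned entity to find the attributes it should present for validation, and send back confirm(value) once they have been checked.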

I am drafting a specification update, but do not yet know which path is the best way forward through this landscape. Would love input from other users and implementors of the specification.

-- Happy hacking Petter Reinholdtsen

ivaylomitrev commented 1 year ago

An open question is how high up in the hierarchy it should be possible to do the upload... [...] For example, it could be possible to upload a new document file into a file (mappe), and create the entries for registrering, dokumentbeskrivelse and dokumentobjekt automatically based on the content of the uploaded file... [...] Finally, it is the question on how to handle automatically detected values ...

My immediate reaction to this (emphasis on immediate) is that the API specification should not bother with such details. As long as it provides a simple generic way of uploading single documents that satisfies the requirements of the standard and, hopefully, all vendors, it should be sufficient. In other words, my impression is that the API specification should not overcomplicate the upload specification, especially considering that there are already ways of creating resources such as dokumentbeskrivelse, dokumentobjekt, etc. I do not think new ways of creating these resources should be exposed, as this would allow vendors to go with very specific (bordering on custom) interpretations of what metadata should be extracted from an uploaded (archive) file. Of course, the metadata for dokumentbeskrivelse/dokumentobjekt can be specified with multipart requests, which would limit the amount of guesswork for the vendors, but uploading an arbitrary tarball would pose a lot of open questions to vendors as to the mapping of the data, as such (arbitrary) multipart requests would be bothersome to build.

It seems to me that this particular point boils down to what the goal is, because I see two separate topics here - bulk upload/creation and single file uploads. I would say resolving the latter is more important at this point as it diverges from the requirements of the standard whereas the former can always be added in a backwards-compatible way to the specification, if a need arises for it. I would even argue that bulk upload should not be the topic of the API specification as long as it provides means of single file uploads. The argument here is that uploading a tarball may mean different things to different business systems (clients) and having a common vendor-specific processing of such tarballs may lead to more issues than it solves. Open topics off the top of my head are:

- how would nested hierarchies be handled
- what if a tarball contains multiple nested folders, but the upload target is a saksmappe which does not allow for such
- what would provide information about mandatory primary klasses (and list values) in hierarchies that would require such
- would each document in a tarball represent a dokumentbeskrivelse, or would they all represent a single dokumentbeskrivelse with multiple dokumentobjekter
- what if one business system wants to upload a tarball as a tarball, but another wants it to be extracted and multiple entities to be generated from it
- what limits would need to be imposed on archive files both in terms of size and contained documents... processing 5 files in a tarball is one thing, processing 1,000,000 is another regardless of the vendor's language/frameworks of choice. Also, would vendors be required to register all these in one transaction, meaning that a registration failure of the 999,999th document should revert the registration of all 999,998 before that?
- etc.
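To illustrate what the "one transaction" question in the list above would demand from a vendor, here is a minimal all-or-nothing sketch. The register and unregister callables are hypothetical stand-ins for whatever a vendor's archive core provides; a real implementation would more likely rely on database transactions.

```python
def register_all_or_nothing(documents, register, unregister):
    """Register every document, undoing earlier registrations on failure.

    'register' returns a handle for each document it stores and
    'unregister' removes a document given that handle (both hypothetical).
    """
    done = []
    try:
        for document in documents:
            done.append(register(document))
    except Exception:
        # Revert in reverse order, then let the caller see the failure.
        for handle in reversed(done):
            unregister(handle)
        raise
    return done
```

With a million documents the revert loop alone becomes a significant cost, which is the concern raised in the last list item.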

Reading the questions I posed above, I am thinking that specifying bulk upload would either have to be extremely configurable to satisfy the requirements of various client-side business systems making it difficult to support by vendors or it would have to be very lenient allowing for a lot of interpretation and making it useless for clients.

I am not critiquing the idea in any way. I am convinced bulk upload would be a requirement by certain business systems/integrations. It is just my personal opinion that this best be left to the discretion of clients that are responsible for the business logic of the corresponding business system as either the API specification would have to be very limiting (posing issues for one or another existing vendor), or it would have to be very lenient (making the implementation very vendor-specific), or it would have to be overly configurable (making it difficult to implement and support both in terms of vendors and API specification).

EDIT: Of course, if there are actual requirements for bulk uploads by business systems/clients, maybe the best approach would be to gather such from them. Until such are available, I believe the bulk upload can go in too many directions and it might, unfortunately, be a guessing game as to what developers might need.

petterreinholdtsen commented 4 months ago

It occurred to me that a way to ensure consistency, and to avoid dokumentobjekt instances without attached files being visible to unsuspecting consumers of the API, is to delay the attachment of the dokumentobjekt child entity to the dokumentbeskrivelse instance until the file is successfully uploaded.

When creating a dokumentobjekt instance, the instance is returned with a _links dictionary including both a self link and the https://rel.arkivverket.no/noark5/v5/api/arkivstruktur/fil/ link, so the program creating it can do the upload with the information available. But the https://rel.arkivverket.no/noark5/v5/api/arkivstruktur/dokumentobjekt/ key in the _links dictionary of the parent dokumentbeskrivelse instance does not need to be presented to API consumers before the file is uploaded.

This ensures programs uploading files can use the unfinished dokumentobjekt instance without causing consistency problems for any other API consumer.

As far as I can tell, this is allowed by both the Noark 5 specification and the Noark 5 tjenestegrensesnitt specification, and would allow an implementation to provide a consistent view without changing the current API description.

Note, I still would like to handle uploads directly from registrering, as proposed in #309. My point is that it is possible to avoid the problem described in this issue without any changes to the API specification.

-- Happy hacking Petter Reinholdtsen

ivaylomitrev commented 4 months ago

But the _links dictionary in the parent dokumentbeskrivelse instance for the https://rel.arkivverket.no/noark5/v5/api/arkivstruktur/dokumentobjekt/ key does not need to be presented to API consumers before the file is uploaded.

That would only be possible for a dokumentbeskrivelse that had no dokumentobjekt instance in the first place. If the dokumentobjekt is being created in an existing dokumentbeskrivelse with multiple objects in it, the vendor would still need to return the dokumentobjekt key in the _links dictionary due to the presence of other dokumentobjekter.

As far as I can tell, this is both allowed by the Noark 5 specification and the Noark 5 tjenestegrensesnitt specification, and would allow an implementation to provide a consistent view without changing the current API description.

Would not that clash with the relasjoner requirements for dokumentobjekt in the specification that says that a dokumentobjekt must have a dokumentbeskrivelse:

petterreinholdtsen commented 4 months ago

[ivaylomitrev]

That would only be possible for a dokumentbeskrivelse that had no dokumentobjekt instance in the first place. If the dokumentobjekt is being created in an existing dokumentbeskrivelse with multiple objects in it, the vendor would still need to return the dokumentobjekt key in the _links dictionary due to the presence of other dokumentobjekter.

Why would that be a requirement?

As far as I can tell, the only feature implementations need to have for this to work is a one-way link between objects, which can be turned into a two-way link when the file is uploaded. This can be done with a state variable or by keeping the dokumentobjekt entity in a holding area until it is ready to be hooked up to the rest of the data hierarchy. In other words, when dokumentbeskrivelse and dokumentobjekt are created:

[dokumentbeskrivelse] <---- [dokumentobjekt]

And after the file is uploaded:

[dokumentbeskrivelse] <----> [dokumentobjekt] ---> [uploaded file]

The dokumentobjekt _links list should point to its parent, but the parent should not point to its child without any uploaded file.

This can be implemented by the API by only handing out dokumentobjekt instances in the list returned behind the https://rel.arkivverket.no/noark5/v5/api/arkivstruktur/dokumentobjekt/ relation when a file is attached, and only handing out the "empty" dokumentobjekt instance to the creator who used the https://rel.arkivverket.no/noark5/v5/api/arkivstruktur/ny-dokumentobjekt/ relation.
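A rough sketch of the state-variable variant, with a toy in-memory store. The class and method names are made up for illustration; the only idea taken from the description above is that the child list of a dokumentbeskrivelse leaves out every dokumentobjekt that has no file yet.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Dokumentobjekt:
    system_id: str
    parent_id: str                 # one-way link to the parent, set at creation
    file: Optional[bytes] = None   # None until the upload completes

    @property
    def committed(self) -> bool:
        """The state variable: visible to other consumers only with a file."""
        return self.file is not None

@dataclass
class Store:
    objects: dict = field(default_factory=dict)

    def create(self, system_id: str, parent_id: str) -> Dokumentobjekt:
        """Corresponds to using the ny-dokumentobjekt relation."""
        obj = Dokumentobjekt(system_id, parent_id)
        self.objects[system_id] = obj
        return obj  # only the creator receives the "empty" instance

    def upload(self, system_id: str, content: bytes) -> None:
        """Attaching the file turns the one-way link into a two-way link."""
        self.objects[system_id].file = content

    def children_of(self, parent_id: str) -> list:
        """The list behind the dokumentobjekt relation: committed objects only."""
        return [
            obj for obj in self.objects.values()
            if obj.parent_id == parent_id and obj.committed
        ]
```

Existing dokumentobjekt instances under the same dokumentbeskrivelse stay listed the whole time; only the new, still empty one is left out until its upload finishes.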

Would not that clash with the relasjoner requirements for dokumentobjekt in the specification that says that a dokumentobjekt must have a dokumentbeskrivelse:

Not really, as it does have a dokumentbeskrivelse. It is just not "committed" to the data structure before it has a file uploaded to it.

-- Happy hacking Petter Reinholdtsen

arkivverket / noark5-tjenestegrensesnitt-standard

Document Object is required before uploading a file whereas Noark 5 requires a file at all times #285
