CDLUC3 / mrt-doc

Documentation and Information regarding the Merritt repository

File names generated via Nuxeo harvest include unneeded characters #1655

Open elopatin-uc3 opened 9 months ago

elopatin-uc3 commented 9 months ago
It seems like the harvester is deriving the filename from an href value in the feed, instead of using the dc:title field. 

For example:

<link href="https://nuxeo.cdlib.org/Nuxeo/nxfile/default/273d436f-7036-4354-a1f8-0ba5c7855a2b/file:content/MS-F044_accn2022_005_bag.zip?changeToken=4-0" rel="alternate" title="Main content file">
      <opensearch:checksum algorithm="MD5">210648d666d520d81a9c5725aed79106</opensearch:checksum>
    </link>
    <dc:creator>Parham, Thomas A.</dc:creator>
    <dc:title>MS-F044_accn2022_005_bag</dc:title>

Note the changeToken=4-0 query string at the end of the zip file URL.

Unfortunately, the .zip extension is not present in dc:title, so dc:title alone cannot replace the href as the source of the file name.
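
As an illustration of the problem described above, here is a minimal sketch of deriving the file name from the href while ignoring the query string. The function name `filename_from_href` is hypothetical and is not part of the harvester; the sketch only assumes that the file name is the last path segment of the href.

```python
from posixpath import basename
from urllib.parse import unquote, urlsplit


def filename_from_href(href: str) -> str:
    """Return the last path segment of href, ignoring any query or fragment.

    Hypothetical helper for illustration; not the harvester's actual code.
    """
    # urlsplit separates the path from the query ("changeToken=4-0"),
    # so the query never reaches the file name.
    path = urlsplit(href).path
    # Percent-decode so an encoded name (e.g. "%20") becomes readable again.
    return unquote(basename(path))


href = (
    "https://nuxeo.cdlib.org/Nuxeo/nxfile/default/"
    "273d436f-7036-4354-a1f8-0ba5c7855a2b/file:content/"
    "MS-F044_accn2022_005_bag.zip?changeToken=4-0"
)
print(filename_from_href(href))  # MS-F044_accn2022_005_bag.zip
```

This keeps the .zip extension that dc:title lacks and drops the changeToken parameter.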
elopatin-uc3 commented 9 months ago

@elopatin-uc3 to file a separate issue for a different Nuxeo harvest file name problem, in which the ARK is included and the file extension is excluded.

https://github.com/CDLUC3/mrt-doc-private/issues/66

terrywbrady commented 8 months ago

We will investigate whether the compose plugin can be installed on Linux.

dloy commented 8 months ago

Comment: The S3 key used by these files includes the query portion of the URL.

<key>ark:/13030/m5r89gbn|1|producer/nuxeo.cdlib.org/Nuxeo/nxfile/default/cede1001-e4f1-4223-9a34-243da2296bcd/file:content/highlander_19731004_027.tif?changeToken=6-0</key>

Changing the ingest pathname handling to strip the query portion of the URL will store new content under the correct pathname/key, but the original key and content will remain with the earlier version.
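
To make the old-key versus corrected-key relationship concrete, here is a rough sketch. It assumes the key layout is `ark|version|pathname`, which is inferred only from the example key above, and that the first `?` in the pathname starts the unwanted query portion. `strip_query_from_key` is a hypothetical name, not an existing Merritt function.

```python
def strip_query_from_key(key: str) -> str:
    """Return the key with any '?query' suffix removed from its pathname.

    Assumes the layout 'ark|version|pathname' (inferred from one example key).
    """
    # Split into at most three parts so a '|' inside the pathname is preserved.
    ark, version, pathname = key.split("|", 2)
    # Keep only what precedes the first '?'; keys without a query are unchanged.
    clean_pathname, _, _query = pathname.partition("?")
    return "|".join((ark, version, clean_pathname))


old_key = (
    "ark:/13030/m5r89gbn|1|producer/nuxeo.cdlib.org/Nuxeo/nxfile/default/"
    "cede1001-e4f1-4223-9a34-243da2296bcd/file:content/"
    "highlander_19731004_027.tif?changeToken=6-0"
)
new_key = strip_query_from_key(old_key)
print(new_key)
```

Both keys would exist side by side after an ingest change: `old_key` in the earlier version and `new_key` in the new one. A pathname that legitimately contains `?` would be truncated by this sketch, so it is not safe to apply blindly.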

elopatin-uc3 commented 8 months ago

@elopatin-uc3 should set up a separate meeting to discuss. Consider inviting AT for Nuxeo details.

elopatin-uc3 commented 7 months ago

Discussed on 1/12. A follow-up meeting will be scheduled to talk about possible solutions. Initial work should include fixing the harvester so that it disregards URL parameters and excludes them from file names. We should also consider using the Update endpoint instead of Add.

mreyescdl commented 5 months ago

Work is being scheduled for eliminating the change token data currently in S3. In tandem, we will need to eliminate the creation of new data with query parameters.

We will modify the Nuxeo client to strip any query parameters from the filename.
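
For discussion, here is a sketch of what query-free pathnames could look like when read from a feed entry. It is not the actual Nuxeo client code. It assumes the entry uses the standard Atom namespace (the excerpt in this issue does not show its namespace declarations) and that the producer pathname is `producer/` plus host plus path, as the S3 key quoted earlier suggests. `producer_pathnames` is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Standard Atom namespace; assumed here, since the feed excerpt omits it.
ATOM = "{http://www.w3.org/2005/Atom}"

ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom">
  <link href="https://nuxeo.cdlib.org/Nuxeo/nxfile/default/273d436f-7036-4354-a1f8-0ba5c7855a2b/file:content/MS-F044_accn2022_005_bag.zip?changeToken=4-0" rel="alternate" title="Main content file"/>
</entry>"""


def producer_pathnames(entry_xml: str) -> list:
    """Build 'producer/<host><path>' for each alternate link, without the query."""
    entry = ET.fromstring(entry_xml)
    names = []
    for link in entry.iter(ATOM + "link"):
        if link.get("rel") != "alternate":
            continue  # only content links are of interest in this sketch
        parts = urlsplit(link.get("href", ""))
        # parts.query (e.g. "changeToken=4-0") is deliberately left out.
        names.append(f"producer/{parts.netloc}{parts.path}")
    return names


print(producer_pathnames(ENTRY))
```

The download itself would presumably still use the full href, including changeToken; only the stored pathname drops it.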

mreyescdl commented 5 months ago

Nuxeo client change to eliminate changeToken: https://github.com/CDLUC3/mrt-atom/pull/8