Request to add more media relationships (current loader maxes out at 5 relationships)

ebraker commented 3 years ago

I'm trying to link bird records to individually scanned ledger pages hosted online by our library. The current media metadata bulkloader maxes out at 5 relationships, however, some of our ledger pages relate to 43 cataloged items, though most are in the ballpark of 12 entries (see https://ark.colorado.edu/ark:/47540/j59s8741819t). In an ideal world, a media record that I create would relate to all specimens recorded on a scanned ledger page (as "documents cataloged item" relationships).

Current error:

I've prepared the attached bulkloader. Any way to push it in or augment the metadata handler so that I can load it myself?

bird_ledger_media_load_2020-12-21.zip

FYI there is more of this coming down the pipeline, as the libraries have scanned all of our vertebrate ledgers. I still have 22 ledgers-worth of media to load, representing roughly 90K specimens (the library doesn't have all of this indexed yet, but it will be ~5000 more media records, representing individually scanned pages from each ledger, with 5-40+ "documents cataloged item" relationships per page).

dustymc commented 3 years ago

This:

should handle any number of relationships

ebraker commented 3 years ago

Ok thanks. Oof, I didn't realize there was a separate process...I will look for a somewhat efficient way to transform 100+ columns into one to create a bulkload media relationships file.

Will the media_IDs for the 850 media records I am currently creating (via bulkload media metadata) show up in the My Stuff table so I can at least grab and download them somewhat easily to put in this second bulkloader?

campmlc commented 3 years ago

Let me know if you find that efficient way to transform columns. We have way too many tools that require swapping between those two formats, and this should just not happen as transforming data in that way will inevitably lead to error for the average user.

On Tue, Dec 22, 2020, 2:00 PM Emily Braker notifications@github.com wrote:

[EXTERNAL]*

Ok thanks. Oof, I didn't realize there was a separate process...I will look for a somewhat efficient way to transform 100+ columns into one to create a bulkload media relationships file.

Will the media_IDs for the 850 media records I am currently creating (via bulkload media metadata) show up in the My Stuff table so I can at least grab and download them somewhat easily to put in this second bulkloader?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/ArctosDB/arctos/issues/3314#issuecomment-749773483, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADQ7JBD43VNYFRFFDOAY4L3SWECE5ANCNFSM4VEWHYBQ .

ebraker commented 3 years ago

@dustymc - I'm getting an "invalid" error in my Bulkload Media Metadata status column for all of my preview URIs. Any ideas as to why? They don't seem all that large (~22KB)...

Here's a couple examples: https://cudl.colorado.edu/MediaManager/srvr?mediafile=/Size1/CUB~8~8/4692/musm_zooLed_birds_v5_13734-13775.jpg https://cudl.colorado.edu/MediaManager/srvr?mediafile=/Size1/CUB~8~8/4692/musm_zooLed_birds_v5_13987-14028.jpg

Jegelewicz commented 3 years ago

https://handbook.arctosdb.org/how_to/How-to-Create-Media-Images.html#preview-uri

Preview filesize should be well under 10K and scale to ~1200x, previews larger than 48K will NOT be displayed. TIP: If you have difficulty creating an appropriately sized thumbnail you might try http://makethumbnails.com/#dropzone

ebraker commented 3 years ago

Ok, I'm still getting the invalid error and the preview links are now ~5.7 KB (96x79 pixels)...

Example: https://cudl.colorado.edu/MediaManager/srvr?mediafile=/Size0/CUB~8~8/4692/musm_zooLed_birds_v5_13944-13986.jpg https://cudl.colorado.edu/MediaManager/srvr?mediafile=/Size0/CUB~8~8/4692/musm_zooLed_birds_v5_13692-13733.jpg

Jegelewicz commented 3 years ago

That's beyond me. @dustymc

dustymc commented 3 years ago

Here's the test:

<cfset pf_puri="https://cudl.colorado.edu/MediaManager/srvr?mediafile=/Size1/CUB~8~8/4692/musm_zooLed_birds_v5_13734-13775.jpg">
<cfhttp url="#pf_puri#" charset="utf-8" method="head" />
<cfdump var=#cfhttp#>

and the result:

For whatever reason, those URIs are returning 404 to Arctos.

curl is returning something different, but still nothing that Arctos would accept as valid.

curl -I https://cudl.colorado.edu/MediaManager/srvr?mediafile=/Size1/CUB~8~8/4692/musm_zooLed_birds_v5_13734-13775.jpg
HTTP/1.1 403 403
Date: Tue, 29 Dec 2020 18:33:08 GMT
Server: Apache/2.4.6 (Red Hat Enterprise Linux) OpenSSL/1.0.2k-fips
Strict-Transport-Security: max-age=15768000
Content-disposition: filename="musm_zooLed_birds_v5_13734-13775.jpg"
Content-Type: text/html;charset=utf-8
Content-Language: en
Content-Length: 22244

ebraker commented 3 years ago

So what does that mean? The preview URLs work when I paste them in my browser...

https://cudl.colorado.edu/MediaManager/srvr?mediafile=/Size0/CUB~8~8/4692/musm_zooLed_birds_v5_13692-13733.jpg

dustymc commented 3 years ago

I've set up various things to prevent situations that users will interpret as "Arctos is broken." This one is intended to prevent things like trying to use your local drive to host the previews, but it does so by making a HEAD request which that server doesn't like for some reason. There's nothing terribly fatal about dropping that and just accepting whatever you type in; I'd expect it to result in more broken links and such, but it could be argued that that's your problem, not something Arctos needs to deal with.

But, there's a similar check before the thumbs are displayed, and I don't think it would be so trivial to bypass without actually breaking Arctos.

I'm up for about anything, I don't have any great suggestions other than asking them to fix their server.

ebraker commented 3 years ago

Ok, ah well. To keep it simple I ended up just loading a generic image of a catalog cover for all the ledger page media.

ebraker commented 3 years ago

I have a csv with 12,000 media relationships to add. I loaded a test of 130 and received the following error:

An error occurred while processing this page! Message: Error invoking external process Detail: psql:/usr/local/webroot/temp/excopy_ebraker_20210109040102655_522.sql:130: ERROR: invalid input syntax for type bigint: "UCM:Bird:13944" CONTEXT: COPY cf_temp_media_relations_ldr, line 2, column related_key: "UCM:Bird:13944"

It seems as though the bulk media relationships loader does not like GUIDs as related_keys?

media_relationships_test.zip

dustymc commented 3 years ago

No the key is an integer, the loader doesn't yet support what you're trying to do.

I can get the key for you if you want to send a CSV of your GUIDs, or I should be able to add a guid-handler to the loader in the next few days (or both).

ebraker commented 3 years ago

Thanks! I'm fine with either option, though I would imagine a GUID handler is more widely applicable to users.

csv attached just in case (though I still need to figure out how to semi-gracefully transform this dataset into three columns....) bird_ledger_media_relationships_2020-12-22_v2.zip

dustymc commented 3 years ago

I'll try to get a guid/string handler patched in in the next few days. I did this - results attached.

create table temp_m as select * from temp_cache.temp_dlm_uptbl ;

create table temp_m2 (
  media_id bigint,
  related_term varchar,
  relationship varchar
);

CREATE OR REPLACE function tttemp() returns void AS $body$
declare 
    c bigint;
    mtt varchar;
    mrt varchar;
    s varchar;
BEGIN
    for c in 1..43 loop
      mtt='media_related_term_'||c;
      mrt='media_relationship_'||c;
      s:='insert into temp_m2 (media_id,related_term,relationship) (select media_id::bigint,' || mtt|| ',' || mrt || ' from temp_m)';
      execute s;
    end loop;
end;
$body$
LANGUAGE PLPGSQL
SECURITY DEFINER
 volatile;

select tttemp();
delete from temp_m2 where related_term is null;
alter table temp_m2 add related_key bigint;
update temp_m2 set related_key=(select collection_object_id from flat where flat.guid=temp_m2.related_term);

temp_m2.csv.zip

ebraker commented 3 years ago

@dustymc sweet relief!!! Thank you so much for reformatting the spreadsheet! Magic indeed.

dustymc commented 3 years ago

Next release has string-resolution for cataloged_item (and a mechanism to add it for other relationships).

ebraker commented 2 years ago

@dustymc can you do your transform magic on this giant spreadsheet like the last go-around? It's another media relationships bulkload - this time for the UCM herp ledgers (all GUIDs with "documents cataloged item" relationships to the media ID). I've already created media for the ledger pages in Arctos and added in the media keys to this spreadsheet, I just need the GUIDs transformed from columns to rows... herp_GUIDs_mediaIDs.csv

dustymc commented 2 years ago

Do you have some fancy magic that'll turn the headers into something my tools can deal with? arctos_guid_4 would be nice. If not I'll find a way, but maybe that's 5 seconds in whatever you used to get here??

ebraker commented 2 years ago

No problem! Here you go: herp_GUIDs_mediaIDs_2.csv

dustymc commented 2 years ago

Here's data and SQL. I'm a little dizzy from all the looping, but thinking of https://github.com/ArctosDB/arctos/issues/5193 I am excited to learn that there really are 500-column spreadsheets in the wild!

drop table temp_m;
create table temp_m as select * from temp_cache.temp_dlm_uptbl ;
drop table temp_m2;
create table temp_m2 (
  media_id bigint,
  related_term varchar
);
CREATE OR REPLACE function tttemp() returns void AS $body$
declare 
    c bigint;
    mtt varchar;
    mrt varchar;
    s varchar;
    r record;
BEGIN
  for r in (select * from temp_m) loop
    for c in 1..437 loop
      mtt='arctos_guid_'||c;
      s:='insert into temp_m2 (media_id,related_term) (select media_id::bigint,' || mtt|| ' from temp_m where media_id::int=' || r.media_id || ')';
      execute s;
    end loop;
  end loop;
end;
$body$
LANGUAGE PLPGSQL
SECURITY DEFINER
 volatile;
select tttemp();
delete from temp_m2 where related_term is null;
DELETE 707325
select count(*) from temp_m2;
 count 
-------
 59610

temp_m2.csv.zip

ebraker commented 2 years ago

@dustymc THANK YOU! That was some serious spreadsheet action. It was originally 875 columns to begin with since the library included Arctos URLs in addition to GUIDs. They also pull in world cat, getty and loc.gov taxonomies on their end based on Arctos fields, so I'd wager their version approaches or exceeds 1000 columns(!).

I'm loading these relationships currently. I will have at least one more translation exercise for you, but nothing anywhere near this size (mammal ledger series). I think the eggs and nests and fish series I can do on my own the old fashioned-way. Thanks again!

ebraker commented 2 years ago

@dustymc With Arctos down/restart this morning, it looks like my massive bulkload media relationships operation was interrupted. I was on the second round of 2500 out of 59600 relationships by the end of the day yesterday. Based on the speed, I think it will take another few days to ingest everything (hopefully that isn't what broke Arctos!). What is the best approach - should I just reload the spreadsheet knowing I'll get a lot (~5000?) duplicate errors? Or is there an easy way to know what's been linked and I can delete these from the spreadsheet? herp relationship load.zip

dustymc commented 2 years ago

Did you load part of the spreadsheet to the loader? I'm not sure I understand and I don't see anything of yours in the loader, but if you want to just load everything I can check what's been processed, or we can let the loader figure it out. (Or maybe you can't load everything?? If so let me know, I can - I hope...)

The loader shouldn't be able to break anything, and I can probably make it slightly faster (and so slightly more likely to break something...) if you want to let me know when it's running.

ebraker commented 2 years ago

@dustymc Yes, I loaded that spreadsheet yesterday and the loader was breaking it into 2500 record chunks. I got through one 2500 chunk and had set the next round to autoload around 8pm last night. Now that Arctos restarted, the loader is empty, but I confirmed that the media linkages don't exist for all 59600 records. I'll need to rerun the spreadsheet, but given it took 10 hours for 2500 to load, I was hoping I could shave off what records have already been ingested from the spreadsheet so it wouldn't have to handle duplicates (I just don't know which ones they are).

dustymc commented 2 years ago

Nevermind, I was looking at the wrong loader...

As far as I can tell it's properly deleting things as they load, I don't think there's any problem (but let me know specifics if you see something). Want me to set them all to autoload so you don't have to mess with tiny batches?

ebraker commented 2 years ago

Oh weird, now I can see the review and load page (it wasn't showing up before - thus my earlier comments). Yeah, you can set them all to autoload vs tiny batches (thanks).

Also, I tried doing the column transform by hand but your looping magic would be much more efficient... fish ledger relationships.csv egg ledger relationships.csv

dustymc commented 2 years ago

You should be autoloading.

here's fish. temp_m2(1).csv

CSV the other plz

ebraker commented 2 years ago

Thanks! CSV is up there (attached the wrong version the first time). Thanks!

dustymc commented 2 years ago

egg: temp_m2(2).csv

ebraker commented 2 years ago

Wahoo! Thanks!

ebraker commented 2 years ago

@dustymc aaaaaaand a couple more spreadsheet translations (thanks!!! I'm getting close to complete...): herp vol 1 - relationships.csv mammal ledger relationships.csv

dustymc commented 2 years ago

Archive 2.zip

ebraker commented 2 years ago

@dustymc Ok, I believe this is the last of the translations that I need (then the vast majority of UCM records will have associated ledger media!): mixed bird and mamm ledgers - relationships.csv Colorado College ledgers - relationships.csv osteology ledger data - relationships.csv

ebraker commented 1 year ago

@dustymc - see above. Would you be able to transform these at your convenience? Last round!

dustymc commented 1 year ago

Sorry, lost track of this. Here's the first.

temp_m2(5).csv.zip

The second has empty rows and the last I think has empty columns - could you check that and resend? (Or I can probably fix but I'd rather avoid that bug entry point if it's easy for you.)

ebraker commented 1 year ago

@dustymc Thank you! These are a major mess when I get them. Slightly better? Colorado College ledgers - relationships.csv osteology ledger data - relationships.csv

dustymc commented 1 year ago

Slightly better?

Nope, lots better.

major mess when I get them

I can only imagine!

Archive 3.zip

ebraker commented 1 year ago

Wahoo! THANK YOU!!!

ArctosDB / arctos

Request to add more media relationships (current loader maxes out at 5 relationships) #3314