gbif / ipt

GBIF Integrated Publishing Toolkit (IPT)
https://www.gbif.org/ipt
Apache License 2.0

Metadata doesn't save #1925

Closed AMNHcjohnson closed 1 year ago

AMNHcjohnson commented 1 year ago

Hi, I am trying to upload a new dataset to GBIF. However, whenever I fill out the metadata section and save it, nothing is saved when I go back to the resource page, and I have to fill the info out all over again. I've done it about 5 times now... what am I doing wrong? Chris

mike-podolskiy90 commented 1 year ago

@AMNHcjohnson Thank you for contacting us. I need some more information please - what IPT version do you use? Do you have any exceptions displayed?

AMNHcjohnson commented 1 year ago

Hi Mikhail,

It's Integrated Publishing Toolkit (IPT) Version 2.5.5-ra872e56

Probably out of date - it's also not showing the managed resource file I had uploaded. I didn't see any exception or anything that would indicate the information wouldn't save.

Thanks. Chris

mike-podolskiy90 commented 1 year ago

I don't remember anything like that. Could you send me your IPT logs please? Or provide me with administrator rights for your IPT? And, if possible, I would recommend updating your IPT to the most recent version (currently 2.6.3).

AMNHcjohnson commented 1 year ago

Hi, I will have to ask our IT department to upgrade to the new version. I am not sure where to go to get the log, but as you may be able to tell below, AMNH-Crustacea is not showing up in my managed resources, yet I cannot "create a new resource" with the same name because it says it already exists.

Where would I find the logs for this?

Thanks. Chris

mike-podolskiy90 commented 1 year ago

I'm sorry, I don't quite understand: you can't see the resource now? Did you delete it, or did it just disappear? The log file is available to admin users under Administration -> Logs, or you can download it directly from the server: IPT data dir -> logs

AMNHcjohnson commented 1 year ago

Correct I don’t see it. I never deleted it.

Christine Johnson, Curatorial Associate American Museum of Natural History

AMNHcjohnson commented 1 year ago

Hi Mikhail,

This morning when I logged in, I could see my Crustacea resource.

I went ahead and published it (or set it from private to public). How long does it take to determine whether everything is correct? In the log there are a few question marks – does that mean these are incorrect? Or is it still determining whether all is true?

In addition, I am a bit confused as to why I have datasets visible on GBIF, but when I search under our institution code, the records don’t appear, although the institution code field is populated in these files.

Here is the log:

Archive generation started for version # 1.0
Start writing data file for Darwin Core Occurrence
No lines were skipped due to errors for mapping Darwin Core Occurrence in source amnhcrustaceacollection202317b
No lines were skipped due to errors for mapping Darwin Core Occurrence in source amnhcrustaceacollection202317b
No lines with fewer columns than mapped for mapping Darwin Core Occurrence in source amnhcrustaceacollection202317b
All lines match the filter criteria for mapping Darwin Core Occurrence in source amnhcrustaceacollection202317b
Data file written for Darwin Core Occurrence with 15285 records and 53 columns
All data files completed
EML file added
meta.xml archive descriptor written
Validating the core file: occurrence.txt. Depending on the number of records, this can take a while.
? Validating the core basisOfRecord is always present and its value matches the Darwin Core Type Vocabulary.
? Validating the core ID field occurrenceID is always present and unique.
No lines are missing occurrenceID
No lines have duplicate occurrenceID
✓ Validated each line has a occurrenceID, and each occurrenceID is unique
No lines are missing a basisOfRecord
All lines have basisOfRecord that matches the Darwin Core Type Vocabulary
No lines have ambiguous basisOfRecord 'occurrence'.
✓ Validated each line has a basisOfRecord, and each basisOfRecord matches the Darwin Core Type Vocabulary
Archive validated
Archive has been compressed
Archive version # 1.0 generated successfully!

mike-podolskiy90 commented 1 year ago

I'm glad to hear you managed to publish your resource. A question mark in the publication log simply indicates that a validation step has started. As you can see further down in the log, the IPT reported that everything completed successfully.

What is your dataset please? After publishing in the IPT it might take some time for the dataset to be indexed by GBIF.
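
For readers scanning a long publication log, the explanation above can be turned into a small helper. This is only a sketch: the function name `classify_log_line` and the labels it returns are made up for illustration, and it assumes that every validation step starts with `?` and every passed check starts with `✓`, as in the log quoted in this thread.

```python
def classify_log_line(line: str) -> str:
    """Label one IPT publication log line by its leading marker.

    Assumption (from this thread only): '?' means a validation step
    has started, and '✓' means that step finished successfully.
    """
    text = line.strip()
    if text.startswith("?"):
        return "validation started"
    if text.startswith("✓"):
        return "validation passed"
    return "info"


# Three lines copied from the log quoted earlier in this thread.
sample_log = [
    "? Validating the core ID field occurrenceID is always present and unique.",
    "No lines are missing occurrenceID",
    "✓ Validated each line has a occurrenceID, and each occurrenceID is unique",
]

for entry in sample_log:
    print(f"{classify_log_line(entry):<20} {entry}")
```

A `?` line on its own is therefore not an error; what matters is whether a matching `✓` line follows it.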

AMNHcjohnson commented 1 year ago

Hi Mikhail,

Sorry for the bother again, but something still isn't working correctly. Although it shows there is a dataset, the dataset search comes up with 0 occurrences. Can someone please help me determine why this is so?

Chris

AMNHcjohnson commented 1 year ago

Hi Again,

In my search for my Crustacea records, I came across this "finding" - something seems very off here. Crustace in GBIF backbone is a genus under the bee family Apidae.

It looks like the Benthic Baseline Biodiversity Survey has the wrong taxon string.

https://www.gbif.org/dataset/36449c1f-679d-4235-b34e-1c275ebcd968

Chris

mike-podolskiy90 commented 1 year ago

@AMNHcjohnson I'm glad to help, but I don't know what dataset we're talking about. Could you send me the link please? And, if possible, create an admin account in your IPT, that would help to diagnose what's going on.

mike-podolskiy90 commented 1 year ago

@ManonGros Could you assist with this please?

AMNHcjohnson commented 1 year ago

Thanks Mikhail, @ManonGros has access - I need to request IT to add you as well. I already asked them to install the updated release but that hasn't happened yet.

What email should I give our IT for you?

I realize there is something wrong with my file - when I try to keep the dates from turning into text in excel, I think something else went awry with my file.

I'm always so close but never can get over this hurdle and I have 750K records I would like to share.

Thanks. Chris

mike-podolskiy90 commented 1 year ago

mpodolskiy@gbif.org

AMNHcjohnson commented 1 year ago

Hi again, here is the link. I've asked IT to create an admin account for you.

https://ipt.amnh.org/manage/resource.do?r=amnh-crustacea

The resource is amnh-crustacea

Chris

AMNHcjohnson commented 1 year ago

Hi Mikhail,

Okay - our IT department has updated the IPT version & added you as a managed user (you should have received an email from them). I detected some errors in my upload file, which I fixed, and published a new dataset. It looks like everything is fine, however, I still see 0 occurrences for this dataset.

https://www.gbif.org/dataset/a8035a1d-e674-4d2a-bb59-b476af6a3d6d

Any help you can provide to identify the misstep would be appreciated (so I can go forward with our remaining datasets).

Best, chris

AMNHcjohnson commented 1 year ago

Hi again Mikhail,

I really need help – I have tried to publish this dataset many, many times – the log says successful, but the ingestion history says finishReason: ABORT.

I don’t understand what is wrong with the file that it can’t be published.

The dataset is American Museum of Natural History (AMNH) Crustacea Collection.

Here are some links:

https://www.gbif.org/dataset/a8035a1d-e674-4d2a-bb59-b476af6a3d6d

https://registry.gbif.org/dataset/a8035a1d-e674-4d2a-bb59-b476af6a3d6d/ingestion-history

Thanks. Chris

ManonGros commented 1 year ago

Hi @AMNHcjohnson I will take a look today

ManonGros commented 1 year ago

@AMNHcjohnson it looks like we are unable to access the archives from your IPT. This could be due to some firewall settings. It looks like it isn't just this dataset, for example the last time we were able to access the archive from this dataset (https://www.gbif.org/dataset/a8035a1d-e674-4d2a-bb59-b476af6a3d6d) was in July 2021. You can find more information in our IPT manual here: https://ipt.gbif.org/manual/en/ipt/latest/installation#opening-the-ipt-to-the-internet

I will close this issue as I don't think this is a problem with the IPT software. Please follow up with us at helpdesk@gbif.org, thanks!

ManonGros commented 1 year ago

@AMNHcjohnson One of my colleagues noticed that your IPT is behind Cloudflare, which is blocking machine access from our servers. You will need to configure Cloudflare to permit access to at least GBIF's servers, 130.225.43.0/25.
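
When reviewing firewall or Cloudflare logs, it can help to check whether a blocked client address falls inside the range quoted above. The following is a minimal sketch using Python's standard `ipaddress` module; the helper name `in_allowed_range` and the sample addresses are illustrative only and are not known GBIF server addresses.

```python
import ipaddress


def in_allowed_range(client_ip: str, cidr: str) -> bool:
    """Return True if client_ip lies inside the CIDR block."""
    return ipaddress.ip_address(client_ip) in ipaddress.ip_network(cidr)


# Range as quoted in the comment above.
GBIF_RANGE = "130.225.43.0/25"

# Sample addresses chosen for illustration only.
print(in_allowed_range("130.225.43.10", GBIF_RANGE))   # True
print(in_allowed_range("130.225.43.200", GBIF_RANGE))  # False
print(in_allowed_range("8.8.8.8", GBIF_RANGE))         # False
```

Note that a /25 covers only 130.225.43.0 to 130.225.43.127. A later comment in this thread quotes 130.225.43.0/24, which covers the full 130.225.43.0 to 130.225.43.255 block, so it is worth confirming the exact range with GBIF before writing the rule.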

AMNHcjohnson commented 1 year ago

Thanks Marie!!!

I will forward this to our IT department. What a relief.

Chris

bvirgilioamnh commented 1 year ago

Hey All! AMNH IT Here :)

I'll dig into the logs on our side of things, but my guess is that we're blocking it because it is automated/bot traffic. While we most certainly can add the range to our allow list, it isn't the preferred solution, as it does negate some security controls. We heavily leverage Cloudflare's Bot Management solution to help mitigate aggressive crawlers and data scrapers; unfortunately, some legitimate solutions do run afoul of this. Coincidentally, July 2021 is when we enabled this service within Cloudflare, so that adds up nicely.

Do the GBIF servers make requests to servers that include a specific user agent (e.g. GBIF Metadata Bot v1.0) instead of a generic user agent (e.g. Curl, Python Requests, etc)? If not, that'd be the first step. And then from there you can request that Cloudflare marks the bot as verified. We're happy to leverage our account and support team at Cloudflare to help assist with this if necessary.

https://developers.cloudflare.com/bots/reference/verified-bots-policy/

You can submit the bot verification on their Google Form: https://forms.gle/pWVxfCj6cQgWGxDp9

Source documentation for the Google Form link (because why is Cloudflare using Google Forms for this? I'm not entirely sure...) https://blog.cloudflare.com/friendly-bots/

-Ben

MattBlissett commented 1 year ago

Hi Ben,

The IPT tool provides a managed data repository; the purpose is to allow programmatic access to the published data, with GBIF as the primary user.

I have completed the form, though I doubt we meet the scale Cloudflare requires. I think there are only 4 IPT installations behind Cloudflare, and yours is the only one with these tightened security settings. For https://ipt.amnh.org/ we would normally make 8 HTTP requests per week.

Our user agents include COLServer (COLServer/24a3ae9 2022-12-20), org.GBIF.utils/1.16 (Java/11.0.17; M-1800000-25-2; +https://www.gbif.org/), GBIF-Url-Validator and Thumbor/6.7.0. As far as I know, no-one is currently using user agents to allow/block access to an IPT, so we have not made any particular effort to align or maintain these. A few publishers do limit access to 130.225.43.0/24.

Other biodiversity systems or researchers also access IPTs using various tools or scripting languages. In the last week, I can see two researchers/groups have used Python and RStudio to query IPTs at https://cloud.gbif.org/. Blocking Python, Curl etc will block these users.

Matt
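
To see how a bot-management layer treats a particular user agent, an administrator could send a request for a resource's archive with that user agent and look at the status code. The sketch below uses only the Python standard library. It assumes the archive is served at `<base>/archive.do?r=<shortname>`, and the helper names `build_archive_request` and `probe` are hypothetical; the user agent string is the one quoted in the comment above.

```python
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_archive_request(base_url: str, resource: str, user_agent: str) -> Request:
    """Build a GET request for a resource's archive with a chosen User-Agent.

    Assumption: the archive lives at <base>/archive.do?r=<shortname>.
    """
    url = f"{base_url.rstrip('/')}/archive.do?{urlencode({'r': resource})}"
    return Request(url, headers={"User-Agent": user_agent})


def probe(request: Request, timeout: float = 10.0) -> str:
    """Send the request and describe the outcome without reading the body."""
    try:
        with urlopen(request, timeout=timeout) as response:
            return f"{response.status} {response.reason}"
    except HTTPError as err:
        # A 403 here would be consistent with the request being blocked.
        return f"{err.code} {err.reason}"
    except OSError as err:
        # Covers URLError, DNS failures and timeouts.
        return f"unreachable: {err}"


# User agent string as quoted in the comment above.
GBIF_UTILS_UA = "org.GBIF.utils/1.16 (Java/11.0.17; M-1800000-25-2; +https://www.gbif.org/)"

request = build_archive_request("https://ipt.amnh.org/", "amnh-crustacea", GBIF_UTILS_UA)
print(request.full_url)
print(request.get_header("User-agent"))
```

Calling `probe(request)` from a machine outside the institution's network and comparing the result with a browser-like user agent may show whether the block depends on the user agent, the source address, or both. A request from an arbitrary machine will not reproduce GBIF's source addresses, so a success there does not prove that GBIF's crawler gets through.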

bvirgilioamnh commented 1 year ago

Ahh ok understood. Thanks for submitting it anyways, I'll pass this up to our account rep at Cloudflare just to let them know. Un/fortunately the way the bot management works is essentially based on "machine learning" (of course taken with a grain of salt 😄) and is built off the reputation of known user agents, we're not explicitly allowing/denying them. We're just given the ability to say block automated traffic, allow "good bots", captcha "likely automated" and ultimately try to balance accessibility with excessive scraping (and other more security related issues) across all of our sites.

We'll review implementing IP level controls to address this on our end.

Thanks Matt!