gbif / ipt

GBIF Integrated Publishing Toolkit (IPT)
https://www.gbif.org/ipt
Apache License 2.0

Publishing datasets using Python script fails or is very unstable after IPT was updated to v2.7.2 #1973

Closed fujiokae closed 1 year ago

fujiokae commented 1 year ago

We have more than 1,000 datasets registered in our IPT. To add new datasets or update existing ones, we use a Python script to automate the process. The script worked fine with the previous version of the IPT, but since the IPT was updated to 2.7.2, publishing through the script has been very unstable. The script updates the EML and resource files, uploads them to the server, and sends a request to publish the dataset.

Sometimes the process completes successfully, but the IPT does not recognize it or stops loading datasets (the home page shows far fewer datasets than there should be, e.g. "400 resources" instead of "1,113 resources") until I go to "Administration" - "IPT Setting" and, without clicking [Save] or [Cancel], just go back to the home page; then the IPT lists all the datasets. More often, though, the IPT fails to recognize the updated and uploaded EML file and stops loading datasets. In that case, when I go to "Administration" - "IPT Setting", it shows error logs which usually say "Failed to reconstruct resource: /var/lib/ipt/resources/zd_1931/eml-1.9.xml not found!", where zd_1931 is the dataset I was trying to update (the previous version was 1.8 and the new version is supposed to be 1.9).

On other occasions, the update seems successful but the change summary is missing (even though the script includes the change summary text), and I have to add it manually.

So the outcome differs case by case, which makes this very hard to troubleshoot. Please investigate this issue. If there is an "official" way (e.g. an API) to add or update datasets through scripts, please let me know. Updating and publishing datasets manually is far too much work (sometimes I have to update ~100 datasets), so using the script is critical.

An excerpt of the script that sends a publish request:

# Cookie-based authentication
import sys

import requests

ipt_server = '....'   # IPT host name
hostUrl = 'https://%s/' % ipt_server
loginFormUrl = hostUrl + 'login.do'
loginUrl = hostUrl + 'login.do'
publishUrl = hostUrl + 'manage/publish.do'

# Input parameters we are going to send
payload = {
    'email': '....',
    'password': '....',
    'csrfToken': ''
}

s = requests.Session()
response = s.post(loginFormUrl, verify=True)
if response.status_code != 200:
    print("Failed to initiate login process: %s" % response.status_code)
    sys.exit()

# The CSRF token set by the login form has to be sent back with the credentials
payload['csrfToken'] = s.cookies.get('CSRFtoken')
response = s.post(loginUrl, data=payload)
if response.status_code != 200:  # Could get a 502 Bad Gateway error.
    print("Failed to log in: %s" % response.status_code)
    sys.exit()
else:
    print("Successfully logged in to %s" % hostUrl)

# Send a publish request
resource = "(dataset name)"
params = {
    'r': resource,
    'autopublish': '',
    'currPubMode': 'AUTO_PUBLISH_OFF',
    'pubMode': '',
    'currPubFreq': '',
    'pubFreq': '',
    'publish': 'Publish',
    'summary': "change summary text..."
}

contents = s.post(publishUrl, data=params)
# If successful, the response is <200>, but it is often <404>.
bart-v commented 1 year ago

Hi Ei,

For us (EurOBIS) this still works properly for 1092 datasets. We only register the DwC endpoint and let the actual publishing happen via the "publish all" button. Even in 2.7.2 it is still very stable.

# Register the DwC-A endpoint with the GBIF registry
$endpoint = [ "type" => "DWC_ARCHIVE", "url" => "http://{$ipt_host}/{$ipt_path}/archive.do?r=".$dataset_name ];
$endpoint_json = json_encode($endpoint);
$out = $http->post( GBIF_API_BASEURL . "/dataset/{$dataset_key}/endpoint", $endpoint_json, ["Content-Type"=>"application/json"] );

All based on https://github.com/gbif/registry/blob/master/registry-examples/src/test/scripts/register.sh
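For readers working in Python, a rough equivalent of the endpoint registration above might look like the sketch below. This is only a sketch: GBIF_API_BASEURL, dataset_key, and the credentials are placeholders, and the GBIF account used must be allowed to edit the dataset in the registry.

# Register the IPT's DwC-A URL as an endpoint of an already-registered dataset
import requests

GBIF_API_BASEURL = 'https://api.gbif.org/v1'   # or the UAT registry for testing
dataset_key = '....'                           # UUID of the registered dataset
dataset_name = 'zd_1931'                       # IPT resource short name (example)

endpoint = {
    'type': 'DWC_ARCHIVE',
    'url': 'https://ipt.example.org/archive.do?r=%s' % dataset_name,
}

# The registry expects the credentials of an account with rights on the dataset
r = requests.post(
    '%s/dataset/%s/endpoint' % (GBIF_API_BASEURL, dataset_key),
    json=endpoint,
    auth=('gbif_user', 'gbif_password'),
)
print(r.status_code, r.text)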

fujiokae commented 1 year ago

Thanks for the quick reply. Could you elaborate a bit more? I'm not clear on what you mean by "only register the DwC endpoint". How do you update the metadata (EML) by script? When you click "Publish all", are all datasets published with their version numbers incremented, whether they were updated or not? Or is this function smart enough to publish only the datasets that have newer EML / resource files?

mike-podolskiy90 commented 1 year ago

@fujiokae Thank you for contacting us. I'll see what we can do. Have you spotted any errors in the logs?

abubelinha commented 1 year ago

@fujiokae would you mind sharing that Python script? (I'm hoping to do something similar.)

Thanks, @abubelinha

fujiokae commented 1 year ago

As I mentioned in my first post, the logs say "Failed to reconstruct resource: /var/lib/ipt/resources/zd_1931/eml-1.9.xml not found!"

I think that when the IPT experiences this error, it stops loading other datasets. I uploaded an updated EML file (eml.xml) and resource.xml for this dataset; I did not upload eml-1.9.xml. I guess the IPT tried to copy the latest eml.xml into eml-1.9.xml but failed? The version number is indicated in resource.xml (the XML tags were stripped from this excerpt): 1.9 ... 1.9 / 2023-03-16 00:30:57 UTC / UNRESERVED / PUBLIC / Monthly update / 1031 ...

and also in eml.xml: <eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" xmlns:dc="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 http://rs.gbif.org/schema/eml-gbif-profile/1.1/eml.xsd" packageId="https://ipt.env.duke.edu/resource?id=zd_1931/v1.9" system="http://gbif.org" scope="system" xml:lang="eng">

Strangely, after seeing this error, once I publish the dataset manually the IPT returns to normal.

mike-podolskiy90 commented 1 year ago

@fujiokae I have tried your script on our UAT IPT and it went fine.

What exactly do you update/upload, please? Could it be an issue with read/write permissions?

During publication, one of the steps is EML publishing, in which the IPT creates a versioned EML file and copies the data from eml.xml into it. If that step fails, the IPT alerts the user in the UI about the exception and reverts the EML version, but it does not log an error. This could be your issue; we need to add more error logging here.
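If silent failures like this are a concern, one client-side check is to fetch the resource's EML after publishing and confirm that the packageId version actually advanced. This is a minimal sketch only; it assumes the IPT's standard eml.do endpoint and a packageId of the form ".../resource?id=<name>/v<version>" as in the excerpt above.

# Check which version of a resource the IPT is currently serving
import xml.etree.ElementTree as ET
import requests

def published_version(host_url, resource):
    """Return the version suffix from the packageId of the EML the IPT serves."""
    r = requests.get(host_url + 'eml.do', params={'r': resource})
    r.raise_for_status()
    package_id = ET.fromstring(r.content).attrib.get('packageId', '')
    return package_id.rsplit('/v', 1)[-1] if '/v' in package_id else None

# e.g. after publishing zd_1931, check that the served version moved to 1.9
if published_version(hostUrl, 'zd_1931') != '1.9':
    print('zd_1931 is not at version 1.9 yet -- check the IPT logs')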

fujiokae commented 1 year ago

Thanks for testing the code, and sorry I didn't show in the sample code that the publishing request runs in a loop; I wanted to keep it simple. Yes, sending a publishing request for a single dataset usually succeeds, but when the requests run in a loop over many datasets, the second request gets a <404> from the IPT. I had a similar issue with the previous version and thought the IPT needed more time to process, so I added time.sleep(duration), where duration was 20 seconds or so. With IPT 2.7.2, though, this approach still fails with a <404> error.

Here is the code with the loop (only the # Send a publish request part):

# Send a publish request for each resource
# (requests, s, publishUrl and resourcesToUpdate come from the earlier part of the script)
import time

duration = 20
for resource in resourcesToUpdate:
    try:
        params = {
            'r': resource,                      # resource = dataset name
            'autopublish': '',
            'currPubMode': 'AUTO_PUBLISH_OFF',
            'pubMode': '',
            'currPubFreq': '',
            'pubFreq': '',
            'publish': 'Publish',
            'summary': "change summary text..."
        }

        contents = s.post(publishUrl, data=params)
        # The first dataset tends to succeed with <200>; from the second
        # dataset onward the response tends to be an error with <404>.

        time.sleep(duration)
    except requests.exceptions.HTTPError as e:
        # requests raises its own exception types (not urllib2's)
        print('HTTPError = ' + str(e))
    except requests.exceptions.RequestException as e:
        print('RequestException = ' + str(e))
    except Exception:
        print('generic exception')
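The fixed sleep could also be replaced by a small retry loop, so that a transient <404> is retried instead of losing the dataset. This is a sketch against the same session s and publishUrl as above; the attempt count and delays are arbitrary.

import time

def publish_with_retry(session, publish_url, params, attempts=5, delay=20):
    """POST the publish request, retrying with a growing delay on non-200 responses."""
    response = None
    for attempt in range(1, attempts + 1):
        response = session.post(publish_url, data=params)
        if response.status_code == 200:
            break
        print('Attempt %d for %s returned %s, retrying...'
              % (attempt, params['r'], response.status_code))
        time.sleep(delay * attempt)
    return response
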
mike-podolskiy90 commented 1 year ago

@fujiokae Thank you, I understand that it runs in a loop. I tried it with a bunch of resources and it went just fine. I can create an account for you on one of our test IPTs, and you can try running your script there if you like.

fujiokae commented 1 year ago

Thanks, @mike-podolskiy90. If you can create a test account for me, that would be great! I will also consult with the IT person in my lab to see if there are server or network settings that may be causing the problem...

mike-podolskiy90 commented 1 year ago

@fujiokae No problem! Just give me your email address please, so I can send you credentials

fujiokae commented 1 year ago

Mine is efujioka@duke.edu. Thanks.

fujiokae commented 1 year ago

Thanks, @mike-podolskiy90, for taking care of this. Really appreciate it.

I have mostly solved the issue. My script stopped the IPT server (> systemctl stop ipt) before uploading the updated eml.xml and resource.xml, then restarted the IPT (> systemctl start ipt) before sending the publish requests. It seems the IPT needs a lot of time after a restart before it is ready to accept requests. I modified the script so it no longer stops & restarts the IPT, and the workflow then went through without any issues!

The only remaining issue is that the IPT does not seem to recognize a new dataset when the script creates a new folder and uploads a new eml.xml and resource.xml while the IPT is running. Is that true? Does the IPT scan the folders at an interval (e.g. every 30 minutes or so) to find new entries? Is there a request that makes the IPT look for new entries on demand?

mike-podolskiy90 commented 1 year ago

Glad to hear you figured it out!

No, the IPT does not recognize externally created resources until you restart it, because it loads the resources on startup. And there is no such request, unfortunately.

rukayaj commented 1 year ago

Not to hijack this thread, but @fujiokae I'm interested in why you do data publication through the IPT at all, when it sounds like you are generating your own EML and ready-mapped DwC files? It seems like it would be simpler to just host them somewhere and publish using the GBIF API?

fujiokae commented 1 year ago

@rukayaj Good question. I didn't know there was a way to register datasets using the GBIF API. That sounds wonderful, but we maintain a huge catalog of marine megavertebrates called OBIS-SEAMAP, which is a node of OBIS, and OBIS harvests its nodes' data through their IPT servers. That's the sole reason we stick with the IPT.