NCEAS / metadig-engine

MetaDig Engine: multi-dialect metadata assessment engine

test metadig-engine on k8s against a hashstore #453

Open jeanetteclark opened 1 month ago

jeanetteclark commented 1 month ago

Testing locally has gone well, but it would be nice to test the engine against a hashstore on the dev cluster.

To that end, I've mounted the tdg subvolume on metadig-worker; that subvolume is also mounted on dev.nceas, where a hashstore-enabled Metacat is running. See helm/metadig-worker/pv.yaml and helm/metadig-worker/pvc.yaml for details on the existing mounts.
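
For reference, a quick way to confirm the subvolume is visible from inside a worker pod (the namespace and deployment name below are assumptions based on the helm commands later in this thread; the actual mount path is whatever pv.yaml/pvc.yaml declare):

# Check that the PVC is bound and that the CephFS mount appears in the pod
kubectl -n metadig get pvc
kubectl -n metadig exec deploy/metadig-worker -- df -h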

In order to actually test, though, the following steps are needed:

doulikecookiedough commented 1 month ago

Update:

The rsync + parallel process to copy the contents of /var/metacat/hashstore to /mnt/tdg-repos/dev/metacat/hashstore has been completed.

Next Steps:

To Do List:

For reference:

# How to produce a text file with just the first level of hashstore folders to rsync
mok@dev:~/testing$ sudo find /var/metacat/hashstore -mindepth 1 -maxdepth 1 > mc_hs_dir_list.txt
mok@dev:~/testing$ cat mc_hs_dir_list.txt
/var/metacat/hashstore/objects
/var/metacat/hashstore/metadata
/var/metacat/hashstore/refs
/var/metacat/hashstore/hashstore.yaml

# How to use rsync with a list of folders
mok@dev:~/testing$ cat mc_hs_dir_list.txt | parallel --eta sudo rsync -aHAX {} /mnt/tdg-repos/dev/metacat/hashstore/
# First get the list of files found under `/hashstore`
mok@dev:~/testing$ sudo find /var/metacat/hashstore -type f -printf '%P\n' > mc_obj_list.txt

# How to feed rsync one file at a time
# The /./ between `metacat` and `hashstore` instructs rsync to copy from hashstore onward (omitting the preceding directories) into the destination folder
mok@dev:~/testing$ parallel --eta sudo rsync -aHAXR /var/metacat/./hashstore/{} /mnt/tdg-repos/dev/metacat :::: mc_obj_list.txt
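
# A rough way to sanity-check the copy afterwards (illustrative, not the exact commands run):
# a dry run should report nothing left to transfer, and the file counts should match
sudo rsync -aHAXn --itemize-changes /var/metacat/hashstore/ /mnt/tdg-repos/dev/metacat/hashstore/ | head
sudo find /var/metacat/hashstore -type f | wc -l
sudo find /mnt/tdg-repos/dev/metacat/hashstore -type f | wc -l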

doulikecookiedough commented 1 month ago

Metacat on dev.nceas.ucsb.edu has been moved over to write to the CephFS mount point - a symlink has been created from /var/metacat/hashstore to /mnt/tdg-repos/dev/metacat/hashstore.

rsync was re-run; syncing from the list of direct subfolders under /var/metacat/hashstore was the fastest approach. I also tested feeding rsync individual files (e.g. via :::: list_of_files.txt), but this was very slow. The re-sync process took approximately 5 minutes.
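
For reference, the cutover described above amounts to roughly the following (the exact steps and backup name are assumptions; Metacat should be stopped or quiesced before the swap):

# Keep the original directory as a fallback (backup name is illustrative),
# then point /var/metacat/hashstore at the CephFS copy
sudo mv /var/metacat/hashstore /var/metacat/hashstore.orig
sudo ln -s /mnt/tdg-repos/dev/metacat/hashstore /var/metacat/hashstore
ls -l /var/metacat/hashstore   # should now show a symlink to the CephFS path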

doulikecookiedough commented 1 month ago

Current Status:

It appears the 'Assessment Reports' (Metadig) for datasets at dev.nceas.ucsb.edu are not working as expected:

Next Steps:

1) Restoring expected Metadig functionality @ dev.nceas.ucsb.edu

2) Obtaining the last missing feature-hashstore-support image for metadig-controller

3) Deploying feature-hashstore-support for Metadig in full on the dev cluster

To Do List & Follow-up Questions

doulikecookiedough commented 1 month ago

Update:

To Do List & Follow-up Questions

doulikecookiedough commented 1 month ago

Update:

doulikecookiedough commented 3 weeks ago

Update:

Even after fixing the connection URL (below), I am still experiencing an HTTP 403 Forbidden error.

String encodedId = URLEncoder.encode(identifier, "UTF-8");
// This is necessary for metacat's solr to process the requested queryUrl
String encodedQuotes = URLEncoder.encode("\"", "UTF-8");
String queryUrl = nodeEndpoint + "/query/solr/?q=isDocumentedBy:" + encodedQuotes + encodedId + encodedQuotes + "&fl=id";

The endpoint shown in the logging message is accessible both from the browser and from within the metadig-worker pod itself. Metacat's Solr index does not have specific access control rules, so this GET request from the metadig-worker should be processable.

doumok@Dou-NCEAS-MBP14.local:~/Code/testing/metadig $ kubectl exec -it metadig-worker-75c5689d69-4tt4v /bin/sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.

# curl "https://dev.nceas.ucsb.edu/knb/d1/mn/v2/query/solr/?q=isDocumentedBy:%22urn%3Auuid%3Aae970e0a-3a26-4af7-8a84-235c9a8e3a5d%22&fl=id"
<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">19</int>
  <lst name="params">
    <str name="q">isDocumentedBy:"urn:uuid:ae970e0a-3a26-4af7-8a84-235c9a8e3a5d"</str>
    <str name="fl">id</str>
    <str name="fq">(readPermission:"public")OR(writePermission:"public")OR(changePermission:"public")OR(isPublic:true)</str>
    <str name="wt">javabin</str>
    <str name="version">2</str>
  </lst>
</lst>
<result name="response" numFound="5" start="0" numFoundExact="true">
  <doc>
    <str name="id">urn:uuid:9ebcadac-b015-48fb-a2c5-1ff7db692f19</str></doc>
  <doc>
    <str name="id">urn:uuid:75db2307-4b78-4a8b-bc59-5b2ce318519f</str></doc>
  <doc>
    <str name="id">urn:uuid:ae970e0a-3a26-4af7-8a84-235c9a8e3a5d</str></doc>
  <doc>
    <str name="id">urn:uuid:52106ea7-f24b-4247-a697-272023fb158e</str></doc>
  <doc>
    <str name="id">urn:uuid:b3dd42d8-7489-4d95-bcba-81940bdefbe2</str></doc>
</result>
</response>

The DATAONE_AUTH_TOKEN does not seem to make any difference (I confirmed that it is set as an environment variable both in the logs and with kubectl exec -t metadig-worker-75c5689d69-4tt4v -- env).
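
To rule out the token itself, the same query can be repeated from inside the pod with the token attached (the Bearer header below is the usual DataONE style; this is a sketch rather than the exact command run):

curl -H "Authorization: Bearer ${DATAONE_AUTH_TOKEN}" \
  "https://dev.nceas.ucsb.edu/knb/d1/mn/v2/query/solr/?q=isDocumentedBy:%22urn%3Auuid%3Aae970e0a-3a26-4af7-8a84-235c9a8e3a5d%22&fl=id"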

# Error log

20241025-21:43:14: [DEBUG]: Running suite: FAIR-suite-0.4.0 [edu.ucsb.nceas.mdqengine.MDQEngine:97]
20241025-21:43:14: [DEBUG]: Got token from env. [edu.ucsb.nceas.mdqengine.MDQEngine:241]
20241025-21:43:16: [DEBUG]: queryURL: https://dev.nceas.ucsb.edu/knb/d1/mn/v2/query/solr/?q=isDocumentedBy:%22urn%3Auuid%3Aae970e0a-3a26-4af7-8a84-235c9a8e3a5d%22&fl=id [edu.ucsb.nceas.mdqengine.MDQEngine:264]
20241025-21:43:16: [ERROR]: Unable to run quality suite. [edu.ucsb.nceas.mdqengine.Worker:224]
edu.ucsb.nceas.mdqengine.exception.MetadigException: Unable to run quality suite for pid urn:uuid:ae970e0a-3a26-4af7-8a84-235c9a8e3a5d, suite FAIR-suite-0.4.0Failed : HTTP error code : 403
    at edu.ucsb.nceas.mdqengine.Worker.processReport(Worker.java:568)
    at edu.ucsb.nceas.mdqengine.Worker$1.handleDelivery(Worker.java:212)
    at com.rabbitmq.client.impl.ConsumerDispatcher$5.run(ConsumerDispatcher.java:149)
    at com.rabbitmq.client.impl.ConsumerWorkService$WorkPoolRunnable.run(ConsumerWorkService.java:111)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.lang.RuntimeException: Failed : HTTP error code : 403
    at edu.ucsb.nceas.mdqengine.MDQEngine.findDataPids(MDQEngine.java:275)
    at edu.ucsb.nceas.mdqengine.MDQEngine.runSuite(MDQEngine.java:120)
    at edu.ucsb.nceas.mdqengine.Worker.processReport(Worker.java:564)
    ... 6 more
20241025-21:43:16: [DEBUG]: Saving quality run status after error [edu.ucsb.nceas.mdqengine.Worker:240]
20241025-21:43:16: [DEBUG]: Saving to persistent storage: metadata PID: urn:uuid:ae970e0a-3a26-4af7-8a84-235c9a8e3a5d, suite id: FAIR-suite-0.4.0 [edu.ucsb.nceas.mdqengine.model.Run:272]
20241025-21:43:16: [DEBUG]: Done saving to persistent storage: metadata PID: urn:uuid:ae970e0a-3a26-4af7-8a84-235c9a8e3a5d, suite id: FAIR-suite-0.4.0 [edu.ucsb.nceas.mdqengine.model.Run:277]
20241025-21:43:16: [DEBUG]: Saved quality run status after error [edu.ucsb.nceas.mdqengine.Worker:249]
20241025-21:43:16: [DEBUG]: Sending report info back to controller... [edu.ucsb.nceas.mdqengine.Worker:390]
20241025-21:43:16: [INFO]: Elapsed time processing (seconds): 0 for metadataPid: urn:uuid:ae970e0a-3a26-4af7-8a84-235c9a8e3a5d, suiteId: FAIR-suite-0.4.0
 [edu.ucsb.nceas.mdqengine.Worker:422]

I have a feeling that this is related to how k8s allows external REST API calls to be made (or not). The specific Java code that makes the GET request appears to be fine (since it can communicate and receives a 403 error). Investigation continues.

mbjones commented 3 weeks ago

> I have a feeling that this is related to how k8s allows external REST API calls to be made (or not).

k8s does not restrict pods from originating web connections to external hosts in any way unless it is configured to do so. MetaDIG is not configured to restrict anything afaik. You and I should touch base on this because I think you are following a red herring and the problem originates elsewhere. Your curl command from the pod shows that the connection is not blocked. So it's something else about how you deployed. Let's chat.

doulikecookiedough commented 3 weeks ago

@mbjones I think so too - I can't find anything related to that. I just pushed a commit to test whether the request is getting rejected because it's missing a User-Agent property. I'll send you a PM via Slack and/or send you a calendar invite.
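
One way to test that hypothesis from inside the pod: curl sends its own default User-Agent (which may be why the earlier curl succeeded while the Java client, which sends none, gets a 403), so suppressing the header should reproduce the failure if that is the cause. A sketch, not the exact command run:

curl -sS -o /dev/null -w "%{http_code}\n" -H "User-Agent:" \
  "https://dev.nceas.ucsb.edu/knb/d1/mn/v2/query/solr/?q=isDocumentedBy:%22urn%3Auuid%3Aae970e0a-3a26-4af7-8a84-235c9a8e3a5d%22&fl=id"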

Deployment code for quick reference (taken from hand-off notes): helm upgrade metadig-worker ./metadig-worker --namespace metadig --set image.pullPolicy=Always --set replicaCount=1 --recreate-pods=true --set k8s.cluster=dev

With the following changes in the respective metadig-worker deployment files:
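
Regardless of the exact file changes, a quick way to confirm the new pod picked up the intended image and settings (resource names assumed):

kubectl -n metadig get pods | grep metadig-worker
kubectl -n metadig describe deployment metadig-worker | grep -i image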

doulikecookiedough commented 3 weeks ago

@mbjones The Assessment Report was generated after adding the User-Agent property to the Java code!

To Do List & Follow-up Questions

doulikecookiedough commented 3 weeks ago

Update:

mbjones commented 3 weeks ago

@doulikecookiedough regarding your question on how to directly communicate with metadig, that would be via the API. Most operations require authentication, but you can, for example, access completed run reports with a request like:

https://api.test.dataone.org/quality/runs/FAIR-suite-0.4.0/urn:uuid:0b44a2d5-dcd5-4798-8072-4030b14e8936

This one doesn't work, as it appears the FAIR-suite-0.4.0 was not run for the PID listed. You can get an overview of the whole API at https://api.test.dataone.org/quality/ -- but note that only a portion of the planned methods were implemented - others are still TBD, and some were disabled for security reasons. A useful one is getting the list of current suites, which is at https://api.test.dataone.org/quality/suites/.
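
For example (URLs as given above; the run request only returns a report once the suite has actually been run for that PID):

# List the registered suites
curl "https://api.test.dataone.org/quality/suites/"
# Fetch a completed run report for a suite/PID pair
curl "https://api.test.dataone.org/quality/runs/FAIR-suite-0.4.0/urn:uuid:0b44a2d5-dcd5-4798-8072-4030b14e8936"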

If the API doesn't provide what you need, you can query the database itself via psql.

doulikecookiedough commented 3 weeks ago

Thank you for the clarification/direction @mbjones. Currently it looks like there's an issue with the scheduler - after restarting the pods (making sure the chart and app versions were both updated), some NullPointerExceptions are being thrown. This may explain why the FAIR-suite-0.4.0 check isn't being run for the new PIDs being added to the urn:node:mnTestKNB node.

20241028-18:02:10: [ERROR]: quality-test-dataone-fair: error creating rest client: Cannot assign field "after" because "link.before" is null [edu.ucsb.nceas.mdqengine.scheduler.RequestReportJob:190]
20241028-18:02:10: [INFO]: Job metadig.quality-test-dataone-fair threw a JobExecutionException:  [org.quartz.core.JobRunShell:218]
org.quartz.JobExecutionException: java.lang.NullPointerException: Cannot assign field "after" because "link.before" is null [See nested exception: java.lang.NullPointerException: Cannot assign field "after" because "link.before" is null]
    at edu.ucsb.nceas.mdqengine.scheduler.RequestReportJob.execute(RequestReportJob.java:191)
    at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557)
Caused by: java.lang.NullPointerException: Cannot assign field "after" because "link.before" is null
    at org.apache.commons.collections.map.AbstractLinkedMap.removeEntry(AbstractLinkedMap.java:293)
    at org.apache.commons.collections.map.AbstractHashedMap.removeMapping(AbstractHashedMap.java:543)
    at org.apache.commons.collections.map.AbstractHashedMap.remove(AbstractHashedMap.java:325)
    at org.apache.commons.configuration.BaseConfiguration.clearPropertyDirect(BaseConfiguration.java:133)
    at org.apache.commons.configuration.AbstractConfiguration.clearProperty(AbstractConfiguration.java:503)
    at org.apache.commons.configuration.CompositeConfiguration.clearPropertyDirect(CompositeConfiguration.java:269)
    at org.apache.commons.configuration.AbstractConfiguration.clearProperty(AbstractConfiguration.java:503)
    at org.apache.commons.configuration.AbstractConfiguration.setProperty(AbstractConfiguration.java:483)
    at org.dataone.client.rest.HttpMultipartRestClient.setDefaultTimeout(HttpMultipartRestClient.java:588)
    at org.dataone.client.rest.HttpMultipartRestClient.<init>(HttpMultipartRestClient.java:222)
    at org.dataone.client.rest.HttpMultipartRestClient.<init>(HttpMultipartRestClient.java:199)
    at org.dataone.client.rest.HttpMultipartRestClient.<init>(HttpMultipartRestClient.java:184)
    at edu.ucsb.nceas.mdqengine.scheduler.RequestReportJob.execute(RequestReportJob.java:188)
    ... 2 more

20241028-18:30:00: [ERROR]: Job metadig.downloads threw an unhandled Exception:  [org.quartz.core.JobRunShell:222]
java.lang.NullPointerException
    at java.base/java.io.FileInputStream.<init>(Unknown Source)
    at java.base/java.io.FileInputStream.<init>(Unknown Source)
    at java.base/java.io.FileReader.<init>(Unknown Source)
    at edu.ucsb.nceas.mdqengine.scheduler.AcquireWebResourcesJob.execute(AcquireWebResourcesJob.java:97)
    at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557)
20241028-18:30:00: [ERROR]: Job (metadig.downloads threw an exception. [org.quartz.core.ErrorLogger:2360]
org.quartz.SchedulerException: Job threw an unhandled exception. [See nested exception: java.lang.NullPointerException]
    at org.quartz.core.JobRunShell.run(JobRunShell.java:224)
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557)
Caused by: java.lang.NullPointerException
    at java.base/java.io.FileInputStream.<init>(Unknown Source)
    at java.base/java.io.FileInputStream.<init>(Unknown Source)
    at java.base/java.io.FileReader.<init>(Unknown Source)
    at edu.ucsb.nceas.mdqengine.scheduler.AcquireWebResourcesJob.execute(AcquireWebResourcesJob.java:97)
    at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
    ... 1 more

doulikecookiedough commented 3 weeks ago

Check-in:

To Do

doulikecookiedough commented 2 weeks ago

Check in:

To Do