apache / solr-operator

Official Kubernetes operator for Apache Solr
https://solr.apache.org/operator
Apache License 2.0

backup to GCS and S3 never completes #547

Open pbackup12345 opened 1 year ago

pbackup12345 commented 1 year ago

I have a relatively vanilla setup on a GKE Autopilot cluster, with tlsTermination at the Ingress and 3 pods. Everything works except for the backup feature.

I use Solr Operator 0.6.0, Solr 8.11.0, and Zookeeper 0.2.14.

Following are the relevant parts of my setup:

The backupRepositories section of my main YAML (GCS and S3 locations):

  backupRepositories:
    - name: "gcs-backups-1"
      gcs:
        bucket: "backupbx"
        gcsCredentialSecret:
          name: "gcssecret1"
          key: "service-account-key.json"
        baseLocation: "d"
    - name: "s3-backups-1"
      s3:
        region: "us-east-1" # Required
        bucket: "solr-xsearch" # Required
        credentials:
          accessKeyIdSecret: # Optional
            name: aws-secrets
            key: access-key-id
          secretAccessKeySecret: # Optional
            name: aws-secrets
            key: secret-access-key

My backup yaml:

apiVersion: solr.apache.org/v1beta1
kind: SolrBackup
metadata:
  name: local-backup14
  namespace: sop030
spec:
  repositoryName: "s3-backups-1"
  solrCloud: explore
  collections:
    - dsearch
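
I apply the manifest and check the backup resource along these lines (the file name is just a placeholder, assuming the SolrBackup CRD is installed under its default name):

kubectl apply -f solrbackup.yaml
kubectl get solrbackup local-backup14 -n sop030 -o yaml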

The backup actually starts and both locations (S3 and GCS) receive files, but after a while the backup process stops. There are no Solr error messages, but this is the relevant portion of the logs from the pod that performs the backup:

2023-04-08 14:42:00.734 INFO  (parallelCoreAdminExecutor-19-thread-3-processing-n:explore-solrcloud-1.explore-solrcloud-headless.sop030:8983_solr x:dsearch_shard1_replica_n1 local-backup14-dsearch8519612781285 BACKUPCORE) [c:dsearch s:shard1 r:core_node2 x:dsearch_shard1_replica_n1] o.a.s.h.IncrementalShardBackup Done creating backup snapshot at s3:/local-backup14-dsearch/dsearch/ shardBackupMetadataFile:org.apache.solr.core.backup.ShardBackupId@46768e0a
2023-04-08 14:42:01.396 INFO  (qtp1306246648-19) [   ] o.a.s.h.a.CoreAdminOperation Checking request status for : local-backup14-dsearch8519612781285
2023-04-08 14:42:01.396 INFO  (qtp1306246648-19) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores params={qt=/admin/cores&requestid=local-backup14-dsearch8519612781285&action=REQUESTSTATUS&wt=javabin&version=2} status=0 QTime=0
2023-04-08 14:42:01.398 INFO  (OverseerThreadFactory-29-thread-5-processing-n:explore-solrcloud-1.explore-solrcloud-headless.sop030:8983_solr) [c:dsearch   ] o.a.s.c.a.c.BackupCmd Starting to backup ZK data for backupName=local-backup14-dsearch
2023-04-08 14:42:04.358 INFO  (qtp1306246648-18) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=4
2023-04-08 14:42:04.358 INFO  (qtp1306246648-21) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=4
2023-04-08 14:42:09.357 INFO  (qtp1306246648-20) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=3
2023-04-08 14:42:14.358 INFO  (qtp1306246648-18) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=3
2023-04-08 14:42:14.361 INFO  (qtp1306246648-21) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=5
2023-04-08 14:42:16.585 INFO  (OverseerThreadFactory-29-thread-5-processing-n:explore-solrcloud-1.explore-solrcloud-headless.sop030:8983_solr) [c:dsearch   ] o.a.s.c.a.c.BackupCmd Completed backing up ZK data for backupName=local-backup14-dsearch
2023-04-08 14:42:16.655 INFO  (OverseerThreadFactory-29-thread-5-processing-n:explore-solrcloud-1.explore-solrcloud-headless.sop030:8983_solr) [c:dsearch   ] o.a.s.c.OverseerTaskQueue Response ZK path: /overseer/collection-queue-work/qnr-0000000043 doesn't exist. Requestor may have disconnected from ZooKeeper
2023-04-08 14:42:18.400 INFO  (OverseerCollectionConfigSetProcessor-144115198735482887-explore-solrcloud-1.explore-solrcloud-headless.sop030:8983_solr-n_0000000002) [   ] o.a.s.c.OverseerTaskQueue Response ZK path: /overseer/collection-queue-work/qnr-0000000043 doesn't exist. Requestor may have disconnected from ZooKeeper
2023-04-08 14:42:19.358 INFO  (qtp1306246648-20) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=3
2023-04-08 14:42:24.359 INFO  (qtp1306246648-18) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=4
2023-04-08 14:42:24.360 INFO  (qtp1306246648-22) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=3
2023-04-08 14:42:29.359 INFO  (qtp1306246648-20) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=4
2023-04-08 14:42:34.360 INFO  (qtp1306246648-18) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=6
2023-04-08 14:42:34.360 INFO  (qtp1306246648-24) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=6
2023-04-08 14:42:39.358 INFO  (qtp1306246648-21) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={} status=0 QTime=4
2023-04-08 14:42:42.779 INFO  (qtp1306246648-18) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/collections params={requestid=local-backup14-dsearch&action=REQUESTSTATUS&wt=json} status=0 QTime=3
2023-04-08 14:42:42.797 INFO  (qtp1306246648-22) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/collections params={requestid=local-backup14-dsearch&action=REQUESTSTATUS&wt=json} status=0 QTime=2

And this is where it seems to die:

2023-04-08 14:42:18.400 INFO  (OverseerCollectionConfigSetProcessor-144115198735482887-explore-solrcloud-1.explore-solrcloud-headless.sop030:8983_solr-n_0000000002) [   ] o.a.s.c.OverseerTaskQueue Response ZK path: /overseer/collection-queue-work/qnr-0000000043 doesn't exist. Requestor may have disconnected from ZooKeeper

From there the log only has endless calls to check the requeststatus with no results:

2023-04-08 14:43:27.884 INFO  (qtp1306246648-23) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/collections params={requestid=local-backup14-dsearch&action=REQUESTSTATUS&wt=json} status=0 QTime=2

Additionally, through normal API calls the backup and restore functions work perfectly. Specifically, the following runs without a hitch:

https://<solr-server>/solr/admin/collections?action=BACKUP&name=local-backup15&collection=dsearch&repository=s3-backups-1&location=/
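
(For reference, the operator-triggered backup appears to run asynchronously, judging by the REQUESTSTATUS polling in the logs above; that polling corresponds to a status call roughly like the one below. The host and credentials are placeholders; the request id is taken from the log lines.)

curl -u <user>:<password> "https://<solr-server>/solr/admin/collections?action=REQUESTSTATUS&requestid=local-backup14-dsearch&wt=json"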

And my main yaml in full:

apiVersion: solr.apache.org/v1beta1
kind: SolrCloud
metadata:
  name: explore
spec:
  backupRepositories:
    - name: "gcs-backups-1"
      gcs:
        bucket: "backupbx"
        gcsCredentialSecret:
          name: "gcssecret1"
          key: "service-account-key.json"
        baseLocation: "d"
    - name: "s3-backups-1"
      s3:
        region: "us-east-1" # Required
        bucket: "solr-xsearch" # Required
        credentials:
          accessKeyIdSecret: # Optional
            name: aws-secrets
            key: access-key-id
          secretAccessKeySecret: # Optional
            name: aws-secrets
            key: secret-access-key
  customSolrKubeOptions:
    ingressOptions:
      ingressClassName: nginx
    podOptions:
      resources:
        limits:
          memory: 3Gi
        requests:
          cpu: 700m
          memory: 3Gi
  dataStorage:
    persistent:
      pvcTemplate:
        spec:
          resources:
            requests:
              storage: 2Gi
      reclaimPolicy: Delete
  solrSecurity:
    authenticationType: Basic
  replicas: 2
  solrImage:
    repository: solr
    tag: 8.11.0
  solrJavaMem: -Xms500M -Xmx500M
  updateStrategy:
    managed:
      maxPodsUnavailable: 2
      maxShardReplicasUnavailable: 2
    method: Managed
  zookeeperRef:
    provided:
      chroot: /explore
      image:
        pullPolicy: IfNotPresent
        repository: pravega/zookeeper
        tag: 0.2.15
      persistence:
        reclaimPolicy: Delete
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 2Gi
      replicas: 3
  solrAddressability:
    external:
      domainName: <mydomain redacted>.xyz
      method: Ingress
      hideNodes: true
      useExternalAddress: false
      ingressTLSTermination:
        tlsSecret: explore-solr-tls-letsencrypt
HoustonPutman commented 1 year ago

Thanks for providing so much information!

Does this happen after a few backups complete, or does it happen from the first backup?

pbackup12345 commented 1 year ago

It happens right after the first installation, when I created a collection and added just a few records. And it happens repeatedly, i.e. I started from scratch several times and ran into the same problem with a default Autopilot GKE cluster.

HoustonPutman commented 1 year ago

So at this line, it should have finished almost all of the backup logic and the files should exist in S3/GCS. The line after that will delete any old incremental backups if necessary (which it shouldn't need to, as it's the first backup...).

2023-04-08 14:42:16.585 INFO  (OverseerThreadFactory-29-thread-5-processing-n:explore-solrcloud-1.explore-solrcloud-headless.sop030:8983_solr) [c:dsearch   ] o.a.s.c.a.c.BackupCmd Completed backing up ZK data for backupName=local-backup14-dsearch

Also before that error can happen, it should save the state of the completed task. So whenever the requeststatus request is sent, it should have information to populate the result.

From there the log only has endless calls to check the requeststatus with no results:

Can you show me what this means? What does the "no results" json look like?

Have you tried running an async backup yourself?
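
Just as a sketch of what I mean (the backup name, async id, host, and credentials below are placeholders), an async backup plus a manual status check would look something like:

curl -u <user>:<password> "https://<solr-server>/solr/admin/collections?action=BACKUP&name=manual-backup-1&collection=dsearch&repository=s3-backups-1&location=/&async=manual-backup-1"
curl -u <user>:<password> "https://<solr-server>/solr/admin/collections?action=REQUESTSTATUS&requestid=manual-backup-1&wt=json"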

cesarfm commented 1 year ago

Hello all. I am observing exactly the same symptoms described above.

Moreover, I see something that might be of interest for the investigation.