GenSpectrum / LAPIS

An API, a query engine, and a database schema for genomic sequences; currently with a focus on SARS-CoV-2
https://lapis-three.vercel.app
GNU Affero General Public License v3.0

Processing of the pipeline for open data crashes #86

Closed by chaoran-chen 1 year ago

chaoran-chen commented 1 year ago

~Sometimes~ (now) always, the pipeline crashes with the following error message:

00:36:26.723 [pool-2-thread-20] DEBUG com.mchange.v2.c3p0.impl.DefaultConnectionTester - Testing a Connection in response to an Exception:
java.sql.BatchUpdateException: Batch entry 3,229 insert into y_main_aa_sequence_staging (id, gene, aa_seq_compressed)
values (183155, 'E', ?) was aborted: ERROR: duplicate key value violates unique constraint "y_main_aa_sequence_staging_pkey"
  Detail: Key (id, gene)=(183155, E) already exists.  Call getNextException to see other errors in the batch.
        at org.postgresql.jdbc.BatchResultHandler.handleError(BatchResultHandler.java:165)
        at org.postgresql.core.ResultHandlerDelegate.handleError(ResultHandlerDelegate.java:52)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2367)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2099)
        at org.postgresql.core.v3.QueryExecutorImpl.flushIfDeadlockRisk(QueryExecutorImpl.java:1456)
        at org.postgresql.core.v3.QueryExecutorImpl.sendQuery(QueryExecutorImpl.java:1481)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:546)
        at org.postgresql.jdbc.PgStatement.internalExecuteBatch(PgStatement.java:893)
        at org.postgresql.jdbc.PgStatement.executeBatch(PgStatement.java:916)
        at org.postgresql.jdbc.PgPreparedStatement.executeBatch(PgPreparedStatement.java:1684)
        at com.mchange.v2.c3p0.impl.NewProxyPreparedStatement.executeBatch(NewProxyPreparedStatement.java:2544)
        at ch.ethz.lapis.util.Utils.executeClearCommitBatch(Utils.java:101)
        at ch.ethz.lapis.transform.TransformService.lambda$compressAASeqs$1(TransformService.java:368)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "y_main_aa_sequence_staging_pkey"
  Detail: Key (id, gene)=(183155, E) already exists.
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2676)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2366)
        ... 15 common frames omitted
chaoran-chen commented 1 year ago

I wonder whether this could be related to the change of the primary key in the source data from genbankAccession to strain. If so, adopting the new primary key would solve both this issue and #114.

fengelniederhammer commented 1 year ago

How to reproduce:

corneliusroemer commented 1 year ago

Let me know if this is due to something being wrong in our open data. I think strain names should be unique: they aren't in what we import, but we throw out the non-unique ones.

chaoran-chen commented 1 year ago

@corneliusroemer, would it be possible that at some point, aligned.fasta.xz had duplicates (e.g. OX402637)? If that were the case, we would have translated the same sequence twice, and the pipeline would have tried to save two AA sequences for the same sample, which is not allowed and is exactly what the error message is about. Because we cache the translations, the issue persists even after the duplicates were removed from the source data file.

Anyways, at the moment, the fasta file is clean, so I'll clear the cache and reprocess all sequences (as soon as the GISAID pipeline has finished and we have capacity again). Let's hope that it will work!

corneliusroemer commented 1 year ago

Yes indeed, we did have duplicates. To be safe, maybe add a deduplication step or abort on duplicates. It could potentially happen again, though we'll try our best to avoid that.

Also, you could use .zst for faster decompression: it would cut the time from dozens of minutes down to a few.
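
On the consuming side, reading .zst from Java shouldn't need much. Roughly something like this (just a sketch assuming Apache Commons Compress with the zstd-jni binding on the classpath; not your actual code, and the class name is made up):

```java
import org.apache.commons.compress.compressors.zstandard.ZstdCompressorInputStream;

import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ZstdFastaSource {

    /** Opens a zstd-compressed FASTA file as a line reader. */
    public static BufferedReader open(Path fastaZst) throws IOException {
        // Wrap the .zst file in a decompressing stream (commons-compress + zstd-jni required).
        BufferedInputStream raw = new BufferedInputStream(Files.newInputStream(fastaZst));
        return new BufferedReader(
            new InputStreamReader(new ZstdCompressorInputStream(raw), StandardCharsets.UTF_8));
    }
}
```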

corneliusroemer commented 1 year ago

If you aren't yet watching ncov-ingest, I suggest you do 🙃

I'll ping you if this happens again though, see https://github.com/nextstrain/ncov-ingest/issues/387

chaoran-chen commented 1 year ago

Thanks! Yes, we should add a duplicate check!
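
Something along these lines might do as a first version (only a sketch, not existing pipeline code; the class/method names and where exactly it would hook in before translation/import are assumptions):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class FastaDuplicateCheck {

    /**
     * Fails fast if the same FASTA record ID occurs more than once in the input.
     * Intended to run on the decompressed aligned FASTA before sequences are translated and imported.
     */
    public static void assertNoDuplicateIds(BufferedReader fasta) throws IOException {
        Set<String> seen = new HashSet<>();
        String line;
        while ((line = fasta.readLine()) != null) {
            if (line.startsWith(">")) {
                // Take the first whitespace-separated token of the header as the record ID.
                String id = line.substring(1).trim().split("\\s+", 2)[0];
                if (!seen.add(id)) {
                    throw new IllegalStateException("Duplicate FASTA record ID: " + id);
                }
            }
        }
    }
}
```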

chaoran-chen commented 1 year ago

The timing also fits. Both Theo's issue and this one were created three weeks ago.

corneliusroemer commented 1 year ago

Yes absolutely, I'm sorry I didn't connect the bug to this. My bad! Will bear it in mind in the future!

chaoran-chen commented 1 year ago

Don't worry! I could also have found the problem earlier, just haven't really had time to look into it.

corneliusroemer commented 1 year ago

The fact that no one complained shows that usage of the open data is limited :)

But with RKI data this would change, I promise :D

chaoran-chen commented 1 year ago

Update successful!