Closed — chaoran-chen closed this issue 1 year ago
I wonder whether this could be related to the change of the primary key in the source data from `genbankAccession` to `strain`. Then, adopting the primary key would solve both this issue and #114.
How to reproduce: `load-nextstrain-genbank,transform-nextstrain-genbank,final-transforms,switch-in-staging`
Let me know if this is due to something being wrong in our open data. I think strain names should be unique. They aren't unique in what we import, but we chuck out the non-uniques.
@corneliusroemer, would it be possible that at some point `aligned.fasta.xz` had duplicates (e.g. OX402637)? If that were the case, we would have translated the same sequence twice, and the pipeline would have tried to save two AA sequences for the same sample, which is not allowed and is what the error message is about. Because we cache the translations, the issue persists even after the duplicates were removed from the source data file.
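To make the failure mode concrete, here is a minimal sketch (names hypothetical, not the actual pipeline's code) of a store that enforces one AA sequence per sample, the way a database unique key would:

```python
# Hypothetical sketch of the constraint described above: one AA sequence
# per sample id. Saving the same sample twice raises, mirroring the
# unique-constraint violation the pipeline hit.
class AASequenceStore:
    def __init__(self):
        self._by_sample = {}

    def save(self, sample_id, aa_seq):
        if sample_id in self._by_sample:
            # Mirrors a duplicate-key error in the real database.
            raise KeyError(f"duplicate sample: {sample_id}")
        self._by_sample[sample_id] = aa_seq
```

With a duplicated input record, the second `save` for the same id fails, and because translations are cached, the duplicate keeps being replayed until the cache is cleared.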
Anyways, at the moment, the fasta file is clean, so I'll clear the cache and reprocess all sequences (as soon as the GISAID pipeline has finished and we have capacity again). Let's hope that it will work!
Yes indeed, we did have duplicates. Maybe, to be safe, add a deduplication step or abort on duplicates. It could potentially happen again, though we'll try our best to avoid that.
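An abort-on-duplicate check like the one suggested above could be sketched as follows (illustrative only; record ids are assumed to be the first token of each FASTA header):

```python
# Hedged sketch: stream FASTA records and abort on a repeated record id.
def check_duplicates(lines):
    """Yield (header, sequence) pairs, raising ValueError on a duplicate id."""
    seen = set()
    header, seq = None, []

    def emit():
        key = header.split()[0]  # assume id = first token of the header
        if key in seen:
            raise ValueError(f"duplicate record id: {key}")
        seen.add(key)
        return header, "".join(seq)

    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield emit()
            header, seq = line[1:], []
        else:
            seq.append(line)
    if header is not None:
        yield emit()
```

Running this as a pre-translation step would turn a silent double translation into an immediate, explicit pipeline failure.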
Also, you could use .zst for faster decompression, cutting it from dozens of minutes to a few.
If you aren't yet watching ncov-ingest, I suggest you do 🙃
I'll ping you if this happens again though, see https://github.com/nextstrain/ncov-ingest/issues/387
Thanks! Yes, we should add a duplicate check!
The timing also fits. Both Theo's issue and this one were created three weeks ago.
Yes absolutely, I'm sorry I didn't connect the bug to this. My bad! Will bear it in mind in the future!
Don't worry! I could also have found the problem earlier, just haven't really had time to look into it.
The fact no one complained shows that open usage is limited :)
But with RKI data this would change, I promise :D
Update successful!
~Sometimes~ (now) always, the pipeline crashes with the following error message: