GenSpectrum / cov-spectrum-website

A web platform to detect and analyze variants of SARS-CoV-2
https://cov-spectrum.org
GNU General Public License v3.0

Are insertions broken? #992

Closed corneliusroemer closed 2 months ago

corneliusroemer commented 3 months ago

I can't see any insertions, but there should be insertions in JN.1

https://cov-spectrum.org/explore/World/AllSamples/Past6M/variants?nextcladePangoLineage=JN.1*&

[Screenshot: Brave Browser, 2024-05-30 18:28:46]

Taepper commented 3 months ago

~It seems to be only for this lineage in particular?~ Ah, other pages were still cached for me.

corneliusroemer commented 3 months ago

Alex and I investigated, and it looks like none of the data ingested (submitted) since 2023-01-21 has insertions. There's a sharp cutoff between 2023-01-20 and 2023-01-21.

That makes it very likely that it's ingest related?

See: https://lapis.cov-spectrum.org/gisaid/v2/sample/nucleotideInsertions?dateSubmittedFrom=2023-01-21&accessKey=9Cb3CqmrFnVjO3XCxQLO6gUnKPd -> no insertions since 2023-01-21

But before: https://lapis.cov-spectrum.org/gisaid/v2/sample/nucleotideInsertions?dateSubmittedFrom=2023-01-20&accessKey=9Cb3CqmrFnVjO3XCxQLO6gUnKPd
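A cutoff like this can be narrowed down by bisecting on `dateSubmittedFrom`: "are there any insertions submitted on or after date D" should flip from yes to no exactly once as D increases. The following is only a sketch under that assumption, not something used in this thread. `first_date_without_insertions` and the `has_insertions_from` callback are hypothetical names; the callback would wrap a query like the two above and report whether the result is non-empty.

```python
from datetime import date, timedelta
from typing import Callable


def first_date_without_insertions(
    lo: date,
    hi: date,
    has_insertions_from: Callable[[date], bool],
) -> date:
    """Return the earliest date D in (lo, hi] for which the check is False.

    Assumes has_insertions_from(lo) is True, has_insertions_from(hi) is
    False, and that the answer flips only once between the two dates.
    """
    while (hi - lo).days > 1:
        mid = lo + timedelta(days=(hi - lo).days // 2)
        if has_insertions_from(mid):
            lo = mid  # insertions still show up from mid onward
        else:
            hi = mid  # nothing from mid onward, so the cutoff is at or before mid
    return hi
```

Each step halves the date range, so a window of several years needs only about a dozen queries.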

chaoran-chen commented 3 months ago

Hmm. Did you check the data that SILO gets? Are the insertions missing there?

chaoran-chen commented 3 months ago

I am now extracting the nucleotide insertions from the file that we give to SILO via

zstdcat provision.1716905734.ndjson.zst | \
  jq -c '{gisaidEpiIsl: .metadata.gisaidEpiIsl, nucleotideInsertions: .nucleotideInsertions.main, dateSubmitted: .metadata.dateSubmitted}' | \
  zstd > nucleotideInsertions.ndjson.zstd

This will take a while. I'll let you know tomorrow.

chaoran-chen commented 3 months ago

@Taepper, I used `zstdcat nucleotideInsertions.ndjson.zstd | jq -c 'select((.nucleotideInsertions | length > 0) and .dateSubmitted > "2023-02-23")' | less` to see the sequences with insertions after 2023-02-23. I found 253678 entries with insertions.
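For anyone who prefers to script the same check, here is a rough Python equivalent of the jq filter. It is a sketch that assumes the record layout produced by the extraction step (`gisaidEpiIsl`, `nucleotideInsertions`, `dateSubmitted`); the function name `with_insertions_after` is made up for illustration.

```python
import json
from typing import Iterable, Iterator


def with_insertions_after(lines: Iterable[str], cutoff: str) -> Iterator[dict]:
    """Yield records that have at least one nucleotide insertion and were
    submitted strictly after `cutoff` (an ISO date string, YYYY-MM-DD)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the NDJSON stream
        record = json.loads(line)
        insertions = record.get("nucleotideInsertions") or []
        submitted = record.get("dateSubmitted")
        # ISO dates compare correctly as plain strings; records without a
        # submission date are skipped, matching the jq filter's behaviour.
        if insertions and submitted is not None and submitted > cutoff:
            yield record
```

Fed with the decompressed stream (for example by piping `zstdcat nucleotideInsertions.ndjson.zstd` into a script that passes `sys.stdin` to the function), `sum(1 for _ in with_insertions_after(sys.stdin, "2023-02-23"))` should give the same kind of count as above.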

corneliusroemer commented 3 months ago

Interestingly, open does show some insertions as well, even for recently submitted sequences, but not for recently collected ones: https://open.cov-spectrum.org/explore/World/AllSamples/AllTimes/variants?&

https://open.cov-spectrum.org/explore/World/AllSamples/Past6M/variants?&