GenSpectrum / LAPIS

An API, a query engine, and a database schema for genomic sequences; currently with a focus on SARS-CoV-2
https://lapis-three.vercel.app
GNU Affero General Public License v3.0

baseline filter #328

Open aswarren opened 1 year ago

aswarren commented 1 year ago

Pulling down surveillance data from the API includes all sequences, regardless of why they were sampled. In the case of the US / GISAID, this includes traveller surveillance, which, when estimating prevalence for a particular area, can give a very different picture than domestic spread. Is there a way to filter sequences based on a baseline sequencing tag? If not, it would be useful to have.

chaoran-chen commented 1 year ago

We have a field samplingStrategy. You can see the available tags using fields=samplingStrategy, e.g., at https://lapis.cov-spectrum.org/open/v1/sample/aggregated?fields=samplingStrategy.

aswarren commented 1 year ago

Awesome! Thanks! Is there a field guide for explanation of A, X, Y, N?

{"errors":[],"info":{"apiVersion":1,"dataVersion":1690103788,"deprecationDate":null,"deprecationInfo":null,"acknowledgement":null},"data":[{"samplingStrategy":"A","count":48019},{"samplingStrategy":"X","count":192119},{"samplingStrategy":"Y","count":44101},{"samplingStrategy":"N","count":314101},{"samplingStrategy":null,"count":7683436}]}
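For anyone who wants to work with that response programmatically, here is a minimal sketch in Python. It only uses the endpoint and the response shape shown in this thread; the helper names `aggregated_url` and `counts_by_strategy` are made up for illustration, and the network call is left as a comment so the snippet runs offline against the pasted payload.

```python
import json
from urllib.parse import urlencode

# Endpoint taken from the comment above (open LAPIS instance, API v1).
BASE_URL = "https://lapis.cov-spectrum.org/open/v1/sample/aggregated"


def aggregated_url(**params):
    """Build an aggregated-query URL (hypothetical helper, not part of LAPIS)."""
    return f"{BASE_URL}?{urlencode(params)}"


def counts_by_strategy(payload):
    """Turn the response's "data" list into {samplingStrategy: count}."""
    if payload["errors"]:
        raise RuntimeError(f"API reported errors: {payload['errors']}")
    return {row["samplingStrategy"]: row["count"] for row in payload["data"]}


# The response pasted above, verbatim. To fetch it live instead, something like
# json.load(urllib.request.urlopen(aggregated_url(fields="samplingStrategy")))
# should work, assuming the endpoint is still reachable.
payload = json.loads("""
{"errors":[],"info":{"apiVersion":1,"dataVersion":1690103788,"deprecationDate":null,
"deprecationInfo":null,"acknowledgement":null},"data":[
{"samplingStrategy":"A","count":48019},{"samplingStrategy":"X","count":192119},
{"samplingStrategy":"Y","count":44101},{"samplingStrategy":"N","count":314101},
{"samplingStrategy":null,"count":7683436}]}
""")

counts = counts_by_strategy(payload)
total = sum(counts.values())

# Print each tag with its count and share, largest first.
for tag, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{str(tag):>5}  {n:>9}  {n / total:.2%}")
```

Note that JSON `null` becomes the Python key `None`, which is by far the largest group here, so most sequences carry no sampling strategy at all.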

corneliusroemer commented 1 year ago

> Is there a field guide for explanation of A, X, Y, N?

The values A, X, Y, N appear only for data pulled from RKI (Germany's CDC), as opposed to GenBank. Their README is here: https://github.com/robert-koch-institut/SARS-CoV-2-Sequenzdaten_aus_Deutschland

image

It's a bit scrambled, the sentences seem incomplete. I would say:

- X: unknown whether targeted or not
- Y: sequencing done potentially due to interesting mutations/variant PCR
- A: Variant PCR suggested something of interest
- N: Representative sampling
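As a rough sketch, the interpretation above could be kept next to the data as a lookup table and used for a client-side filter. The labels are only the paraphrase given in this comment, not an official RKI definition, and the function names and example rows are made up for illustration.

```python
# Labels paraphrased from this thread; NOT an official RKI definition.
SAMPLING_STRATEGY_LABELS = {
    "X": "unknown whether targeted or not",
    "Y": "sequencing potentially prompted by interesting mutations / variant PCR",
    "A": "variant PCR suggested something of interest",
    "N": "representative sampling",
}


def describe(code):
    """Return a human-readable label for a samplingStrategy value."""
    if code is None:
        return "no sampling strategy recorded"
    return SAMPLING_STRATEGY_LABELS.get(code, f"unrecognised code {code!r}")


def representative_only(rows):
    """Keep only rows tagged as representative sampling (code "N")."""
    return [row for row in rows if row.get("samplingStrategy") == "N"]


# Made-up rows, purely to show the shape; real rows would come from the API.
rows = [
    {"sample": "example-1", "samplingStrategy": "N"},
    {"sample": "example-2", "samplingStrategy": "A"},
    {"sample": "example-3", "samplingStrategy": None},
]

for row in rows:
    print(row["sample"], "->", describe(row["samplingStrategy"]))

print([row["sample"] for row in representative_only(rows)])
```

Filtering on "N" only approximates a baseline set: rows with no code at all are dropped too, and the caveat about annotation reliability still applies.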

I'm not sure about how reliable the annotation is though. I remember that when I looked into it a year ago, it seemed like representative sampling wasn't necessarily representative.

I think the field was introduced back in the day when labs started to do variant PCRs to get a quick idea of which variant a patient had, as variant PCR was as fast as regular PCR and involved less delay than waiting for whole genome sequencing.

aswarren commented 1 year ago

Ah, thanks very much to you both. Since @chaoran-chen's example uses the open API, I was also wondering about the mapping from the "purpose_of_sampling" tag in NCBI to the codes explained in @corneliusroemer's link.

One example where the baseline tag ends up mattering in the US is CDC sequencing of nasal swabs vs traveller surveillance. In previous months, when pulling down the surveillance via the API, the growth curve of XBB.1.16 looked much more aggressive in domestic surveillance because traveller surveillance was being included. If I were estimating prevalence in a state, I likely wouldn't want to include people landing at the airport, whether domestic or international. That motivated my initial question about the ability to filter, since presumably traveller surveillance wouldn't qualify for baseline or might be distinguishable in some way via that field.

On NCBI, the purpose_of_sampling can be accessed via the CLI like so:

$ datasets summary virus genome taxon sars-cov-2 --released-after 05/20/2023 | jq -r '.reports[] | select(.purpose_of_sampling != null) | [.accession,.purpose_of_sampling,.isolate.name] | @tsv' >ncbi_baseline.tsv

Most of that command line magic was provided by Eric Cox at NCBI-Datasets.
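For those who prefer Python to jq, the same selection can be sketched as below. It assumes the `datasets summary` output is a single JSON document with a top-level `reports` array, which is what the `.reports[]` filter above implies. The function names are made up, and the accessions and purpose values in the example are placeholders, not real records.

```python
import json


def purpose_rows(summary):
    """Mirror the jq filter: keep reports with a non-null purpose_of_sampling
    and return (accession, purpose_of_sampling, isolate name) tuples."""
    rows = []
    for report in summary.get("reports", []):
        purpose = report.get("purpose_of_sampling")
        if purpose is None:
            continue  # same effect as select(.purpose_of_sampling != null)
        # A missing isolate or name becomes an empty column, as jq's @tsv does for null.
        isolate_name = (report.get("isolate") or {}).get("name") or ""
        rows.append((report.get("accession") or "", str(purpose), isolate_name))
    return rows


def to_tsv(rows):
    """Join rows into tab-separated text (no escaping of embedded tabs)."""
    return "\n".join("\t".join(fields) for fields in rows)


# Placeholder document in the assumed shape; every value here is invented.
example = json.loads("""
{"reports": [
  {"accession": "ACC0001", "purpose_of_sampling": "PURPOSE_A", "isolate": {"name": "isolate-1"}},
  {"accession": "ACC0002", "isolate": {"name": "isolate-2"}},
  {"accession": "ACC0003", "purpose_of_sampling": "PURPOSE_B"}
]}
""")

print(to_tsv(purpose_rows(example)))
```

To run it on real data, the command's JSON output could be piped in and read with `json.load(sys.stdin)` instead of the placeholder document.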

corneliusroemer commented 1 year ago

Ah, very nice @aswarren! The open data gets to LAPIS via nextstrain/ncov-ingest, and I don't think we currently use that purpose_of_sampling field there, though we definitely should.