GenSpectrum / cov-spectrum-website

A web platform to detect and analyze variants of SARS-CoV-2
https://cov-spectrum.org
GNU General Public License v3.0

samples based on population #751

Open aviczhl2 opened 1 year ago

aviczhl2 commented 1 year ago

Hi, I see an "estimated case" feature on cov-spectrum.

Given that countries have stopped most testing and no longer report the real number of cases, this feature seems redundant.

As countries head towards a new norm of "rising sea level with high/low tide", their infection rates hover around a roughly constant level, perhaps 0.3-1% daily.

Under such circumstances, I think an "estimated share among the global population", which weights each sample by the population of its region, may be more useful.

A rough estimate of population by country can be taken from Worldometers.

AnonymousUserUse commented 1 year ago

@corneliusroemer has also pointed out the limited usefulness of "estimated cases" in https://github.com/GenSpectrum/cov-spectrum-website/issues/748.

I fully agree that "estimated cases" could be hidden and replaced by an "estimated share" weighted by population. However, the number of sequences uploaded by some countries in a given time interval is very low. To make the estimate more stable, I suggest that countries with too few sequences should not be weighted by their population, but by their number of sequences multiplied by a factor. More specifically, for a country with fewer than 1 sequence per 1 million population in a week, its share of a variant would be multiplied by the number of sequences times one million instead of by its real population. In other words, weighted_base = min(population, uploaded_sequences*1,000,000).

Examples of some typical countries for the week 2023-02-06 to 2023-02-12:

| Country | Population | Uploaded sequences | Suggested weighted base |
| --- | ---: | ---: | ---: |
| China | 1,439,323,776 | 1,174 | 1,174,000,000 |
| India | 1,380,004,385 | 90 | 90,000,000 |
| United States | 331,002,651 | 16,714 | 331,002,651 |
| Thailand | 69,799,978 | 5 | 5,000,000 |
| United Kingdom | 67,886,011 | 10,984 | 67,886,011 |
| Peru | 32,971,854 | 0 | 0 |
| Denmark | 5,792,202 | 1,302 | 5,792,202 |
| Norway | 5,421,241 | 2 | 2,000,000 |
| Puerto Rico | 2,860,853 | 1 | 1,000,000 |
| Curacao | 164,093 | 1 | 164,093 |
| Liechtenstein | 38,128 | 0 | 0 |
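
The suggested weighted base for each country follows directly from the formula. Here is a minimal Python sketch, assuming the cap of one million people per sequence proposed above; the names `weighted_base` and `PER_SEQUENCE_CAP` are mine for illustration and are not part of cov-spectrum:

```python
# Illustrative only: names and structure are assumptions, not cov-spectrum code.
PER_SEQUENCE_CAP = 1_000_000  # proposed cap: one sequence represents at most 1M people


def weighted_base(population: int, uploaded_sequences: int) -> int:
    """Return min(population, uploaded_sequences * 1,000,000)."""
    return min(population, uploaded_sequences * PER_SEQUENCE_CAP)


# (country, population, sequences uploaded 2023-02-06 to 2023-02-12), from the table
rows = [
    ("China", 1_439_323_776, 1_174),
    ("Thailand", 69_799_978, 5),
    ("Peru", 32_971_854, 0),
    ("Curacao", 164_093, 1),
]

for country, population, sequences in rows:
    print(f"{country}: {weighted_base(population, sequences):,}")
```

A country with zero sequences gets a weight of zero, and a small territory such as Curacao is still capped by its real population.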

It would be problematic to weight Thailand's population of 70 million based on only 5 sequences in a week, so I think the number of uploaded sequences should also enter the weighting, to avoid problems caused by very low sequence counts in some countries. However, excluding countries with, e.g., fewer than 10 sequences would not be a good idea, since even a small number of uploaded sequences shows at least some circulation of a variant in that place. Therefore, I propose weighted_base = min(population, uploaded_sequences*1,000,000), calculated for 7-day intervals. A better function weighted_base = f(population, uploaded_sequences) could be investigated.

In this way, sequences can be weighted by country for a better estimate of the share of a variant globally and within a continent. For a more detailed analysis, the same can be done at division (e.g. province, state) level to estimate the share of a variant within a country, a continent, and globally. The inhomogeneity of sequencing intensity within certain countries is not negligible.

A web page could be created to show the sequencing intensity of different countries and divisions, allowing easier comparison.
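
To make the aggregation step concrete, here is a rough Python sketch of how a weighted share could be computed for one 7-day window. It is only one possible reading of the proposal: each country's variant share is its variant sequences divided by its total sequences, and countries are combined using the weighted base as the weight. The `CountryWeek` type, the function names, and the variant counts in the example are all made up for illustration; only the populations and sequence totals come from the table.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class CountryWeek:
    country: str
    population: int
    sequences: int          # all sequences uploaded in the 7-day window
    variant_sequences: int  # of those, sequences assigned to the variant


def weighted_base(population: int, sequences: int, cap: int = 1_000_000) -> int:
    # Proposed weight: min(population, uploaded_sequences * 1,000,000)
    return min(population, sequences * cap)


def weighted_share(rows: Iterable[CountryWeek]) -> Optional[float]:
    """Weighted variant share across countries, or None if there is no data."""
    total_weight = 0
    variant_weight = 0.0
    for r in rows:
        if r.sequences == 0:
            continue  # weight is zero anyway; also avoids dividing by zero
        w = weighted_base(r.population, r.sequences)
        total_weight += w
        variant_weight += w * r.variant_sequences / r.sequences
    return variant_weight / total_weight if total_weight else None


# Variant counts below are invented purely to exercise the function.
week = [
    CountryWeek("United States", 331_002_651, 16_714, 8_357),
    CountryWeek("Thailand", 69_799_978, 5, 4),
    CountryWeek("Peru", 32_971_854, 0, 0),
]
print(weighted_share(week))
```

The same function could be applied per continent or per division by passing in the matching subset of rows.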