samples based on population

@corneliusroemer has also pointed out the limited usefulness of "estimated cases" in https://github.com/GenSpectrum/cov-spectrum-website/issues/748.

I fully agree that "estimated cases" could be hidden and replaced by an "estimated share" weighted by population. However, the number of sequences uploaded by some countries in a given time interval is too low for this. To make the estimate more stable, I suggest that countries with too few sequences should not be weighted by their population, but by their number of sequences multiplied by a factor. More specifically, for a country with fewer than 1 sequence per 1 million population in a week, its share of a variant would be multiplied by the number of sequences times one million instead of by its real population. In other words, weighted_base = min(population, uploaded_sequences*1,000,000).

Examples of some typical countries for the week 2023-02-06 to 2023-02-12:

Country	Population	Uploaded sequences	Suggested weighted base
China	1,439,323,776	1,174	1,174,000,000
India	1,380,004,385	90	90,000,000
United States	331,002,651	16,714	331,002,651
Thailand	69,799,978	5	5,000,000
United Kingdom	67,886,011	10,984	67,886,011
Peru	32,971,854	0	0
Denmark	5,792,202	1,302	5,792,202
Norway	5,421,241	2	2,000,000
Puerto Rico	2,860,853	1	1,000,000
Curacao	164,093	1	164,093
Liechtenstein	38,128	0	0
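
As a minimal sketch of the proposed rule, the suggested weighted base for a few of the rows above can be computed as follows. The function name `weighted_base` follows the formula in this issue; the constant name `PEOPLE_PER_SEQUENCE` is only an illustrative assumption and is not part of the cov-spectrum code base.

```python
# Assumed cap: one uploaded sequence can represent at most one million people.
PEOPLE_PER_SEQUENCE = 1_000_000


def weighted_base(population: int, uploaded_sequences: int) -> int:
    """Return min(population, uploaded_sequences * 1,000,000)."""
    return min(population, uploaded_sequences * PEOPLE_PER_SEQUENCE)


# (population, uploaded sequences) for the week 2023-02-06 to 2023-02-12,
# copied from the table in this issue.
countries = {
    "China": (1_439_323_776, 1_174),
    "Thailand": (69_799_978, 5),
    "Denmark": (5_792_202, 1_302),
    "Curacao": (164_093, 1),
    "Peru": (32_971_854, 0),
}

for name, (population, sequences) in countries.items():
    print(f"{name}: {weighted_base(population, sequences):,}")
```

Countries that sequence at least 1 per million per week (Denmark, Curacao) keep their real population as weight, while under-sequenced countries are capped and countries without any sequences get a weight of 0.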

It would be problematic to weight the 70 million population of Thailand based on only 5 sequences in a week, so I think the number of uploaded sequences should also enter the weighting, to avoid the problems caused by very low sequence counts in some countries. However, excluding countries with, e.g., fewer than 10 sequences would not be a good idea, since even a small number of uploaded sequences shows at least some circulation of a variant in that place. Therefore, I propose weighted_base = min(population, uploaded_sequences*1,000,000), calculated for 7-day intervals. A better function weighted_base = f(population, uploaded_sequences) may be investigated.

In this way, sequences can be weighted by country for a better estimate of the share of a variant globally and within a continent. For a more detailed analysis, the same can be done at the division level (e.g. province, state) to estimate the share of a variant within a country, within a continent, and globally. The inhomogeneity of sequencing intensity within certain countries is not negligible.
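
To illustrate how such a weighted base could feed into a global or continental share estimate, here is a rough sketch. It assumes the share is a weighted average of per-country variant proportions; that aggregation, the function name `weighted_share`, and the variant counts in the example are hypothetical and only serve to show the arithmetic. Populations and total sequence counts are taken from the table in this issue.

```python
def weighted_base(population, uploaded_sequences, people_per_sequence=1_000_000):
    """Proposed weight: min(population, uploaded_sequences * 1,000,000)."""
    return min(population, uploaded_sequences * people_per_sequence)


def weighted_share(rows):
    """Estimate the share of a variant across several countries.

    rows: iterable of (population, uploaded_sequences, variant_sequences)
    for one 7-day interval. Returns None if no country has any sequences.
    """
    numerator = 0.0
    denominator = 0
    for population, total, variant in rows:
        if total == 0:
            continue  # weighted base is 0, so the country contributes nothing
        base = weighted_base(population, total)
        numerator += base * (variant / total)  # weight times local share
        denominator += base
    return numerator / denominator if denominator else None


# Variant counts (third column) are made up for illustration.
rows = [
    (69_799_978, 5, 4),       # Thailand: weight capped at 5,000,000
    (5_792_202, 1_302, 651),  # Denmark: weight is the real population
    (32_971_854, 0, 0),       # Peru: no sequences, weight 0
]
print(weighted_share(rows))
```

With plain population weighting, Thailand's 5 sequences would dominate this estimate; with the capped weight, Thailand and Denmark contribute on a similar scale.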

A web page could be created to illustrate the sequencing intensity of different countries and divisions, allowing easier comparison.

GenSpectrum / cov-spectrum-website

samples based on population #751