VEuPathDB / plot.data

Observations on zero imputation behaviour in megastudy #247

Open bobular opened 7 months ago

bobular commented 7 months ago

The imputation of the missing zeroes looks good, but at the moment the result depends on which data are in the subset.

Here is the megastudy with the following filters:

Study->Institution = Iowa State Mosquito Surveillance
Collection->Start date = 2012-01-01 to 2012-01-31
Sample->Species = Culiseta inornata

Make a floating time series plot and bin by week (auto zoom x-axis).

image

Now if we add the most common species (in N. America) to the filter, so that Sample->Species = Culiseta inornata + Aedes vexans, we get this:

image

Now we can go to the marker config and deselect Aedes vexans, so that we only see the Culiseta inornata numbers (though the Aedes vexans data is still in the subset and gets sent to the lineplot plugin)

image

bobular commented 7 months ago

What the final plot is showing is that the addition of Aedes vexans to the subset has provided collection events for which zeroes for Culiseta inornata can be added. So there are zeroes reported during July and August rather than big gaps between points (as in the first plot above). The mean specimen counts are reduced by the addition of more zeroes.
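To make the effect concrete, here is a minimal Python sketch of the behaviour described above. It is not the plot.data implementation; the record layout, the collection IDs and the function name `mean_count_with_imputed_zeroes` are all made up for illustration. It assumes only non-zero catches are stored, and that a zero can only be imputed for a collection that appears somewhere in the filtered subset.

```python
from collections import defaultdict

# Hypothetical records: (collection_id, species, specimen_count).
# Assumption: only non-zero catches are stored.
RECORDS = [
    ("c1", "Culiseta inornata", 4),
    ("c1", "Aedes vexans", 10),
    ("c2", "Aedes vexans", 7),
    ("c3", "Aedes vexans", 3),
    ("c4", "Culiseta inornata", 2),
]


def mean_count_with_imputed_zeroes(records, species_filter, target):
    """Mean count of `target` per collection in the filtered subset.

    Any collection that is in the subset but has no record for `target`
    contributes an imputed zero.
    """
    subset = [r for r in records if r[1] in species_filter]
    counts = defaultdict(int)  # collection_id -> count of target (0 if absent)
    for collection_id, species, count in subset:
        counts[collection_id] += count if species == target else 0
    if not counts:
        return None  # empty subset: nothing to average
    return sum(counts.values()) / len(counts)


target = "Culiseta inornata"
only_target = mean_count_with_imputed_zeroes(RECORDS, {target}, target)
with_vexans = mean_count_with_imputed_zeroes(RECORDS, {target, "Aedes vexans"}, target)

print(only_target)  # 3.0 (collections c1 and c4 only)
print(with_vexans)  # 1.5 (c2 and c3 now contribute zeroes)
```

With only the target species in the filter, collections c2 and c3 are invisible, so no zeroes can be added for them. Adding Aedes vexans brings them into the subset and halves the mean in this toy data.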

This is all good, but we'd need to communicate this to users somehow.

Of course, the best way to ensure the maximum number of collections is included in the subset is to do no filtering on species at all. The user could use marker config to show a small number of species, especially if we implement https://github.com/VEuPathDB/web-monorepo/issues/511. The only thing missing would be collections where absolutely nothing was collected (no species at all), but these are very rare. We'll need to have a think about what power users are able to download (to make sure they get the "empty collections").
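As a rough illustration of the "empty collections" point, the Python sketch below compares a mean computed from specimen records alone with one that also uses a separate list of all collection IDs. The data, the IDs and the function `mean_per_collection` are hypothetical, and whether a download would expose such a collection list is exactly the open question here.

```python
# Hypothetical data: c5 is an "empty collection" with no specimen records at all.
ALL_COLLECTION_IDS = ["c1", "c2", "c3", "c4", "c5"]
RECORDS = [
    ("c1", "Culiseta inornata", 4),
    ("c1", "Aedes vexans", 10),
    ("c2", "Aedes vexans", 7),
    ("c3", "Aedes vexans", 3),
    ("c4", "Culiseta inornata", 2),
]


def mean_per_collection(records, target, collection_ids=None):
    """Mean count of `target` per collection, with zeroes for absences.

    If `collection_ids` is given, every listed collection counts towards the
    denominator, including ones with no records at all.
    """
    totals = {cid: 0 for cid in (collection_ids or [])}
    for cid, species, count in records:
        totals.setdefault(cid, 0)  # collection seen only via its records
        if species == target:
            totals[cid] += count
    return sum(totals.values()) / len(totals)


from_records = mean_per_collection(RECORDS, "Culiseta inornata")
with_empty = mean_per_collection(RECORDS, "Culiseta inornata", ALL_COLLECTION_IDS)

print(from_records)  # 1.5 (c5 cannot be seen from the records)
print(with_empty)    # 1.2 (c5 adds one more zero)
```

Even with no species filter, records alone cannot reveal c5, so a download that contains only specimen rows would slightly overestimate the mean.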