Sequence length cutoffs between 500 and 2000 bp matter little. For plasmids, a cutoff of 200 bp gives the best results, but the effect is small and unclear.
Smaller contigs do not interfere with mapping. There is no difference between filtering contigs before mapping and filtering them in Vamb.
We were unable to get Vamb to show any signs of overfitting, even when using only 10% of the contigs of a pretty good assembly. It's still possible that it can overfit when the input consists of only a few, huge contigs, but we cannot determine this until we get good synthetic long-read datasets.
So, in conclusion: None of this matters. We can remove the overfitting warning from Vamb.
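
For reference, the "filter contigs before mapping" variant compared above amounts to a plain length filter on the assembly FASTA. Below is a minimal sketch of such a filter, assuming contigs arrive as lines of FASTA text. The names `read_fasta` and `filter_contigs` are hypothetical helpers written for illustration; this is not Vamb's own code.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from lines of FASTA text."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        else:
            if header is None:
                raise ValueError("sequence data found before first FASTA header")
            chunks.append(line)
    # Emit the final record.
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(lines: Iterable[str], min_length: int) -> Iterator[Tuple[str, str]]:
    """Keep only contigs whose sequence is at least min_length bp long."""
    for header, seq in read_fasta(lines):
        if len(seq) >= min_length:
            yield header, seq


# Illustrative input: one 400 bp contig and one 2400 bp contig.
example = [">c1", "ACGT" * 100, ">c2", "ACGT" * 600]
kept = [header for header, _ in filter_contigs(example, min_length=2000)]
```

With a cutoff of 2000 bp, only the second contig in the example input is kept. The same function can be run on an open file handle, since it only iterates over lines.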
It would be nice if Vamb could automatically adjust the contig size cutoff based on the characteristics of the input data. To do this, we need to: