Indexing improvements

Various indexing improvements to "save the user from themselves" :), i.e. to stop the creation of the peptide bloom filter if the table size or k-mer size is too small for the dataset.

Uses khmer.calc_expected_collisions(peptide_index) to estimate the occupancy of the peptide index; if it is too high, raises an error and exits, warning the user to increase the table size (https://github.com/czbiohub/sencha/issues/73)
Computes the ratio of observed k-mers to theoretically possible k-mers and caps it at a maximum of 1e-4. That threshold is arbitrary: it is what worked for me empirically to produce translation results that I didn't think were total garbage. It corresponds to a k-mer size of 9 for Uniprot Mammals with the protein encoding, which resulted in an observed/theoretical ratio of 3.643015e-05
Ignores k-mers containing * (stop codon), X (unknown amino acid), or U (selenocysteine) residues (https://github.com/czbiohub/sencha/issues/72)
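
For reviewers, here is a rough sketch of what the last two checks amount to. It is not the code in this PR: the function names, the ValueError, and the assumption that the theoretical k-mer count is 20 ** ksize (a plain 20-letter protein alphabet) are all illustrative. Only the 1e-4 threshold and the *, X, U residues come from the description above.

```python
from typing import Iterator

# Residues named in the description: stop codon, unknown, selenocysteine.
INVALID_RESIDUES = frozenset("*XU")

# Threshold from the description; chosen empirically, not derived.
MAX_OBSERVED_FRACTION = 1e-4


def valid_kmers(sequence: str, ksize: int) -> Iterator[str]:
    """Yield k-mers of `sequence` that contain no invalid residue (hypothetical helper)."""
    for start in range(len(sequence) - ksize + 1):
        kmer = sequence[start:start + ksize]
        if INVALID_RESIDUES.isdisjoint(kmer):
            yield kmer


def observed_fraction(n_observed: int, ksize: int, alphabet_size: int = 20) -> float:
    """Observed k-mers divided by all possible k-mers.

    Assumes the theoretical count is alphabet_size ** ksize.
    """
    return n_observed / alphabet_size ** ksize


def check_observed_fraction(
    n_observed: int,
    ksize: int,
    alphabet_size: int = 20,
    max_fraction: float = MAX_OBSERVED_FRACTION,
) -> float:
    """Raise if the index covers too much of k-mer space to be informative."""
    fraction = observed_fraction(n_observed, ksize, alphabet_size)
    if fraction > max_fraction:
        raise ValueError(
            f"Observed/theoretical k-mer ratio {fraction:.3e} exceeds "
            f"{max_fraction:.0e}; consider increasing the k-mer size."
        )
    return fraction
```

Under these assumptions, the reported ratio of 3.643015e-05 at a k-mer size of 9 sits below the 1e-4 cutoff, so that index would pass the check.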

PR checklist

[x] This comment contains a description of changes (with reason)
[x] If you've fixed a bug or added code that should be tested, add tests!
[x] Ensure the test suite passes (command to run: pytest, or make coverage if you want to see which lines don't have tests yet)
[x] Make sure your code is linted and autoformatted using black (black . --check).
[ ] Documentation in usage.md is updated
[ ] README.md is updated

@bluegenes this "should" help with some semi-useful error messages when creating the index to prevent "bad" indices and thus bad translation results from getting created.

czbiohub-sf / orpheum

Indexing improvements #74
