Various indexing improvements to "save the user from themselves" :) That is, stop the creation of the peptide bloom filter if the table size or k-mer size is too small for the dataset.
Uses `khmer.calc_expected_collisions(peptide_index)` to check the occupancy of the peptide index. If the table is too small, indexing exits with an error telling the user to increase the table size (https://github.com/czbiohub/sencha/issues/73)
Computes the ratio of observed k-mers to theoretically possible k-mers and requires it to be at most 1e-4. I set that threshold arbitrarily: it was what worked for me empirically to produce translation results that I didn't think were total garbage. It corresponds to a k-mer size of 9 for UniProt Mammals with the protein encoding, which gave an observed/theoretical ratio of 3.643015e-05.
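For reviewers, here is a minimal sketch of the two checks described above. It is not the code in this PR. The function names, the 20-letter protein alphabet, and the `max_false_positive=0.2` cutoff are all assumptions made for illustration. The false positive estimate is the usual bloom filter approximation (occupied fraction raised to the number of tables), written in plain Python so it runs without khmer.

```python
def kmer_space_ratio(n_observed_kmers, ksize, alphabet_size=20):
    """Observed k-mers as a fraction of all possible k-mers.

    alphabet_size=20 assumes the 20 standard amino acids (an assumption,
    other encodings would use a different alphabet size).
    """
    return n_observed_kmers / alphabet_size ** ksize


def expected_false_positive_rate(n_occupied, table_size, n_tables):
    """Approximate chance that an absent k-mer hits an occupied bin in every table."""
    return (n_occupied / table_size) ** n_tables


def check_index(n_observed_kmers, ksize, n_occupied, table_size, n_tables,
                max_ratio=1e-4, max_false_positive=0.2, alphabet_size=20):
    """Raise ValueError if the table size or the k-mer size looks too small.

    max_ratio=1e-4 is the empirical cutoff from this PR description;
    max_false_positive=0.2 is a placeholder chosen for this sketch.
    """
    fp_rate = expected_false_positive_rate(n_occupied, table_size, n_tables)
    if fp_rate > max_false_positive:
        raise ValueError(
            f"Expected false positive rate {fp_rate:.3g} exceeds "
            f"{max_false_positive}; increase the table size."
        )
    ratio = kmer_space_ratio(n_observed_kmers, ksize, alphabet_size)
    if ratio > max_ratio:
        raise ValueError(
            f"Observed/theoretical k-mer ratio {ratio:.3g} exceeds "
            f"{max_ratio}; increase the k-mer size."
        )
    return ratio, fp_rate


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the checks.
    ratio, fp_rate = check_index(
        n_observed_kmers=5_120_000, ksize=9,
        n_occupied=5_000_000, table_size=100_000_000, n_tables=4,
    )
    print(f"ratio={ratio:.3g}, false positive rate={fp_rate:.3g}")
```

Under this sketch, the 3.643015e-05 ratio reported above for a k-mer size of 9 sits below the 1e-4 cutoff, so that index would be accepted. A much smaller k-mer size on the same data would push the ratio over the cutoff and stop the index from being built.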
[x] This comment contains a description of changes (with reason)
[x] If you've fixed a bug or added code that should be tested, add tests!
[x] Ensure the test suite passes (command to run: `pytest`, or `make coverage` if you want to see which lines don't have tests yet)
[x] Make sure your code is linted and autoformatted using `black` (`black . --check`).
[ ] Documentation in usage.md is updated
[ ] README.md is updated
@bluegenes this "should" help by giving some semi-useful error messages when creating the index, so that "bad" indices, and thus bad translation results, don't get created.
Also accounts for `*` (stop codon), `X` (unknown amino acid), and `U` (selenocysteine) residues (https://github.com/czbiohub/sencha/issues/72)