Closed bluegenes closed 4 years ago
@bluegenes the code generally looks good to me, just tiny comments on names, logging
are you going to use this in a kmermaid pipeline, then you might have to change the version here - https://github.com/nf-core/kmermaid/blob/dev/environment.yml#L38 to master
@olgabot please review when you get a chance, thanks!
Oops I forgot to leave a comment. Thanks for the PR! @pranathivemuri got a lot of the big things and my comments are more nitpicky for clarity. One big question is:
What if someone provides a folder of bloom filters with --peptides-are-bloom-filter
? What is the expected behavior there? Do the bloom filters get merged, or does only the first one encountered by os.listdir
get used? For simplicity, I think the easiest thing is to only allow one item when --peptides-are-bloom-filter
for now, and consider merging indexes/bloom filter as a future feature.
What do you think?
Here's some example logging code:
import logging
# Create a logger
logging.basicConfig(format='%(name)s - %(asctime)s %(levelname)s: %(message)s')
logger = logging.getLogger(__file__)
logger.setLevel(logging.INFO)
logger.info("Did not specify a sketch size or scale with any of "
"--sketch_size, --sketch_size_log2, --sketch_scaled, --sketch_scaled_log2! "
"Falling back on sourmash's default of --sketch_scaled 500")
What if someone provides a folder of bloom filters with
--peptides-are-bloom-filter
? What is the expected behavior there? Do the bloom filters get merged, or does only the first one encountered byos.listdir
get used? For simplicity, I think the easiest thing is to only allow one item when--peptides-are-bloom-filter
for now, and consider merging indexes/bloom filter as a future feature.
index-from-dir
is only an option in index
, not translate
, so it only handles cases with fasta files that can be read by screed (doesn't handle bloom filters at all). You could extend the functionality to translate
, but I don't think it's necessary and it adds this complication.
In translate
, The cli allows only one peptide/bloom filter input, right? (@click.argument("peptides", nargs=1)
)
So we should only ever have to deal with one bloom filter?
In
translate
, The cli allows only one peptide/bloom filter input, right? (@click.argument("peptides", nargs=1)
) So we should only ever have to deal with one bloom filter?
For now, yes, we're only dealing with one bloom filter. This may change in the future but for now let's assume that's the case :)
motivation: I wanted to build an index from a directory with many protein files.
I created a new flag
--index-from-dir
forsencha index
. If this flag is specified, thepeptides
supplied are considered a directory, which we look in for*fa.gz
,*.faa.gz
,*.fa
files. Default functionality remains the same, with the exception thatpeptides
are handled as a list[peptides]
Main changes: modified functions to accept a list of peptides files rather than a single file, and modified
make_peptide_bloom_filter
to actually loop through and use these files for the bloom filter.This is likely not be the best way to do this, but it's working for now (index tests passing).
thoughts and ideas @olgabot @pranathivemuri?