build index from directory of peptide files

czbiohub-sf / orpheum

Orpheum (Previously called and published under sencha) is a Python package for directly translating RNA-seq reads into coding protein sequence.

MIT License

18 stars 4 forks source link

build index from directory of peptide files #66

Closed bluegenes closed 4 years ago

bluegenes commented 4 years ago

motivation: I wanted to build an index from a directory with many protein files.

I created a new flag --index-from-dir for sencha index. If this flag is specified, the peptides supplied are considered a directory, which we look in for *fa.gz, *.faa.gz, *.fa files. Default functionality remains the same, with the exception that peptides are handled as a list [peptides]

Main changes: modified functions to accept a list of peptides files rather than a single file, and modified make_peptide_bloom_filter to actually loop through and use these files for the bloom filter.

This is likely not be the best way to do this, but it's working for now (index tests passing).

thoughts and ideas @olgabot @pranathivemuri?

pranathivemuri commented 4 years ago

@bluegenes the code generally looks good to me, just tiny comments on names, logging

pranathivemuri commented 4 years ago

are you going to use this in a kmermaid pipeline, then you might have to change the version here - https://github.com/nf-core/kmermaid/blob/dev/environment.yml#L38 to master

pranathivemuri commented 4 years ago

@olgabot please review when you get a chance, thanks!

olgabot commented 4 years ago

Oops I forgot to leave a comment. Thanks for the PR! @pranathivemuri got a lot of the big things and my comments are more nitpicky for clarity. One big question is:

What if someone provides a folder of bloom filters with --peptides-are-bloom-filter? What is the expected behavior there? Do the bloom filters get merged, or does only the first one encountered by os.listdir get used? For simplicity, I think the easiest thing is to only allow one item when --peptides-are-bloom-filter for now, and consider merging indexes/bloom filter as a future feature.

What do you think?

olgabot commented 4 years ago

Here's some example logging code:

import logging

# Create a logger
logging.basicConfig(format='%(name)s - %(asctime)s %(levelname)s: %(message)s')
logger = logging.getLogger(__file__)
logger.setLevel(logging.INFO)

        logger.info("Did not specify a sketch size or scale with any of "
                    "--sketch_size, --sketch_size_log2, --sketch_scaled, --sketch_scaled_log2! "
                    "Falling back on sourmash's default of --sketch_scaled 500")

bluegenes commented 4 years ago

What if someone provides a folder of bloom filters with --peptides-are-bloom-filter? What is the expected behavior there? Do the bloom filters get merged, or does only the first one encountered by os.listdir get used? For simplicity, I think the easiest thing is to only allow one item when --peptides-are-bloom-filter for now, and consider merging indexes/bloom filter as a future feature.

index-from-dir is only an option in index, not translate, so it only handles cases with fasta files that can be read by screed (doesn't handle bloom filters at all). You could extend the functionality to translate, but I don't think it's necessary and it adds this complication.

In translate, The cli allows only one peptide/bloom filter input, right? (@click.argument("peptides", nargs=1)) So we should only ever have to deal with one bloom filter?

olgabot commented 4 years ago

In translate, The cli allows only one peptide/bloom filter input, right? (@click.argument("peptides", nargs=1)) So we should only ever have to deal with one bloom filter?

For now, yes, we're only dealing with one bloom filter. This may change in the future but for now let's assume that's the case :)