Python-bioinformatics / bioinformatics

Pycon 2017 Bioinformatics Interest Group

Goals #1

Open malonge opened 7 years ago

malonge commented 7 years ago

We should think about what our goals are and make sure that we are providing something that isn't already available through projects such as biopython or scikit-bio.

Here are my specific needs:

File I/O

- The ability to iterate through sequence and alignment files with Python generators. (Currently available in biopython and scikit-bio)
- The ability to get a subset of sequences or alignments from a file given query headers. (Not sure if this is available elsewhere)
- The ability to query a sequence file with exact sequence matches or subsequence matches.
- The ability to read a whole sequence or alignment file into memory. (Available in biopython and scikit-bio)
- In all the above cases, I personally like the option to handle sequences or alignments as built-in strings/lists, or objects with added utilities. (Still figuring out if this is possible with [scikit-bio and their "into" keyword argument](http://scikit-bio.org/docs/0.2.3/io.html) )
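To make the File I/O items above concrete, here is a rough sketch of what I have in mind, using only built-in strings and generators. The function names (`iter_fasta`, `select_by_header`, `select_by_subsequence`) are made up for illustration and are not taken from biopython or scikit-bio. It assumes plain FASTA input and treats the first whitespace-delimited token of a header as the sequence ID.

```python
import io


def iter_fasta(handle):
    """Yield (header, sequence) tuples lazily from an open FASTA file handle."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select_by_header(records, wanted_ids):
    """Keep only records whose ID (first token of the header) is in wanted_ids."""
    wanted = set(wanted_ids)
    for header, seq in records:
        tokens = header.split()
        seq_id = tokens[0] if tokens else ""
        if seq_id in wanted:
            yield header, seq


def select_by_subsequence(records, query):
    """Keep only records whose sequence contains query as an exact substring."""
    for header, seq in records:
        if query in seq:
            yield header, seq


# Usage with an in-memory file; a real file opened with open() works the same way.
example = io.StringIO(">seq1 first record\nACGT\nACGT\n>seq2\nTTTT\n")
for header, seq in select_by_header(iter_fasta(example), {"seq2"}):
    print(header, seq)
```

Because everything is a generator, filters can be chained without reading the whole file into memory, and wrapping the result in `list(...)` gives the "read everything" case.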

Genome Assembly Utilities

I have not been able to find these utilities in biopython or scikit-bio:

- Genome assembly stats for a collection of sequences
- Various coverage calculations
- Objects for specific sequencing chemistries:
    - e.g. HiC and Mate Pair sequences and alignments
- Reference genome objects
    - Utilities for:
        - gaps
        - ambiguous sequences
        - assembly stats
        - lift over tools
        - subsequence searches
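As a starting point for the assembly stats and gap utilities listed above, something like the following might work. This is only a sketch: the function names and the returned dictionary keys are hypothetical, N50 is computed as the length of the shortest sequence in the smallest set of longest sequences that covers at least half of the total length, and gaps are assumed to be runs of `N`/`n`.

```python
import re


def n50(lengths):
    """Return the N50 of a collection of sequence lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0


def find_gaps(seq):
    """Return (start, end) half-open coordinates for each run of N/n in seq."""
    return [(m.start(), m.end()) for m in re.finditer(r"[Nn]+", seq)]


def assembly_stats(seqs):
    """Summarize a collection of sequences given as plain strings."""
    seqs = list(seqs)
    lengths = [len(s) for s in seqs]
    total = sum(lengths)
    gc_count = sum(s.upper().count("G") + s.upper().count("C") for s in seqs)
    return {
        "num_sequences": len(seqs),
        "total_length": total,
        "longest": max(lengths) if lengths else 0,
        "n50": n50(lengths),
        # GC fraction over all bases, including ambiguous ones
        "gc_fraction": gc_count / total if total else 0.0,
        "num_gaps": sum(len(find_gaps(s)) for s in seqs),
    }


stats = assembly_stats(["ACGTACGT", "GGNNNCC", "AT"])
print(stats["n50"], stats["num_gaps"])
```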

Those are just what come to mind for me. Please let me know if there are python packages out there that address these things well already.

JHibbard commented 7 years ago

Other languages have some of the utilities listed above. A quick-win approach might be to create wrappers for these types of programs and start replacing the functionality with pure-Python solutions as needed. We could provide or find Docker images with the non-Python libraries installed to ease installation.
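A minimal sketch of the wrapper idea, assuming the external program reads arguments from the command line and writes its result to stdout. `run_tool` and `ToolNotFoundError` are hypothetical names, and the Python interpreter itself stands in for a real bioinformatics tool so the example runs anywhere.

```python
import shutil
import subprocess
import sys


class ToolNotFoundError(RuntimeError):
    """Raised when the wrapped executable is not installed or not on PATH."""


def run_tool(executable, args, stdin_text=None):
    """Run an external program and return its stdout as a string.

    Raises ToolNotFoundError if the executable cannot be located, and
    subprocess.CalledProcessError if it exits with a non-zero status.
    """
    path = shutil.which(executable)
    if path is None:
        raise ToolNotFoundError(f"{executable!r} was not found on PATH")
    result = subprocess.run(
        [path, *args],
        input=stdin_text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Stand-in for a real tool: ask the Python interpreter to reverse a sequence.
output = run_tool(sys.executable, ["-c", "print('ACGT'[::-1])"])
print(output.strip())
```

Keeping the subprocess call behind one function means a pure-Python replacement could later be swapped in without changing the callers.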

I'm also interested in creating RESTful APIs for common biology tasks. If we described them with a schema like OpenAPI/Swagger, we could generate client SDKs in many languages automatically with OpenAPI's codegen tools.