genomika / biopandas

Biopandas provides tools for the analysis and comprehension of high-throughput genomic data.

Sequence Representation in Biopandas #2

Open marcelcaraciolo opened 10 years ago

marcelcaraciolo commented 10 years ago

The first feature is to give Biopandas a mechanism for representing sequence data. By sequence we mean a string-like object that represents a biological sequence over a defined alphabet.

Examples: Sequence (Generic Container), DNASequence, RNASequence, AASequence.
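To make the idea concrete, here is a minimal sketch of what such a hierarchy could look like. It is only an illustration, not the Biopandas implementation: the class bodies, the `alphabet` attribute, and the simplified alphabets (no IUPAC ambiguity codes beyond `N`) are all assumptions.

```python
class Sequence:
    """Generic container: an immutable string plus an optional alphabet."""

    # Assumption: None means "accept any character".
    alphabet = None

    def __init__(self, data):
        data = str(data)
        if self.alphabet is not None:
            invalid = set(data) - self.alphabet
            if invalid:
                raise ValueError(
                    f"{type(self).__name__}: invalid characters {sorted(invalid)}"
                )
        self._data = data

    def __len__(self):
        return len(self._data)

    def __str__(self):
        return self._data

    def __repr__(self):
        return f"{type(self).__name__}({self._data!r})"

    def __eq__(self, other):
        # Two sequences are equal only if type and content both match.
        return type(self) is type(other) and self._data == other._data

    def __hash__(self):
        return hash((type(self).__name__, self._data))


class DNASequence(Sequence):
    # Simplified alphabet (assumption): four bases plus N for "unknown".
    alphabet = frozenset("ACGTN")


class RNASequence(Sequence):
    alphabet = frozenset("ACGUN")


class AASequence(Sequence):
    # The 20 standard amino acids, one-letter codes.
    alphabet = frozenset("ACDEFGHIKLMNPQRSTVWY")


dna = DNASequence("ACGT")
```

With this sketch, `DNASequence("ACGU")` raises a `ValueError` because `U` is not in the DNA alphabet, while the generic `Sequence` accepts any characters.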

References:

http://www.bioconductor.org/packages/release/bioc/manuals/Biostrings/man/Biostrings.pdf

https://github.com/biopython/biopython/

luizirber commented 10 years ago

Maybe extend the numpy string type? By default, when you load a string in Pandas it is stored as an object, which is not very efficient. Since sequences are rarely changed after creation, it might be a good tradeoff.

See also: https://github.com/pydata/pandas/issues/5261
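A rough sketch of the difference, using plain numpy only. The sample reads and variable names are made up for illustration, and this is not a benchmark:

```python
import numpy as np

reads = ["ACGT", "GGCC", "TTAA"]

# object dtype: the array holds pointers to separate Python str objects.
as_objects = np.array(reads, dtype=object)

# Fixed-width bytes dtype: the characters live in one contiguous buffer,
# 4 bytes per read here.
as_bytes = np.array(reads, dtype="S4")

# A single sequence can also be viewed as an array of ASCII codes.
# np.frombuffer over an immutable bytes object gives a read-only array,
# which fits sequences that are rarely changed after creation.
codes = np.frombuffer(b"ACGTACGT", dtype=np.uint8)

# Vectorized operations then work directly on the codes.
g_count = int((codes == ord("G")).sum())
```

The fixed-width layout wastes space when read lengths vary a lot, since every element is padded to the longest one, so it is a tradeoff rather than a free win.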

luizirber commented 10 years ago

Also interesting: http://scikit-bio.org/

marcelcaraciolo commented 10 years ago

Hi @luizirber, there are four strategies we can follow, or you may suggest another approach. I will list them here:

A.1) We can create our own data structures, ignoring the work at scikit-bio, and follow one alternative for sequences by writing a Pyrex extension such as this: https://gist.github.com/marcelcaraciolo/11297014. The advantage is that we can model sequences with our own optimized data structures, unlike the biopython approach, where Sequence extends object and uses a basestring object as the underlying sequence (sub-optimal, as is the class hierarchy).

A.2) Alternatively, we could use numpy.string + cython. There is a ticket in scikit-bio about it, take a look: https://github.com/biocore/scikit-bio/issues/60. They are running some benchmarks on the approaches. Currently they use the Sequence abstract base class from collections, which is a nice approach by the way: https://docs.python.org/2/library/collections.html#collections.Sequence

A.3) We could help the scikit-bio project, use only their data structures, and focus our work on our own architecture and approaches for translation, sequencing, alignment, etc. The drawback is that in the future there may be a clash between two main projects under development with the same goals.

A.4) Close this project now, fork scikit-bio, and contribute only to them. It's an option too. However, I don't know their goals or whether their release plan fits my own research plans.
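As a small illustration of the collections Sequence approach mentioned in A.2, here is a sketch of a read-only container built on the abstract base class. In Python 3 the class lives in `collections.abc`. The `ByteSequence` name and its bytes-backed storage are hypothetical; this is not scikit-bio's code, and a numpy array could be swapped in for the storage.

```python
from collections.abc import Sequence


class ByteSequence(Sequence):
    """Hypothetical read-only sequence stored as ASCII bytes."""

    def __init__(self, data):
        # Accept either str or bytes-like input.
        self._data = data.encode("ascii") if isinstance(data, str) else bytes(data)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        # Slices return a new container; integer indexes return one character.
        if isinstance(index, slice):
            return type(self)(self._data[index])
        return chr(self._data[index])

    def __repr__(self):
        return f"{type(self).__name__}({self._data.decode('ascii')!r})"


seq = ByteSequence("GATTACA")

# Only __len__ and __getitem__ are defined above. The ABC supplies
# count, index, __contains__, __iter__ and __reversed__ as mixin methods.
a_count = seq.count("A")
first_t = seq.index("T")
has_c = "C" in seq
sub = seq[1:4]
```

The appeal of this design is that two methods are enough to get the full read-only sequence interface, and the storage behind them can change later without touching callers.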

What are your suggestions and opinions?

Regards,

Marcel

luizirber commented 10 years ago

The best thing about Pandas is the workflow, since everything you can do with it can also be done with numpy/scipy, but with more intermediate steps. I don't think scikit-bio will focus on Pandas workflows, so A.4 is not really an option.

I think we should avoid writing yet another implementation of the basic building blocks and focus on the high-level workflow, using scikit-bio as a building block. And if we need to implement missing features in scikit-bio, they benefit from it as much as we do.

I think A.3 is the best way.

luizirber commented 10 years ago

scikit-bio is already using pandas: https://github.com/biocore/scikit-bio/search?p=1&q=pandas&ref=cmdform

A recent discussion: https://github.com/biocore/scikit-bio/issues/241

ElDeveloper commented 10 years ago

Hey @marcelcaraciolo, @luizirber, I'm a scikit-bio dev. We just came across this issue while going through our own issues. Let us know if you need anything from us. We are in active development, and we are interested in helping others and in finding contributors/collaborators.

cc @gregcaporaso