TheFraserLab / ASEr

Get ASE counts from BAMs or raw fastq data -- repackage of pipeline by Carlo Artieri
MIT License

Change input/output tables to be handled by pandas #4

Open MikeDacre opened 8 years ago

MikeDacre commented 8 years ago

This would make the code easier to maintain, and would make it simpler to calculate and plot statistics.

petercombs commented 8 years ago

I'm unsure how much maintainability there is to be gained by switching the code to pandas. Part of the problem here is the A|C|G|T format for each SNP, since it violates the implicit pandas assumption that each column holds a single entity (one number or one string).

See Pull request #9 for an example of how much the code can be cleaned up. Some quick testing suggests that there's no major speed benefit or penalty to switching the output code to pandas.
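
To make the A|C|G|T concern concrete, here is a minimal sketch of unpacking such a column in pandas. The table, the column names (`snp`, `counts`), and the reading of the field as four pipe-delimited per-base counts are assumptions for illustration, not ASEr's actual schema.

```python
import pandas as pd

# Hypothetical table: column names and the packed "A|C|G|T" count layout
# are assumptions for illustration, not ASEr's actual output format.
df = pd.DataFrame({
    "snp": ["chr1_100", "chr1_250"],
    "counts": ["12|0|3|0", "0|7|0|9"],
})

# Split the packed string into one integer column per base.
split = df["counts"].str.split("|", expand=True).astype(int)
split.columns = ["A", "C", "G", "T"]

# Reattach the SNP identifier so each column holds a single value.
wide = pd.concat([df[["snp"]], split], axis=1)
print(wide)
```

Once the counts sit in separate numeric columns, ordinary column operations such as `wide[["A", "C", "G", "T"]].sum(axis=1)` work without any string parsing.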

MikeDacre commented 8 years ago

Two questions:

  1. Could we change the A|C|G|T format from a single column to 4 columns with 0/1 for present/absent, or is that too complex?
  2. Do you think it is worth it? What I like about pandas is that it allows future extensibility: we could add code later that could use the data-frame, or we could even output the data-frame with pickle to be used programmatically later with a single import.
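
Both ideas above can be sketched roughly as follows. This assumes a hypothetical `alleles` column listing the bases observed at each SNP; the column names and the file name `ase_table.pkl` are made up for the example and are not part of ASEr.

```python
import os
import tempfile

import pandas as pd

# Hypothetical input: "alleles" lists the bases observed at each SNP.
df = pd.DataFrame({
    "snp": ["chr1_100", "chr1_250"],
    "alleles": ["A|G", "C|T"],
})

# Question 1: one 0/1 column per base (1 = present, 0 = absent).
flags = df["alleles"].str.get_dummies(sep="|")
flags = flags.reindex(columns=list("ACGT"), fill_value=0)
table = pd.concat([df[["snp"]], flags], axis=1)

# Question 2: write the data frame with pickle and load it back later.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "ase_table.pkl")  # hypothetical file name
    table.to_pickle(path)
    restored = pd.read_pickle(path)

print(restored)
```

The reloaded frame keeps its column names and dtypes, so downstream code could work with it directly after a single `pd.read_pickle` call.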