dcjones / quip

Compressing next-generation sequencing data with extreme prejudice.
http://www.cs.washington.edu/homes/dcjones/quip/
BSD 3-Clause "New" or "Revised" License

feature request: support for colorspace sequence? #22

Open biocyberman opened 10 years ago

biocyberman commented 10 years ago

I see that quip is very practical for NGS work. For colorspace sequences, I can generate BAM files and compress them with quip. However, it would be more useful if quip supported colorspace sequences in CSFASTQ or, preferably, in the XSQ file format.

This link contains some information about XSQ: http://www.lifetechnologies.com/dk/en/home/technical-resources/software-downloads/xsq-software.html

This is what I am thinking: currently quip works with the base-space "character set" (A, C, T, G, and N). If quip can be generalized to work on any character set, then it can work with the colorspace character set (0, 1, 2, 3, and "." dot), or in other applications. If that is not so easy to implement, it is still possible to translate a colorspace sequence into a "fake" base-space sequence (e.g. tr '0123.' 'ACTGN') and handle the rest as a base-space sequence. This would solve the CSFASTQ case right away. Binary XSQ files are a bit more complicated, but I think we can discuss them afterward.
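The "fake" base-space translation suggested above can be sketched in a few lines of Python. This is only an illustration, not how quip works: the mapping order (0→A, 1→C, 2→T, 3→G, .→N) follows the order listed in the comment and is an arbitrary choice, the function names `encode_read` and `decode_read` are hypothetical, and the handling of a leading primer base (as in `T0123.`) is an assumption about the input.

```python
# Assumed one-to-one mapping between colorspace symbols and "fake" bases.
# Any bijection would work; this one follows the order used in the comment.
COLOR_TO_BASE = str.maketrans("0123.", "ACTGN")
BASE_TO_COLOR = str.maketrans("ACTGN", "0123.")


def encode_read(cs_read):
    """Translate a colorspace read into a (primer, fake base-space read) pair.

    If the read starts with a primer base (A, C, G or T), that base is split
    off and returned separately so the translation stays reversible.
    """
    primer = ""
    if cs_read and cs_read[0] in "ACGT":
        primer, cs_read = cs_read[0], cs_read[1:]
    return primer, cs_read.translate(COLOR_TO_BASE)


def decode_read(primer, fake_read):
    """Invert encode_read: map fake bases back to colors and restore the primer."""
    return primer + fake_read.translate(BASE_TO_COLOR)


if __name__ == "__main__":
    primer, fake = encode_read("T0123.")
    print(primer, fake)
    print(decode_read(primer, fake))
```

Because the mapping is one-to-one, decoding the compressed output recovers the original colorspace read exactly, as long as the primer base is stored alongside it.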

dcjones commented 10 years ago

Adding support for CSFASTQ wouldn't be hard at all. So far I haven't bothered since I personally don't work with colorspace data much, and no one has asked me to until now.

XSQ would be more difficult for a couple of reasons.

  1. XSQ is implemented in Java and quip is in C. It's not impossible to call Java code from C, but it would be painful and ugly.
  2. More importantly, XSQ has a restrictive license. I'm not a lawyer, but I don't think I could legally use it as part of quip:

    3.2.5 You agree not to modify, sell, rent, transfer (except temporarily in the event of a computer malfunction), resell for profit, or distribute this license or the Software, or create derivative works based on the Software, or any part thereof or any interest therein.

I don't know why these guys think it's a good idea to put pointless restrictions on their software, but so it is.

That said, XSQ is based on an actual open format (HDF5), so writing my own parser isn't totally out of the question.

biocyberman commented 10 years ago

Great to hear a positive response to this request :-) You are right that XSQ is based on HDF5. I am actually testing some Python scripts I wrote to manipulate XSQ files, based on the h5py package (http://www.h5py.org/), which is a Python interface to HDF5's C libraries. After taking a brief look at the HDF5 documentation (http://www.hdfgroup.org/HDF5/doc/index.html), I believe that HDF5 supports Python, Java, C, and Fortran natively.

Regarding the restrictive license: I think it applies to the XSQ tools that Lifetech releases, not to the file format or the XSQ data files themselves. From what I understand, there is no licensing issue with developing tools that work with XSQ files. From page 533 of this Advanced User Guide:

What software accepts the XSQ file format? Initially, only LifeScope™ Software supports the new format. Life Technologies Corporation is working with third-party developers to adapt their workflows to support the new chemistry and data format.

And for your information, here are some XSQ tools on GitHub: https://github.com/search?q=XSQ+solid&ref=cmdform

biocyberman commented 10 years ago

Maybe I missed part of your point. Yes, you may need to write a parser that handles the data, metadata, and attributes inside an XSQ file, using the HDF5 C libraries.

dcjones commented 10 years ago

Ok, I see. If it's just a simple, well-documented HDF5 schema, then it shouldn't be too hard.

biocyberman commented 10 years ago

Awesome! If you need a sample XSQ or CSFASTQ file, please let me know. I will find out whether I can send a minimal, small XSQ file. I have many XSQ files over 8 GB, which are too large to send around.