PacificBiosciences / FALCON

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries

support for compressed fasta files #548

Open aclum opened 7 years ago

aclum commented 7 years ago

This is a feature request. It would be useful if FALCON supported commonly used compression formats for input files.

pb-jchin commented 7 years ago

@aclum can you be more specific? fasta.gz fastq.gz?

pb-cdunn commented 7 years ago

We already support .fasta.gz and .dexta. I doubt anything out there is smaller or faster than .dexta.

pb-cdunn commented 7 years ago

Is TwoBit what you had in mind?

I was just discussing that format with a colleague. To us, that is a broken standard, but not terrible.

There's a 'standard' for storing FASTA files as .2bit files for compression, but I am befuddled as to why they chose T-00, C-01, A-10, G-11. If they had chosen A-T and C-G to be bitwise complements of each other, certain operations would become much simpler (e.g., you can reverse complement a kmer stored in a 32 or 64 bit value looplessly with bitops), and it just makes more sense. I use A-00, C-01, G-10, T-11, which is easy to remember because it is alphabetical order.
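
For illustration, here is a minimal sketch of the loopless reverse complement described above, assuming the A-00, C-01, G-10, T-11 encoding and a kmer of at most 32 bases packed into the low bits of a 64-bit value. It is not code from FALCON or from any .2bit library; the names `pack`, `unpack` and `revcomp` are hypothetical helpers made up for this example. Because A/T and C/G are bitwise complements in this encoding, complementing every base is a single bitwise NOT, and reversing the base order is a fixed sequence of mask-and-shift swaps.

```python
MASK64 = (1 << 64) - 1
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # assumed encoding
DECODE = "ACGT"


def pack(kmer):
    """Pack a kmer (k <= 32) into an int, first base in the highest used bits."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODE[base]
    return value


def unpack(value, k):
    """Inverse of pack(): decode k bases from the low 2*k bits."""
    return "".join(DECODE[(value >> (2 * (k - 1 - i))) & 0b11] for i in range(k))


def revcomp(value, k):
    """Reverse complement a packed kmer without looping over its bases."""
    # Complement: A<->T and C<->G are bitwise complements, so NOT does it all.
    x = ~value & MASK64
    # Reverse the order of the 32 two-bit groups in the 64-bit word by
    # swapping neighbours at growing widths (2, 4, 8, 16, 32 bits).
    x = ((x >> 2) & 0x3333333333333333) | ((x & 0x3333333333333333) << 2)
    x = ((x >> 4) & 0x0F0F0F0F0F0F0F0F) | ((x & 0x0F0F0F0F0F0F0F0F) << 4)
    x = ((x >> 8) & 0x00FF00FF00FF00FF) | ((x & 0x00FF00FF00FF00FF) << 8)
    x = ((x >> 16) & 0x0000FFFF0000FFFF) | ((x & 0x0000FFFF0000FFFF) << 16)
    x = ((x >> 32) | (x << 32)) & MASK64
    # The kmer now sits in the top 2*k bits; shift the padding away.
    return x >> (64 - 2 * k)


print(unpack(revcomp(pack("GATTACA"), 7), 7))
```

With the T-00, C-01, A-10, G-11 assignment the NOT step would not work as written, since T/A and C/G are not bitwise complements there; complementing would instead need something like an XOR with a repeated 10 pattern.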

aclum commented 7 years ago

I was thinking of .gz. We tried it with the SMRT Link code and it didn't work. When was support for this added?

On Wed, Apr 12, 2017 at 2:06 PM, Christopher Dunn notifications@github.com wrote:

Is TwoBit https://genome.ucsc.edu/goldenpath/help/twoBit.html what you had in mind?
