JohnLonginotto / uq

Encode FASTQ files to be much smaller, both compressed and uncompressed.

OverflowError: integer 510 does not fit 'uint8_t' #1

Open creggian opened 6 years ago

creggian commented 6 years ago

Hello,

I am exploring the uQ tool for FASTQ compression and I got this error on the first run. The version I used is the one at commit 2aef5e2 (the latest).

Pass 2 of 4: Now QNAME delimiters are known, QNAMEs are being analysed. (98.47%)
Optimal encoding method for delimited data in QNAME determined! (0.369165285428 minutes)
    - Column 3 is type mapping stored as uint8
    - Column 3 is type integers stored as uint16
    - Column 3 is type integers stored as uint16

Traceback (most recent call last):
  File "/u/creggian/programmi/uq/uq.py", line 707, in <module>
    if variable_read_lengths: arrays = encoder_variable(status.total,dna_bases,quals,N_qual,dna_columns_needed,qual_columns_needed,status,args.input,bits_per_base,bits_per_quality)
  File "/u/creggian/programmi/uq/uq.py", line 242, in encoder_variable
    dna_array[row][dna_byte_position]   = temp_dna  + (variable_read_lengths << (dna_bits_done))  ; dna_byte_position -= 1
OverflowError: integer 510 does not fit 'uint8_t'

Can you help me with this? And is this project still maintained?

Thanks Claudio

JohnLonginotto commented 5 years ago

Hi Claudio!

uQ was never really maintained. It was a one-and-done mini-project I took on in the middle of writing my thesis to demonstrate that FASTQ is highly inefficient, and that the work/grant money that has gone into creating FASTQ compression and manipulation tools was a waste of time. It was an "if the FASTQ standard had looked like this, several issues in bioinformatics wouldn't exist" demonstration. However, it was consciously designed to be similar enough to FASTQ that the two could be compared. This dependency on being "FASTQ-like" for an apples-to-apples comparison is why uQ is not maintained.

FASTQ (and consequently BAM and its derivatives like CRAM) have a much more substantial "bug" in their design that goes beyond simple bad encoding practices. Fundamentally, FASTQ stores reads and BAM stores alignments, while a sequencing machine sequences fragments - a super-set of the prior two things. My thesis on why this distinction is relevant, and how it causes a lot of confusion in the field because many people don't actually know it, was unfortunately never published. I had a mental breakdown, basically, and dropped out of science for the past year, as a result of trying to explain the BAM format to a biologically-inclined thesis committee. I'm not joking, I wish I was.

The error you are seeing isn't related to the uint8 value seen in the QNAME parser. That's just a happy coincidence. The error is generated from this line:

dna_array[row][dna_byte_position] = temp_dna + (variable_read_lengths << (dna_bits_done))

which is in the main ASCII-to-binary loop. So that's a pretty big deal, and I can't fix it without a FASTQ file that generates the error. It's probably the variable read lengths that are causing the issue - i.e. you have a FASTQ read that isn't the same length as all the others. I did test for this, but clearly something is wrong.
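The failure mode can be reproduced with plain integer arithmetic: if the length-flag bit is shifted past the top of the current byte, the packed value no longer fits in the 0..255 range of a uint8 element. A minimal sketch - the concrete values for `temp_dna` and `dna_bits_done` are hypothetical, chosen only to reproduce the 510 from the traceback:

```python
# Hypothetical values illustrating the overflow in the bit-packing step.
temp_dna = 254             # DNA bits already packed into the current byte
variable_read_lengths = 1  # truthy flag, used directly as an integer
dna_bits_done = 8          # bit offset; 8 is already past the top of a byte

value = temp_dna + (variable_read_lengths << dna_bits_done)
print(value)               # 510: outside the 0..255 range a uint8 can hold
```

Any fix would presumably need to either mask the value back down to a byte or carry the overflowing bit into the next byte position.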

I estimate I can fix this in 3 days if you still wish to use uQ for some comparison yourself, and I'm currently unemployed so it's no problem, haha, I've got the time. But ultimately I suggest you stay away from uQ, FASTQ, and BAM as best you can. You cannot do decent genomics with these formats, because you can't store the information you really want: the sequenced fragment.

creggian commented 5 years ago

Hi John, it took me a while to reply and I appreciate your patience.

Yes, many bioinformaticians complain about the design of the de facto genomic formats. I appreciate your analysis in that direction, and I am sorry for the situation you went through and had to handle.

I am exploring ways to compress genomic data losslessly, which is the reason I tried uQ. Is it possible to read your thesis even though it is at draft stage?

Claudio

JohnLonginotto commented 5 years ago

Sorry for the delay in replying Claudio, I don't come to GitHub that often these days and it totally fell off my radar for a few months there!

If it is OK with you, I will try to publish the first chapter of the thesis on GitHub in the next day or two. The first chapter is about 80 pages long but covers most of the common anti-patterns in bioinformatics and why much of it comes down to bad data formats. The second chapter is on uQ. The third is on the futility of adapter marking and what annotation means for genomic data formats. The fourth covers the black box that is mapping and merging, how the bad algorithmic design choices are a result of inappropriate data formats, and how these unseen bugs in data formats could be skewing our understanding of the genome substantially. Chapter 5 is a theoretical BAM format without all these technical debts. Chapter 6 is analysis specific to the project, which I won't publish.

In Chapter 7, which I never wrote, I was going to talk about ACGTrie, which is another project on this GitHub created during the PhD. It might interest you, because ACGTrie was designed to be an efficient way of storing DNA - not FASTQ-specific, just DNA in general - in a "Trie" data structure, which has many useful properties for DNA analysis that are unlike what we have available when the data comes in a flat table as it does in FASTQ/BAM etc. The prior chapters demonstrate how clunky data formats tie the analysis of the data down to a certain subset of practical analyses, also known as Object-Relational Impedance Mismatch. It's a tough concept to explain in terms a biologist would grok, but I think I got there over 100 pages :P

But once you do get it, Chapter 7 is kind of an eye-opener, because when you take a peek at what the least-clunky trie data format for DNA is, it's some monstrous 20-bytes-per-node balloon. For comparison, ACGTrie was the first non-tabular data structure I ever made, and it's 20 bytes per node (minimum) up to an unbounded number of bytes, because it can store variable-length DNA - an advanced feature by genomics standards. The second version I designed on paper but never made a proof-of-concept of (PhD funding had ended at this point) had no minimum node size. Nodes were dynamic in size based on how many branches actually came off that node (and how much DNA they stored, as per v1), which dropped the bloat on some test data by, I think, 80% or so. It smashes the record for smallest trie in genomics by miles - and I'm not a good coder. There is low-hanging fruit here! If I was going to make a data format for storing DNA these days, I'd make a variable node-size trie. At the very least it's something publishable, if you're into that ;-)
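To make the trie idea concrete, here is a minimal radix-style DNA trie in Python - purely illustrative, and not ACGTrie's actual node layout or on-disk format. Each node stores a variable-length run of bases plus a count, and shared prefixes are stored only once, which is what gives trie structures their compactness and prefix-query properties:

```python
class Node:
    """One trie node: a run of bases, a count, and child branches."""
    __slots__ = ("seq", "count", "children")

    def __init__(self, seq=""):
        self.seq = seq        # variable-length run of bases in this node
        self.count = 0        # number of fragments ending at this node
        self.children = {}    # next base -> child Node

def insert(root, dna):
    """Insert one DNA fragment, splitting nodes where runs diverge."""
    node = root
    while dna:
        child = node.children.get(dna[0])
        if child is None:
            # No branch for this base yet: store the whole remainder.
            node.children[dna[0]] = child = Node(dna)
            i = len(dna)
        else:
            # Length of the prefix shared with the child's run of bases.
            i = 0
            while i < min(len(child.seq), len(dna)) and child.seq[i] == dna[i]:
                i += 1
            if i < len(child.seq):
                # Split the child so the shared prefix is its own node.
                split = Node(child.seq[:i])
                child.seq = child.seq[i:]
                split.children[child.seq[0]] = child
                node.children[dna[0]] = split
                child = split
        dna = dna[i:]
        node = child
    node.count += 1

root = Node()
for fragment in ("ACGT", "ACGA", "ACGT"):
    insert(root, fragment)

a = root.children["A"]
print(a.seq, sorted(a.children))   # ACG ['A', 'T']
print(a.children["T"].count)       # 2
```

The dict-of-children approach above is the "clunky" part a production format would replace: a variable node-size layout like the v2 design described here would pack each node's branches and bases contiguously instead of paying Python-object overhead per node.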

creggian commented 5 years ago

Hi John, thanks for writing back and for the long report. If publishing your thesis online takes too much time, you can send it to me privately.