biosails / pheniqs

Fast and accurate sequence demultiplexing

EOF error #21

Closed droplet-lab closed 4 years ago

droplet-lab commented 5 years ago

Hi, I was running Pheniqs the other day on cat'd fastq.gz files from different lanes, great tool! It runs really well until the very end, where it generates an EOF error:

Pheniqs mux --config run_demux.json
[W::bgzf_read_block] EOF marker is absent. The input is probably truncated
[W::bgzf_read_block] EOF marker is absent. The input is probably truncated
[W::bgzf_read_block] EOF marker is absent. The input is probably truncated
[W::bgzf_read_block] EOF marker is absent. The input is probably truncated

    {
        "multiplex": {
            "average classified confidence": 0.999889611317004,
            "average pf classified confidence": 0.999889611317004,
            "classified": [
                {
                    "ID": "GATCGTGT",
                    "PU": "GATCGTGT",
                    "average confidence": 0.999936941876671,
                    "average pf confidence": 0.999936941876671,
                    "barcode": [
                        "GATCGTGT"
                    ],
                    "concentration": 0.49,
                    "count": 52793403,
                    "index": 1,
                    "low conditional confidence count": 107840070,
                    "low confidence count": 107,
                    "pf count": 52793403,
                    "pf fraction": 1.0,
                    "pf pooled classified fraction": 0.436594251439055,
                    "pf pooled fraction": 0.11722381582157,
                    "pooled classified fraction": 0.436594251439055,
                    "pooled fraction": 0.11722381582157
                },
                {
                    "ID": "AGATATAA",
                    "PU": "AGATATAA",
                    "average confidence": 0.999852933930732,
                    "average pf confidence": 0.999852933930732,
                    "barcode": [
                        "AGATATAA"
                    ],
                    "concentration": 0.49,
                    "count": 68127573,
                    "index": 2,
                    "low conditional confidence count": 221602938,
                    "low confidence count": 49,
                    "pf count": 68127573,
                    "pf fraction": 1.0,
                    "pf pooled classified fraction": 0.563405748560944,
                    "pf pooled fraction": 0.151272197204688,
                    "pooled classified fraction": 0.563405748560944,
                    "pooled fraction": 0.151272197204688
                }
            ],
            "classified count": 120920976,
            "classified fraction": 0.268496013026259,
            "classified pf fraction": 1.0,
            "count": 450364140,
            "low conditional confidence count": 329443008,
            "low confidence count": 156,
            "pf classified count": 120920976,
            "pf classified fraction": 0.268496013026259,
            "pf count": 450364140,
            "pf fraction": 1.0,
            "unclassified": {
                "ID": "undetermined",
                "PU": "undetermined",
                "count": 329443164,
                "index": 0,
                "pf count": 329443164,
                "pf fraction": 1.0,
                "pf pooled classified fraction": 2.724450090445846,
                "pf pooled fraction": 0.73150398697374,
                "pooled classified fraction": 2.724450090445846,
                "pooled fraction": 0.73150398697374
            }
        }
    }

Would you perhaps be able to tell me how to solve this error? On my side, I downloaded the files multiple times to make sure the FASTQ files weren't corrupted. Thanks!

moonwatcher commented 5 years ago

Can you check the number of reads in the fastq files and compare it to the number of reads Pheniqs outputs?

I suspect it’s a small change in htslib and is actually a false positive because of some variation in the gzip end markers.

That error is emitted from htslib, not directly from Pheniqs.
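For readers who want to see what the warning is looking for: the BGZF format used by htslib conventionally ends with a fixed 28-byte empty block as an EOF marker, and files written by plain gzip (or concatenated with cat) normally do not end with it. The sketch below checks whether a file ends with that block. It assumes this is the check behind the warning; the function name `has_bgzf_eof` is hypothetical and not part of Pheniqs or htslib.

```python
import os

# The 28-byte empty BGZF block that BGZF writers append as an EOF marker,
# as listed in the SAM/BAM format specification.
BGZF_EOF = bytes.fromhex(
    "1f8b0804 00000000 00ff 0600 4243 0200 1b00 0300 00000000 00000000"
)


def has_bgzf_eof(path):
    """Return True if the file ends with the BGZF EOF marker block.

    Hypothetical helper for illustration only.
    """
    size = os.path.getsize(path)
    if size < len(BGZF_EOF):
        return False
    with open(path, "rb") as handle:
        # Read only the last 28 bytes of the file.
        handle.seek(-len(BGZF_EOF), os.SEEK_END)
        return handle.read() == BGZF_EOF
```

If this returns False for an input that otherwise decompresses cleanly, the file is most likely ordinary gzip rather than BGZF, which would be consistent with the warning being a false positive.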

droplet-lab commented 5 years ago

Hi, great. Do you mean the number of reads in the input and output, before and after Pheniqs? That is quite hard to do, as the input contains many different samples. But compared with another tool, Pheniqs seems to return the right number of reads, so I would guess it struggles on the last read!

moonwatcher commented 5 years ago

I assume your input is gzip-compressed FASTQ. Something like 'gzcat your_input_file.fastq.gz | wc -l' should give you the number of lines in one of the inputs. They should all have exactly the same number of lines, which is 4 times the number of reads (each read in the FASTQ format is exactly 4 lines).
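The same count can be sketched in Python, which also makes it easy to compare against the report Pheniqs prints. This is a minimal sketch: the helper names are hypothetical, and it assumes the top-level "count" field under "multiplex" in the report is the total number of reads processed.

```python
import gzip


def count_fastq_reads(path):
    """Count reads in a gzip-compressed FASTQ file (4 lines per read).

    Python's gzip module reads concatenated gzip members in sequence,
    so files joined with cat are counted in full. A genuinely truncated
    file raises EOFError while reading.
    """
    lines = 0
    with gzip.open(path, "rb") as handle:
        for _ in handle:
            lines += 1
    if lines % 4 != 0:
        raise ValueError(f"{path}: {lines} lines is not a multiple of 4")
    return lines // 4


def report_read_count(report):
    """Return the total read count from a parsed Pheniqs report.

    Assumption: report["multiplex"]["count"] is the total read count.
    """
    return report["multiplex"]["count"]
```

With the report loaded via `json.load`, comparing `count_fastq_reads("your_input_file.fastq.gz")` with `report_read_count(report)` shows whether any reads were lost at the end of the input.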

I have noticed that too with htslib 1.9. I suspect it's benign, but I will investigate further and report back.