blachlylab / fade

Fragmentase Artifact Detection and Elimination
MIT License

Non-printable characters in ab:Z field #33

Open divonlan opened 1 year ago

divonlan commented 1 year ago

Greetings, I am the author of Genozip, compression software for BAM/FASTQ/VCF and similar formats (www.genozip.com). One of our users opened a support ticket about a FADE-generated BAM file that failed to compress. After much investigation, it turned out that the cause was a single alignment (out of half a billion) whose ab:Z field appears to be entirely corrupted, consisting largely of non-printable characters:

ab:Z:^L!"^!"! !^\"""^K^^\ ^\^M^]^R^Z"^L^M!^N"""^M!^N^]^\ESC""^L ^N^O^]^H^P"""^^^^"^^!ESC!^R ^]!!#^^\ ^O
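For anyone who wants to locate records like this one, here is a minimal sketch. It assumes the alignments are available as SAM text lines (for example, the output of `samtools view`) and relies on the SAM specification's definition of a `Z`-type value as printable characters including space (`[ !-~]*`). The function name `find_bad_z_tags` is hypothetical and is not part of FADE or Genozip.

```python
import re

# SAM spec: a Z-type optional field value must match [ !-~]*
# (printable ASCII, space included).
PRINTABLE_Z = re.compile(r"[ !-~]*")

def find_bad_z_tags(sam_line):
    """Return the tags of Z-type optional fields whose value contains
    characters outside the printable range allowed by the SAM spec."""
    fields = sam_line.rstrip("\n").split("\t")
    bad = []
    # The first 11 columns are mandatory; optional TAG:TYPE:VALUE fields follow.
    for field in fields[11:]:
        parts = field.split(":", 2)
        if len(parts) == 3 and parts[1] == "Z" and not PRINTABLE_Z.fullmatch(parts[2]):
            bad.append(parts[0])
    return bad
```

Running this over every line of a file and reporting the lines for which the returned list is non-empty would single out an alignment such as the one above without needing to know its position in advance.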

In addition, you might want to consider adding a @PG line for FADE in the SAM header - this will make it easier for other software packages (like ours) to understand the properties of data with which we are interacting.
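As an illustration of how a downstream tool might use such a line, the sketch below collects the `@PG` records from a SAM header given as text (for example, the output of `samtools view -H`). The function name `program_records` is hypothetical, and the header values in the usage note are made up for illustration; this thread does not state what FADE actually writes in its `@PG` line.

```python
def program_records(header_text):
    """Collect the fields of each @PG line in a SAM header as a dict."""
    records = []
    for line in header_text.splitlines():
        if not line.startswith("@PG\t"):
            continue
        # After the record type, each header field is TAG:VALUE, tab-separated.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        records.append(fields)
    return records
```

Given a header containing a line such as `@PG<TAB>ID:fade<TAB>PN:fade<TAB>VN:0.6` (an assumed example, not FADE's confirmed output), a tool could look up the `PN` and `VN` values to learn which program and version produced the file.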

jblachly commented 1 year ago

Thanks for this report. @charlesgregory is this the same issue as #22 (which is fixed)?

Also, @divonlan I believe we have included an @PG line for quite some time (#1), so the user may be running a very old version of FADE?

divonlan commented 1 year ago

The user says he is running v0.2.2. That's all I know. Many users use Genozip to compress their historical mountain of genomic data, so it is not unusual for us to encounter files that were generated years ago.

jblachly commented 1 year ago

> The user says he is running v0.2.2. That's all I know. Many users use Genozip to compress their historical mountain of genomic data, so it is not unusual for us to encounter files that were generated years ago.

Aha; that version is 3 years old (July 2020). Currently we are at version 0.6 (https://github.com/blachlylab/fade/releases), and our recommendation is that the user upgrade.

If there are still any problems with 0.6 and the interaction with genozip, please let me know.