blachlylab / fade

Fragmentase Artifact Detection and Elimination
MIT License

ab:z tag conflicting with samfile format #19

Closed kristin-watchmaker closed 3 years ago

kristin-watchmaker commented 3 years ago

I've been having issues with the ab:z tag introducing newline characters into the SAM file. This hinders downstream analysis and, in some instances, prevents samtools from parsing the file. The same discrepancies are present in the SAM files (mostly kappa sonicated libraries) deposited in SRA from your article published in NAR Genomics and Bioinformatics, although none of the newline characters in those SRA files interfere with samtools.
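One way to locate the affected records is to look for alignment lines that have fewer than the 11 mandatory SAM fields, since a stray newline inside a tag value splits one record across two lines. The following is a minimal sketch, not part of fade; the function name `find_truncated_records` and the sample records are made up for illustration.

```python
import io

MANDATORY_FIELDS = 11  # QNAME through QUAL in the SAM specification


def find_truncated_records(handle):
    """Return (line_number, text) for non-header lines with fewer than 11 fields."""
    suspects = []
    for number, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if line.startswith("@"):
            # Header lines start with '@'. A broken-off fragment could in
            # principle also start with '@'; this sketch does not handle that.
            continue
        if len(line.split("\t")) < MANDATORY_FIELDS:
            suspects.append((number, line))
    return suspects


# Hypothetical data: the second record has a newline inside its ab:Z value.
good = "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\trs:i:33"
broken = "r2\t0\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII\tab:Z:AB\nCD"
sample = "@HD\tVN:1.6\n" + good + "\n" + broken + "\n"

print(find_truncated_records(io.StringIO(sample)))
```

Only the fragment after the stray newline is reported; the first half of the split record still has enough fields, so the line just before each reported one is worth inspecting too.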

charlesgregory commented 3 years ago

Yes, it is probably due to that tag holding raw base quality scores as opposed to Phred-scaled ones. I did not notice this because we mostly use the BAM format. I will get a fix and a new release out to address this soon.
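To illustrate why raw scores can break a text SAM file: a raw quality value of 10 is the same byte as the ASCII newline character, whereas the usual Phred+33 text encoding shifts every score into the printable range. This is a sketch of the suspected mechanism only, not fade's actual code; `encode_phred33` is a hypothetical helper name.

```python
def encode_phred33(raw_scores):
    """Map raw quality scores to printable characters by adding 33."""
    return "".join(chr(q + 33) for q in raw_scores)


raw = bytes([2, 10, 30, 40])          # example raw quality scores
as_raw_text = raw.decode("latin-1")   # score 10 becomes "\n"
as_phred33 = encode_phred33(raw)      # every score becomes a printable character

print(repr(as_raw_text))
print(repr(as_phred33))
```

In BAM the tag value is stored in a binary container, which may be why the problem went unnoticed until the output was written as SAM text.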

charlesgregory commented 3 years ago

This issue was closed via my commit. I have made a new release with a new binary that should fix the issue. Please let me know and reopen if this doesn't fix the issue or if you have any other issues.

charlesgregory commented 3 years ago

I have also updated the docker image now in case you were using fade via docker.

kristin-watchmaker commented 3 years ago

Hi Charles

The update did not seem to fix the issue. I've uploaded a screenshot of the newline introduced into the SAM file after FADE annotation (first image; the line is highlighted in blue). I also noticed another issue where an rs:i:33 tag is being placed on SAM records (image 2). I don't think this is intentional? If it would be easier, I can provide the BAM file in question. These two issues show up in ALL of my BAM files (a dozen or so), but not in any of your SAM files deposited in SRA that I've checked (I've only checked 10 of your SRA files).

[Image 1: Screen Shot 2021-07-26 at 10 49 15 AM]
[Image 2: Screen Shot 2021-07-26 at 10 56 41 AM]

charlesgregory commented 3 years ago

I see, that is troubling. If you could provide the original SAM/BAM file that produces the error (or at least the portion that causes it, plus the header), that would be great!

charlesgregory commented 3 years ago

Just to clarify the tags you are seeing: the rs, am, as, ar, and ab tags are added by fade during the annotate step. rs is a binary flag and shows up as a number. These tags are used later to determine artifact status and to either clip or remove the reads. rs:i:33 would indicate that a read has soft-clipping and a supplementary alignment, but no detected artifact. This description (https://github.com/blachlylab/fade/blob/master/TAGS.md) is a bit out of date but somewhat describes the tags.
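Since rs is a bit flag, a value such as 33 can be read by splitting it into the individual bits that are set. The sketch below does only that generic decomposition; it does not assume which bit stands for which property (that mapping is described in TAGS.md), and `set_bits` is a hypothetical helper name.

```python
def set_bits(flag):
    """Return the powers of two that are set in a non-negative integer flag."""
    return [1 << i for i in range(flag.bit_length()) if flag & (1 << i)]


print(set_bits(33))  # [1, 32]
```

Two bits are set in 33 (32 + 1), which is consistent with the two properties mentioned above, soft-clipping and a supplementary alignment.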

kristin-watchmaker commented 3 years ago

I have attached a bam file which contains a subset of 10000 samfile lines (with headers) which produce the newline error after fade annotation.


charlesgregory commented 3 years ago

> I have attached a bam file which contains a subset of 10000 samfile lines (with headers) which produce the newline error after fade annotation.

If you attached it via email as a response to this GitHub issue, it won't come through. You can post it directly on GitHub by clicking the bottom of the comment window and attaching the file.

kristin-watchmaker commented 3 years ago

WMG_subset_10000.bam.gz

This bam file has NOT been FADE annotated yet.

charlesgregory commented 3 years ago

Using the BAM file you provided, I can reproduce the issue with fade version v0.3.0. However, neither the newest version of fade (v0.3.6) installed via conda/bioconda nor the most recent binary (v0.3.1 downloaded here) produces the issue.

Are you sure you are using the most up-to-date fade version? Are you using the docker image? It's possible I may not have updated the image correctly on dockerhub.

charlesgregory commented 3 years ago

Actually, I just checked the docker image, and it appears to be up to date at v0.3.1 as well.

kristin-watchmaker commented 3 years ago

I thought I was using the most up-to-date binary version published on GitHub. I'll try again! I'll close the issue for now. If I am still encountering this error, I'll re-open.