genome / bam-readcount

Count bases in BAM/CRAM files
MIT License

NUL (\x00, ^@) and other control characters in output #107

Open Kaddea opened 2 months ago

Kaddea commented 2 months ago

Hi,

I'm using the bam_readcount wrapper "mgibio/bam_readcount_helper-cwl". The output files (snv or indel) contain control characters that the vcf_readcount_annotator cannot process.

Which substitution for the control characters is suitable for further processing?

Variation (vcf) 20 405939 . TTTC T . weak_evidence AS_FilterStatus=weak_evidence;AS_SB_TABLE=0,0|0,0;DP=1;ECNT=1;GERMQ=23;MBQ=0,32;MFRL=0,204;MMQ=60,60;MPOS=43;POPAF=7.3;TLOD=4.21;CSQ=-|upstream_gene_variant|MODIFIER|RBCK1|ENSG00000125826|Transcript|ENST00000356286.10|protein_coding|||||||||||2357|1||HGNC|HGNC:15864|1||| GT:AD:AF:DP:F1R2:F2R1:FAD:SB 0/1:0,1:0.667:1:0,1:0,0:0,1:0,0,1,0

bam_readcount output (indel) 20 405940 N 1 =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 A:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 C:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 G:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 T:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 -^@^@^@:1:255.00:0.00:0.00:1:0:0.88:0.03:0.00:1:0.42:101.00:0.42
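One way to pin down exactly which fields are affected is to scan the output for control characters. The following is a minimal Python sketch, not part of bam-readcount or the helper; the function names `find_control_fields` and `scan_file` are hypothetical, and it assumes the output file is tab-separated (the example line above is shown with its whitespace collapsed).

```python
import re

# ASCII control characters except tab, newline, and carriage return.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def find_control_fields(line, sep="\t"):
    """Return (column, field) pairs, 1-based, for fields with control characters.

    Assumption: fields are separated by `sep` (tab by default).
    """
    fields = line.rstrip("\n").split(sep)
    return [
        (col, field)
        for col, field in enumerate(fields, start=1)
        if CONTROL_CHARS.search(field)
    ]


def scan_file(path, sep="\t"):
    """Yield (line_number, column, repr(field)) for every affected field."""
    # latin-1 maps every byte to a character, so odd bytes cannot raise
    # a decoding error while scanning.
    with open(path, "r", encoding="latin-1") as handle:
        for line_no, line in enumerate(handle, start=1):
            for col, field in find_control_fields(line, sep):
                yield line_no, col, repr(field)
```

Calling `list(scan_file("sample.indel.tsv"))` (the file name is a placeholder) would list each affected line and column, with `repr` making the NUL bytes visible as `\x00`.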

chrisamiller commented 2 months ago

Weird. We have processed lots of bams through this type of workflow and I've never seen anything like that. Happy to take a look though. Can you provide a tiny example bam with the steps needed to recreate the problem?

Kaddea commented 2 months ago

Thanks for your help!! I've cropped one of the bam files and the corresponding vcf file (both from RNAseq reads) to reproduce the readcount output files. The strange characters in the output files appear only from column 11 on, and it seems only at sites with varying deletions (2-5 bases).
The files (BAM, VEP-annotated VCF, and the snv/indel TSV) can be downloaded from https://kaddea.com/s/J76BAJsg4d5zytN (approx. 45 MB). Sequence alignment and variant analysis are based on Ensembl GRCh38, release 110.

Best, Mathias

chrisamiller commented 2 months ago

Thank you. Can you also provide the exact commands that were used, along with software versions, etc.? I'm just trying to reproduce it on our end here.

Kaddea commented 1 month ago

read_count_pipeline.txt

Hmmm ... the attached file lists the steps for alignment, variant calling, annotation, and preparation for the read counts. I've omitted the mandatory parameters (input/output, etc.). Hope it helps ...

By the way, truncating the read-count output files to the first 10 columns lets the VCF annotation proceed, but I'm not sure about the validity of the resulting files ...

Mathias
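For reference, the two workarounds raised in this thread (keeping only the first 10 columns, or substituting the control characters) can be sketched in Python as below. This is only an illustration under assumptions: the output is taken to be tab-separated, the helper names `truncate_columns` and `strip_control_chars` are hypothetical, and nothing in this thread confirms that either result is valid input for the annotator. Truncation discards the indel column entirely, and stripping leaves a bare `-` label whose meaning is not established here.

```python
import re

# ASCII control characters except tab, newline, and carriage return.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def truncate_columns(line, keep=10, sep="\t"):
    """Keep only the first `keep` fields of a line (assumes `sep`-separated)."""
    fields = line.rstrip("\n").split(sep)
    return sep.join(fields[:keep])


def strip_control_chars(line):
    """Remove control characters but leave all columns in place."""
    return CONTROL_CHARS.sub("", line.rstrip("\n"))


def rewrite_file(src, dst, transform):
    """Apply `transform` to every line of `src` and write the result to `dst`."""
    # latin-1 round-trips every byte, so unexpected bytes cannot raise.
    with open(src, "r", encoding="latin-1") as fin, \
         open(dst, "w", encoding="latin-1", newline="\n") as fout:
        for line in fin:
            fout.write(transform(line) + "\n")
```

For example, `rewrite_file("in.tsv", "out.tsv", truncate_columns)` would reproduce the 10-column truncation, with both file names being placeholders.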