10XGenomics / bamtofastq

Convert 10x BAM files to the original FASTQs compatible with 10x pipelines
MIT License

Strange read-splitting behaviour #184

Open apredeus opened 6 months ago

apredeus commented 6 months ago

Dear bamtofastq developer team,

I recently came across some very interesting behaviour. I am trying to reprocess a public dataset that consists of 22 10x GEX runs (I've checked, and I'm fairly sure that none of them are ATAC etc.). Here is the link to the dataset:

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE138669

SRA failed to recognise the "technical" R1 read, so they have made the submitter's 10x BAM files available instead. After running the latest bamtofastq (v1.4.1) on these files (every job completed successfully), I found that the samples split into two groups. Group 1 (GSM4115877-GSM4115889) produced the expected index read, technical R1 (26 bp), and biological R2 (98 bp). Group 2 (GSM4115868-GSM4115876), however, produced four reads: index, R1 as the biological read (98 bp), R2 containing the cell barcode (16 bp), and R3 containing the UMI (10 bp).

GSM4115868 I1 R1 R2 R3
GSM4115869 I1 R1 R2 R3
GSM4115870 I1 R1 R2 R3
GSM4115871 I1 R1 R2 R3
GSM4115872 I1 R1 R2 R3
GSM4115873 I1 R1 R2 R3
GSM4115874 I1 R1 R2 R3
GSM4115875 I1 R1 R2 R3
GSM4115876 I1 R1 R2 R3
GSM4115877 I1 R1 R2
GSM4115878 I1 R1 R2
GSM4115879 I1 R1 R2
GSM4115880 I1 R1 R2
GSM4115881 I1 R1 R2
GSM4115882 I1 R1 R2
GSM4115883 I1 R1 R2
GSM4115884 I1 R1 R2
GSM4115885 I1 R1 R2
GSM4115886 I1 R1 R2
GSM4115887 I1 R1 R2
GSM4115888 I1 R1 R2
GSM4115889 I1 R1 R2

All BAM tags and headers appear to be the same, and the files were even made by the same version of Cell Ranger (v3, I think).

SRR10254548.bam AS BC CB CR CY HI li NH nM QT RE RG UB UR UY
SRR10254549.bam AS BC CB CR CY HI li NH nM QT RE RG UB UR UY xf
..............
SRR10254569.bam AS BC CB CR CY HI li NH nM QT RE RG UB UR UY xf

Do you know what is causing it, and how can I fix it?

For your convenience, here are (NCBI) links to one "offending" and one "normal-behaving" BAM file:

bad BAM: https://sra-pub-src-2.s3.amazonaws.com/SRR10254550/SC4possorted_genome_bam.bam.1
good BAM: https://sra-pub-src-2.s3.amazonaws.com/SRR10254567/SC185possorted_genome_bam.bam.1

Thank you in advance!

-- Alex

apredeus commented 6 months ago

OK, I have now realized that those samples were made with v1 chemistry. Is there a way to run bamtofastq so that it produces a normal pair of R1/R2 files, or do I have to combine them using some custom script?

Thank you in advance!

mortunco commented 3 months ago

Hi. I have the same issue with a Perturb-seq dataset (Dixit et al.). Thanks in advance.

RickyLau0910 commented 1 month ago

Hi, here is some guidance from 10x Genomics on "How to format v1 chemistry datasets to work with current Cell Ranger versions?", which I found helpful: https://kb.10xgenomics.com/hc/en-us/articles/360043386291-How-to-format-v1-chemistry-datasets-to-work-with-current-Cell-Ranger-versions
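
For anyone who would rather script the combination asked about earlier in the thread, here is a minimal Python sketch. It is not taken from the linked article and is not an official 10x method. It assumes the read layout reported above for the v1 samples (R2 = 16 bp cell barcode, R3 = 10 bp UMI) and that both files list the reads in the same order. The function names (`read_fastq`, `merge_barcode_umi`, `merge_files`) are made up for illustration. The merged 26 bp read keeps the header of the barcode read; whether a downstream pipeline accepts that header unchanged is an assumption worth checking.

```python
import gzip
from itertools import zip_longest


def read_fastq(handle):
    """Yield (header, sequence, quality) tuples from an open FASTQ text handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a trailing blank line)
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def read_id(header):
    """Return the read name, i.e. the header up to the first whitespace."""
    return header.split()[0]


def merge_barcode_umi(barcode_records, umi_records):
    """Concatenate barcode + UMI per read (assumed layout: 16 bp + 10 bp)."""
    for bc, umi in zip_longest(barcode_records, umi_records):
        if bc is None or umi is None:
            raise ValueError("barcode and UMI inputs have different read counts")
        if read_id(bc[0]) != read_id(umi[0]):
            raise ValueError(f"read name mismatch: {bc[0]} vs {umi[0]}")
        # Keep the barcode header; join sequences and qualities in the same order.
        yield bc[0], bc[1] + umi[1], bc[2] + umi[2]


def merge_files(barcode_path, umi_path, out_path):
    """Write a combined barcode+UMI FASTQ from two gzipped FASTQ files."""
    with gzip.open(barcode_path, "rt") as bc_fh, \
            gzip.open(umi_path, "rt") as umi_fh, \
            gzip.open(out_path, "wt") as out_fh:
        merged = merge_barcode_umi(read_fastq(bc_fh), read_fastq(umi_fh))
        for header, seq, qual in merged:
            out_fh.write(f"{header}\n{seq}\n+\n{qual}\n")
```

It could be called as `merge_files("sample_R2.fastq.gz", "sample_R3.fastq.gz", "merged_R1.fastq.gz")`, where all three file names are placeholders. The original 98 bp biological R1 file would then need to be renamed so that it serves as R2 next to the merged R1.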