10XGenomics / bamtofastq

Convert 10x BAM files to the original FASTQs compatible with 10x pipelines
MIT License
59 stars 6 forks source link

The bam data processed by bamtofastq is different from the original FASTQ file. #23

Closed Rui-Jing closed 3 years ago

Rui-Jing commented 4 years ago

Hi, Thanks for developing this powerful tool. But when I processed the BAM file from the CellRanger pipeline with bamtofastq, the results didn't match the original FASTQ sequencing data. The code is " ./bamtofastq_linux --nthreads 48 /data/jr/A15762/outs/possorted_genome_bam.bam /home/pc/A15762/"

My original FASTQ file is : image

The bam data after bamtofastq processed is : image Curiously, two folders containing FASTQ data appear.

The first folder screenshot is: image The second folder screenshot is: image

For this issue, Which FASTQ data should I use for reanalysis with CellRanger? I sincerely hope to get your help.

davemcg commented 3 years ago

Anyone from 10X can comment on this? This recently just tripped us up...according to Illumina the "001" ending is uninformative....what does it mean to 10x?

sbooeshaghi commented 3 years ago

What options are you running bamtofastq with? Could it be that the number of reads per fastq option is "splitting" the fastqs into 001... etc?

From src/main.rs:

https://github.com/10XGenomics/bamtofastq/blob/cfac1de4399fd1b7d1b03e0894ba52bb2035e304/src/main.rs#L74

If so then I think the solution would be to concatenate the 001... fastqs in order based on Read and Lane number.

(edit) alternatively you could set N really large, as suggested by @davemcg via twitter.

davemcg commented 3 years ago

We've been cating the files up to now to produce a R1 and R1 fastq pair. ....that should maintain the order. Even so...doesn't the scrambled kallisto results suggest some barcode shenanigans are happening?

These are all supposed to be the same sample....as long as the for/rev isn't getting mixed, then the order the files get cat'ed shouldn't matter?

davemcg commented 3 years ago

I've confirmed (see https://github.com/pachterlab/kb_python/issues/104) that when you run with the --reads-per-fastq flag set with a crazy big number you get the proper results.

Currently I believe there's a bug in bamtofastq (if it is spitting out N files per bam...they should generate a "proper" result if you feed them all to kallisto/etc) somehow. It seems that the barcodes are getting scrambled somehow.

In the short-term the developers should remove the --reads-per-fastq=N flag and simply output one triplet of fastq files.

pmarks commented 3 years ago

The final group of numbers in the fastq filename is the 'chunk' number. You get a new chunk for each group of N reads, where you can set N with --reads-per-fastq=N. Key point: you need to make sure that (You can make the old version Illumina bcl2fastq tool do this chunking, which is why we do it).

@davemcg You need to make sure that you cat together the R1, R2 in the same order as defined by the chunk number. The order is arbitrary, but it needs to be the same for both files. Does that make sense?

On the original question @Rui-Jing: the different folders correspond to different lanes in the original flowcell, which get packed into the BAM file with distnict BAM read groups. bamtofastq outputs the reads in different folders for each read group. You should pass both of the FASTQ paths to Cellranger, and it will use all the reads.

davemcg commented 3 years ago

If you follow the issue I placed with the kallisto developers, it REALLY APPEARS that the barcodes (at least in my example) are getting jumbled. I understand that kallisto working isn't a priority of yours, but I'd appreciate an explanation (full code chunks are given) of why running bamtofastq on a 10x bam (https://sra-download.ncbi.nlm.nih.gov/traces/sra43/SRZ/011680/SRR11680523/MouseACS6.bam), then running the fastq file pairs (NOT CONCATENATED) results in "mixing" with kalllisto.

https://github.com/pachterlab/kb_python/issues/104#issuecomment-790866635

pmarks commented 3 years ago

@davemcg probably the best sanity check is to see if read header (the line that start with @) is matched line-by-line between the corresponding R1 and R2 files. 'Corresponding' R1 and R2 files have the same filename, except for the R1 or R2 portion.

For example bamtofastq_s1_L2_R1_003.fastq and bamtofastq_s1_L2_R2_003.fastq always need to 'stay together' throughout the analysis -- so if you're passing files into Kallisto, you need to make sure they match up this way, or you will indeed see scrambling. (Your workaround of dumping everything into one file is great, because it means you don't need to worry about 'matching up' the FASTQs). Does that make sense?

If there's a bug in bamtofastq it will most likely show up as mismatched read headers in corresponding FASTQ files. Do you observe that?

davemcg commented 3 years ago

Command run:

kb count -t 12 -i index.idx -g t2g.txt -x 10xv2 -o kb_standard --workflow standard --filter bustools \
fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L001_R1_001.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L001_R2_001.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L002_R1_001.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L002_R2_001.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L003_R1_001.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L003_R2_001.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L001_R1_002.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L001_R2_002.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L002_R1_002.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L002_R2_002.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L003_R1_002.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L003_R2_002.fastq.gz

kb count takes the input as matched pairs. All space separated. So f1_r1_001.fq.gz f1_r2_001.fq.gz f1_r1_002.fq.gz f1_r2_002.fq.gz is how you would put in two pairs of files.

However, this invocation resulted in scrambled data (fwiw this also happens in alevin, a pseudoaligner from a different group).

When I ran the "001" and "002" pairs independently in kb count...the results are fine.

This suggests that the pairs are being set up properly. Otherwise kb count should return 0 (or near 0) pseudoaligned results.

Here the first header 10 line from the last pair of fastq files:

$ zcat fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L003_R1_002.fastq.gz  | head -n 100 | grep "^@" | head -n 10
@E00173:630:H7MJCCCXY:3:1220:16407:27029 1:N:0:0
@E00173:630:H7MJCCCXY:3:1220:16457:27714 1:N:0:0
@E00173:630:H7MJCCCXY:3:2215:6796:8236 1:N:0:0
@E00173:630:H7MJCCCXY:3:2122:16143:13597 1:N:0:0
@E00173:630:H7MJCCCXY:3:2202:5112:31265 1:N:0:0
@E00173:630:H7MJCCCXY:3:1204:27316:34043 1:N:0:0
@E00173:630:H7MJCCCXY:3:1219:14773:17694 1:N:0:0
@E00173:630:H7MJCCCXY:3:2108:5152:11998 1:N:0:0
@E00173:630:H7MJCCCXY:3:1123:8288:1713 1:N:0:0
@E00173:630:H7MJCCCXY:3:1124:26666:2540 1:N:0:0
$ zcat fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L003_R2_002.fastq.gz  | head -n 100 | grep "^@" | head -n 10
@E00173:630:H7MJCCCXY:3:1220:16407:27029 3:N:0:0
@E00173:630:H7MJCCCXY:3:1220:16457:27714 3:N:0:0
@E00173:630:H7MJCCCXY:3:2215:6796:8236 3:N:0:0
@E00173:630:H7MJCCCXY:3:2122:16143:13597 3:N:0:0
@E00173:630:H7MJCCCXY:3:2202:5112:31265 3:N:0:0
@E00173:630:H7MJCCCXY:3:1204:27316:34043 3:N:0:0
@E00173:630:H7MJCCCXY:3:1219:14773:17694 3:N:0:0
@E00173:630:H7MJCCCXY:3:2108:5152:11998 3:N:0:0
@E00173:630:H7MJCCCXY:3:1123:8288:1713 3:N:0:0
@E00173:630:H7MJCCCXY:3:1124:26666:2540 3:N:0:0

(thanks for your help - I am very open to the fact I could have borked this....but I haven't out why yet - the only thing that seems to fix my issue is to prevent bamtofastq from creating multiple files)

teng-gao commented 3 years ago

Hi I still don't understand why there are two folders? I've been recently experiencing the same issue

pmarks commented 3 years ago

@teng-gao there is a separate output folder for each flowcell and lane of original FASTQs. You need to make sure that the R1/R2 pair within each directory stay properly paired together. Going to close this issue, since we were able to validate that the fastqs were properly paired by manual inspection.

howtofindme commented 1 year ago

Hi, Thanks for developing this powerful tool. But when I processed the BAM file from the CellRanger pipeline with bamtofastq, the results didn't match the original FASTQ sequencing data. The code is " ./

hi,have you figured it ou? which folder do you use to do cellranger count?

howtofindme commented 1 year ago

@teng-gao there is a separate output folder for each flowcell and lane of original FASTQs. You need to make sure that the R1/R2 pair within each directory stay properly paired together. Going to close this issue, since we were able to validate that the fastqs were properly paired by manual inspection.

hi, I have the sampe problems, which folder do you use to do cellranger count?

pmarks commented 1 year ago

@howtofindme if you get multiple output folders from bamtofastq, you should use all of them to re-run cellranger. You can specify a comma-separated list of paths to the --fastqs argument (see docs here). They are usually in different paths if the original fastq data originated from multiple flowcell runs.

howtofindme commented 1 year ago

@howtofindme if you get multiple output folders from bamtofastq, you should use all of them to re-run cellranger. You can specify a comma-separated list of paths to the --fastqs argument (see docs here). They are usually in different paths if the original fastq data originated from multiple flowcell runs.

thank very much! I benefit a lot @pmarks