cxzhu / Paired-Tag

Analysis of Paired-Tag datasets
MIT License

Output 10X compatible BAM files #18

Closed kaizhang closed 8 months ago

kaizhang commented 2 years ago

Hi @cxzhu,

@Xieeeee and I are trying to streamline the analysis pipeline for Paired-Tag data. The ideal workflow for Paired-Tag I have in mind is:

  1. Use reach-tools to analyze the raw data (FASTQ) and generate aligned and barcode-included bam files.
  2. Start with the bam files, and use SnapATAC2 to do all downstream analysis.

SnapATAC2 is now capable of converting raw bam files (unfiltered, unsorted) into fragment files, which can then be used by SnapATAC2 to generate count matrices for both RNA and ATAC, and perform downstream analyses.

I think so far we are able to use the bam files generated by reach-tools for this purpose (see kaizhang/SnapATAC2#75). But it would be better if reach-tools could output bam files that are similar to the 10X pipeline's output:

  1. Store cell barcode and UMI using BAM tags. This is advantageous because read names are not the best place to store this information. Besides, read names are normally stripped away by GEO. In particular, 'CB' is used to store corrected barcodes and 'CR' is used to store original barcode sequences.
  2. Retain all reads in the BAM files, so that BAM files contain all the raw information from FASTQ files. People can directly use those BAM files if they want to access the raw data (no need to request the FASTQ files). And this also makes it easier for depositing data to a public database (we only need to upload the BAM files).
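
As a rough illustration of point 1, the sketch below moves a barcode and UMI from the read name into tags on a single SAM text record (for example, a line from `samtools view -h`). The read-name layout `<name>:<barcode>:<umi>`, the function name, and the whitelist argument are all assumptions made for this example and are not taken from reachtools. `CR` and `CB` follow the description above; `UR` is the tag the 10X pipeline uses for the raw UMI. No barcode error correction is attempted: `CB` is written only when the raw barcode is already in the whitelist.

```python
import re

# Hypothetical read-name layout: "<original_name>:<barcode>:<umi>".
# The real layout produced by reachtools may differ; adjust the pattern.
NAME_PATTERN = re.compile(
    r"^(?P<name>.+):(?P<barcode>[ACGTN]+):(?P<umi>[ACGTN]+)$"
)


def move_barcode_to_tags(sam_line, whitelist):
    """Copy barcode and UMI from the read name into CR/CB/UR tags.

    `whitelist` is a set of accepted barcodes (an assumption of this sketch).
    Header lines and records whose name does not match are returned unchanged.
    """
    line = sam_line.rstrip("\n")
    if line.startswith("@"):
        return line  # SAM header line
    fields = line.split("\t")
    match = NAME_PATTERN.match(fields[0])
    if match is None:
        return line
    barcode = match.group("barcode")
    fields[0] = match.group("name")       # restore the original read name
    fields.append("CR:Z:" + barcode)      # raw barcode sequence
    if barcode in whitelist:
        fields.append("CB:Z:" + barcode)  # accepted ("corrected") barcode
    fields.append("UR:Z:" + match.group("umi"))  # raw UMI sequence
    return "\t".join(fields)
```

Because the function works on one record at a time, it could sit between the aligner and `samtools view -b` in a pipe, so the whole file never needs to be held in memory.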

Xieeeee commented 2 years ago

🙋‍♂️ This is a good idea. I have also sometimes found our format incompatible with the 10X format. In some cases I manually move the barcode and UMI with pysam, but reprocessing large files this way takes a long time and is inefficient. Output compatible with the 10X format would be better.

My understanding:

  1. I assume reachtools puts barcodes in the read name from the very beginning because this is easier to handle, as reachtools reads the BAM file in as plain text during processing. We can also add them as a barcode tag; I will try it.
  2. We currently remove unaligned reads for DNA and retain only primary alignments for RNA during pre-processing. This is done by bowtie2 / STAR, independent of reachtools. We can certainly keep them.
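
To make point 2 concrete, here is a minimal sketch of how the record classes mentioned above can be told apart from the SAM FLAG field. The bit values come from the SAM specification; the function name, and the choice to treat "primary" as neither secondary nor supplementary, are assumptions of this example rather than a description of what the current pre-processing does.

```python
# FLAG bits defined by the SAM specification.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def classify_flag(flag):
    """Return a coarse class for a SAM FLAG value.

    Assumption of this sketch: a record counts as "primary" only if it is
    mapped and is neither a secondary nor a supplementary alignment.
    """
    if flag & UNMAPPED:
        return "unmapped"
    if flag & SECONDARY:
        return "secondary"
    if flag & SUPPLEMENTARY:
        return "supplementary"
    return "primary"
```

Counting records per class before and after filtering is a quick way to see how many reads a given mapper configuration drops.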

cxzhu commented 2 years ago

Hi @kaizhang,

This is a good idea. Yes, as @Xieeeee pointed out, I use reachtools to put barcodes in the read name at the very beginning so that bowtie/STAR can retain this information. We can add one additional step (for example with pysam) to convert the format to a 10X-compatible one. The second point @Xieeeee mentioned is also correct; we can certainly keep those reads by fine-tuning the parameters of the mappers.

Xieeeee commented 1 year ago

Following up on this. I currently have a script that adds the barcodes and UMI as tags to create a 10X-format file; we can include it in the pipeline. I think bamtools could probably also do this in C++ to make it more seamless, but I don't know how to implement that yet. PT2TXG.txt

Gavin-Lijy commented 1 year ago

> Following up on this. Currently, I have a script to add the barcodes and UMI to create a 10X format file, we can include this in the pipeline. I think bamtools probably can also do this in c++ to make it more seamless, but I don't know how to implement it now. PT2TXG.txt

Thank you so much!

Xieeeee commented 8 months ago

Update on 240405: snapatac2.pp.make_fragment_file now supports creating fragments from BAM files by matching the barcode and UMI in the read name (using barcode_regex and umi_regex), so I am closing this issue since there is no need to create 10X-compatible BAM files.
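
For readers unfamiliar with the regex approach, the sketch below shows how a single-capture-group pattern can pull the barcode and the UMI out of a read name. It uses Python's `re` module as a stand-in and does not call SnapATAC2. The read-name layout (four two-character barcode parts followed by the UMI, all colon-separated at the end of the name), the example read name, and the helper name are assumptions for illustration; check the read names in your own BAM files before reusing the patterns.

```python
import re

# Assumed layout: "...:<bc1>:<bc2>:<bc3>:<bc4>:<UMI>" at the end of the name.
# Each pattern has exactly one capture group, which marks the text to extract.
BARCODE_REGEX = r"(..:..:..:..):\w+$"
UMI_REGEX = r"..:..:..:..:(\w+)$"


def extract(pattern, read_name):
    """Return the first capture group of `pattern`, or None if it does not match."""
    match = re.search(pattern, read_name)
    return match.group(1) if match else None


name = "A01535:24:HW2MMDSX2:2:1359:8513:3458:bd:69:Y6:10:TGATAGGTTG"
print(extract(BARCODE_REGEX, name))  # bd:69:Y6:10
print(extract(UMI_REGEX, name))      # TGATAGGTTG
```

Patterns of this shape are what would be supplied as `barcode_regex` and `umi_regex`; the exact call signature and the regex dialect SnapATAC2 uses should be confirmed against its documentation.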