Closed kaizhang closed 8 months ago
🙋‍♂️ This is a good idea. I have also found our format and the 10X format to be incompatible at times. In some cases I manually changed the barcode and UMI location with pysam, but reprocessing a big file that way takes a long time and is inefficient. Output compatible with the 10X format would be better.
My understanding:
1. I assume reachtools puts barcodes in the read name at the very beginning because this is easier to handle, as reachtools reads the BAM file in as plain text during processing. We can also add them to a barcode tag; I will try it.
2. During pre-processing we currently remove unaligned reads for DNA and retain only primary alignments for RNA. This is done by bowtie2 / STAR, independent of reachtools. We can certainly keep them.
Hi @kaizhang,
This is a good idea. Yes, as @Xieeeee pointed out, I use reachtools to put barcodes in the read name at the very beginning so that bowtie/STAR can retain this information. We can add one additional step (such as with pysam) to convert the format to be 10X-compatible. The second point @Xieeeee mentioned is also correct; we can certainly keep those reads by fine-tuning the parameters of the mappers.
Following up on this. I currently have a script that adds the barcodes and UMIs to create a 10X-format file; we can include it in the pipeline. I think bamtools could probably also do this in C++ to make it more seamless, but I don't know how to implement that yet. PT2TXG.txt
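For readers without the attached script, here is a minimal sketch of the conversion idea, not the script itself. It assumes the barcode and UMI are the last fields of the read name (the layout and both regular expressions are hypothetical) and appends them as the `CB` and `UB` tags that 10X BAM files use. It works on SAM text lines so that it needs only the standard library; a real pipeline would more likely use pysam or bamtools.

```python
import re

# Hypothetical read-name layout: <original name>:<barcode>:<UMI>
# where the barcode is four 2-character fields joined by ':'.
BARCODE_RE = re.compile(r"(..:..:..:..):\w+$")
UMI_RE = re.compile(r":(\w+)$")


def add_10x_tags(sam_line: str) -> str:
    """Append CB/UB tags parsed from the read name of one SAM record.

    Header lines and records whose name does not match are returned
    unchanged (minus the trailing newline).
    """
    line = sam_line.rstrip("\n")
    if line.startswith("@"):
        return line  # SAM header line, nothing to tag

    fields = line.split("\t")
    barcode = BARCODE_RE.search(fields[0])  # field 0 is QNAME
    umi = UMI_RE.search(fields[0])
    if barcode is None or umi is None:
        return line  # leave unrecognised read names untouched

    # Optional SAM fields are TAG:TYPE:VALUE; 'Z' marks a string value.
    fields.append(f"CB:Z:{barcode.group(1)}")
    fields.append(f"UB:Z:{umi.group(1)}")
    return "\t".join(fields)
```

This could be used as a filter between `samtools view -h` and `samtools view -b`. Note that 10X barcodes usually carry a `-1` style suffix; whether downstream tools need it is not covered here.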
Thank you so much!
Update on 240405: snapatac2.pp.make_fragment_file now supports creating fragments from BAM files by matching the barcode and UMI in the read name (using barcode_regex and umi_regex), so I am closing this one, since there is no need to create 10X-compatible BAM files.
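To illustrate what matching by regex means here, the sketch below uses Python's `re` module to pull a barcode and a UMI out of a read name with one capturing group each. The read name and both patterns are illustrative assumptions about the layout; the helper `extract` is hypothetical and is not part of SnapATAC2.

```python
import re
from typing import Optional


def extract(read_name: str, pattern: str) -> Optional[str]:
    """Return the text captured by the single group in `pattern`, or None."""
    regex = re.compile(pattern)
    if regex.groups != 1:
        raise ValueError("pattern must contain exactly one capturing group")
    match = regex.search(read_name)
    return match.group(1) if match else None


# Illustrative read name: the barcode (four 2-character fields) and the
# UMI are assumed to be the last fields, separated by ':'.
READ_NAME = "A01535:24:HW2MMDSX2:2:1359:8513:3458:bd:69:Y6:10:TGATAGGTTG"
BARCODE_REGEX = r"(..:..:..:..):\w+$"
UMI_REGEX = r":(\w+)$"

if __name__ == "__main__":
    print(extract(READ_NAME, BARCODE_REGEX))  # barcode part of the name
    print(extract(READ_NAME, UMI_REGEX))      # UMI part of the name
```

Patterns like these would then be passed as the `barcode_regex` and `umi_regex` arguments of `snapatac2.pp.make_fragment_file`, alongside the input BAM path and the output fragment file path. Check the patterns against a few real read names first, since the regex engine SnapATAC2 uses internally may differ from Python's in edge cases.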
Hi @cxzhu,
@Xieeeee and I are trying to streamline the analysis pipeline for Paired-Tag data. The ideal workflow for Paired-Tag I have in mind is:
SnapATAC2 is now capable of converting raw bam files (unfiltered, unsorted) into fragment files, which can then be used by SnapATAC2 to generate count matrices for both RNA and ATAC, and perform downstream analyses.
I think so far we are able to use the bam files generated by reach-tools for this purpose (see kaizhang/SnapATAC2#75). But it would be better if reach-tools can output bam files that are similar to 10X pipeline's output: