GreenleafLab / ArchR

ArchR : Analysis of Regulatory Chromatin in R (www.ArchRProject.com)
MIT License
384 stars 137 forks source link

Getting stuck counting barcodes #1885

Closed GrafZahl1234 closed 1 year ago

GrafZahl1234 commented 1 year ago

Hi all,

during creation of ArrowFiles, my PC gets stuck counting barcodes. One core just runs at full power but nothing changes. Neither is the Log updated beyond what I attached, nor does the ArrowFile increase in size. The temp arrow file in tmp is of a fitting size (roughly the size of the input bam). The issue was reported before, but was closed by the poster without sharing his solution (Issue301). I have tried updating ArchR, but looks like I m up to date (sha1 has not changed since last install). My function call was:


  filterTSS <- 4
  filterFrags <- 1000
  sampleNames <- "skin_late_anagen"
  ATAC_bam <- "SRR10428396_skin.late.anagen.atac.bam"
  addArchRGenome("mm10")

 ArrowFiles <- createArrowFiles(
    inputFiles = c(ATAC_bam),
    sampleNames = sampleNames,
    filterTSS = filterTSS, #Dont set this too high because you can always increase later
    filterFrags = filterFrags,
    threads = 1,
    addTileMat = TRUE,
    addGeneScoreMat = TRUE
)

Data is from SRA. According to the link assembly is GRCm38.p6 C57BL/6J so "mm10" should be correct? I wonder if the problem is ".bamToTmp Barcodes-Chunk-(1 of 105)-chr1:1-39094394, Class = NULL". Class is character with the toturial data.

Thank you in advance!

Additional context Add any other context about the problem here. ArchR-createArrows-54b2788857a8-Date-2023-03-17_Time-15-43-06.log

rcorces commented 1 year ago

Hi @GrafZahl1234! Thanks for using ArchR! Please make sure that your post belongs in the Issues section. Only bugs and error reports belong in the Issues section. Usage questions and feature requests should be posted in the Discussions section, not in Issues.
Before we help you, you must respond to the following questions unless your original post already contained this information: 1. If you've encountered an error, have you already searched previous Issues to make sure that this hasn't already been solved? 2. Can you recapitulate your error using the tutorial code and dataset? If so, provide a reproducible example. 3. Did you post your log file? If not, add it now. 4. Remove any screenshots that contain text and instead copy and paste the text using markdown's codeblock syntax (three consecutive backticks). You can do this by editing your original post.

GrafZahl1234 commented 1 year ago

I got a little further: When breaking down the single bam into chromosome-sized subfiles the count step gets finished in reasonable time, although without finding cells (see below). So maybe its a problem that counting takes excessively long on large datasets (96 gb for the full file). Is it possible to work with such broken down files? Because the same cell barcodes will be present in every single sub-file.

A second thing seems to be that the barcoding is odd in the file or maybe I just dont find the correct barcode tag. Opening it in Rsamtools does not show me the correct barcodes, just the very long ones seen in the ArchR log. But if I transfor bam to sam I can actually see the correct RG.

This is a line in the Sam A00794:88:HFMNWDRXX:2:1136:31494:19774_R1.07,R2.39,R3.67,P1.05 147 chrY 90844636 0 48M = 90843877 -807 GACGTTTTTGTTACGTAGTGAACTCTCCATAGTGGTAGCACGCTGAGC FF,F:FF,:F,F,,,,FFFF,,FFFF,:,FFF,F:F,:,,F,,:FF:FAS:i:-22 XS:i:-22 XN:i:0 XM:i:6 XO:i:0 XG:i:0 NM:i:6 MD:Z:0A4A1A2G4A16A15 YS:i:0 YT:Z:CP RG:Z:R1.07,R2.39,R3.67,P1.05

This is what Rsamtools considers the qname A00794:88:HFMNWDRXX:2:1136:31494:19774_R1.07,R2.39,R3.67,P1.05

This is the correct Barcode, which does not show up in RSamtools (or I am to supid to find it): R1.07,R2.39,R3.67,P1.05

I guess with the qname as Barcode the barcode table gets to a problematic size. So this thread turned from "possible bug" to "noob question about bam". How do I set the correct RG in ArchR? bcTag = "RG:Z:"? Or do I have to edit my sam/bam files somehow? Sam is basically text, so should work, and it seems like RSamtools cant write manipulated Bams.

Many thanks!

GrafZahl1234 commented 1 year ago

OK, while I found no elegant way to change qnames, I wrote a python script to solve the issue. I will post it here so the next person with the same Issue hopefully finds it helpful. Now it looks like Arrow files are created.

from pysam import AlignmentFile
import os

all_files = os.listdir(myworkdir)
bam_files = [file for file in all_files if file.endswith(".bam")]

for bf_name in bam_files:

    print("Next file: "+ bf_name)
    bf = AlignmentFile(bf_name, "rb") # rb: readbam?
    out_name = bf_name.replace(".bam", "_qnameupdate.bam")
    outbam = AlignmentFile(out_name, "wb", template = bf) # wb: writebam?
    print("creating: "+ out_name)

    readcount = 0

    for read1 in bf.fetch(until_eof=True):

        # here i just wanted the last 23 letters of the full qname
        read1.query_name = read1.query_name[-23:]
        outbam.write(read1)

        readcount += 1
        if readcount % 5000000 == 0:
            print(str(readcount) + " reads done")

    print("Done with " + out_name , ", closing files.")
    bf.close()
    outbam.close()
rcorces commented 1 year ago

@GrafZahl1234 - apologies for the delay in my reply. I think a thorough read of the function parameters for createArrowFiles() should provide most of what you were originally looking for. In particular:

bcTag | The name of the field in the input bam file containing the barcode tag information. See ScanBam in Rsamtools. -- | -- gsubExpression | A regular expression used to clean up the barcode tag string read in from a bam file. For example, if the barcode is appended to the readname or qname field like for the mouse atlas data from Cusanovic* and Hill* et al. (2018), the gsubExpression would be ":.*". This would retrieve the string after the colon as the barcode.