alexdobin / STAR

RNA-seq aligner
MIT License
1.77k stars 497 forks source link

issues with script soloCountMatrixFromBAM.awk #2060

Closed marlmatos closed 3 months ago

marlmatos commented 4 months ago

Hi Alex, Thanks for always being so helpful. I am trying to recreate the counts matrix of STARsolo with the soloCountMatrixFromBAM.awk that you have posted here. For context, I am filtering the AlignedSorted.out.bam for autosomes and reads that passed WASP and those that do not have the WASP SAMtAg. But i keep getting syntax errors with the awk script, and I cannot find where the error is coming from. Here is how I am using the script:

samtools view scGEX-A29.sortedByCoord.filtered.out.bam | awk -v fileWL=scGEX-A29.Solo.out/Gene/raw/barcodes.tsv \
-v fileGenes=scGEX-A29.Solo.out/Gene/raw/features.tsv \
-f soloCountMatrixFromBAM.awk | sort -k2,2n -k1,1n > scGEX-A29.matrix.mtx

And this is the error: awk: soloCountMatrixFromBAM.awk:60: Unexpected token I also redownloaded the script to make sure it was not a typo that I had introduced. But the error persists.

For reference, this is how I am running STAR

       STAR --runMode alignReads \\
        --genomeDir ${index_ch} \\
        --readFilesIn ${reverse.join( "," )} ${forward.join( "," )} \\
        --outFileNamePrefix ${meta.sample_ID}. \\
        --soloCBwhitelist <(gzip -cdf $barcode_whitelist) \\
        --soloType CB_UMI_Simple \\
        --soloFeatures Gene \\
        --soloUMIlen 12 \\
        --outSAMtype BAM SortedByCoordinate \\
        --soloCellFilter EmptyDrops_CR ${expected_cells} 0.99 10 45000 90000 500 0.01 20000 0.01 10000 \\
        --outSAMattrRGline ID:${meta.sample_ID} CN:NYGC SM:${meta.sample_ID} \\
        --readFilesCommand zcat \\
        --runDirPerm All_RWX \\
        --outWigType bedGraph \\
        --twopassMode Basic \\
        --waspOutputMode SAMtag \\
        --varVCFfile ${varVCF} \\
        --limitOutSJcollapsed 5000000 \\
        --limitSjdbInsertNsj 1500000 \\
        --outSAMattributes NH HI AS NM MD vA vG vW GX CB UB 

and another question: Is this the right way that you would go about this? I thought that using the raw/barcodes.tsv and raw/features.tsv would give me a raw matrix? But how do I get to the filtered /matrix.mtx? Shouuld I use the filtered/barcodes.tsv and filtered/features.tsv instead?

Since the soloCountMatrixFromBAM.awk was not working for me, I opted then to use the STARsolo mapped BAM as input again for STAR, for remapping. I am still dealing with errors on that part, but I figured if you can help me with this, I would not have to do that... Thanks again!

Marliette

ghuls commented 4 months ago

just looking at line 60: U[cb][ge][UB]++;, I assume that you need gawk for support for nested arrays.

samtools view scGEX-A29.sortedByCoord.filtered.out.bam | gawk -v fileWL=scGEX-A29.Solo.out/Gene/raw/barcodes.tsv \
-v fileGenes=scGEX-A29.Solo.out/Gene/raw/features.tsv \
-f soloCountMatrixFromBAM.awk | sort -k2,2n -k1,1n > scGEX-A29.matrix.mtx
alexdobin commented 4 months ago

This script is just an example of how to process the data. You may want to re-write it for your own needs, maybe in python.

marlmatos commented 4 months ago

Thanks, I will try that