alexdobin / STAR

RNA-seq aligner
MIT License

Filter UMIs based on read counts #1257

Open rcosentino opened 3 years ago

rcosentino commented 3 years ago

Hi Alex,

We would like to filter UMIs based on the number of reads "supporting" them. Is there an option to do this within STARsolo? I could not find one. We were hoping to do it from the BAM file, but so far we have not been able to re-create the raw matrix from the BAM file. I read a previous question along the same lines in which you offered to share your script for re-creating the matrix from the BAM file. Could you share it with us?

Thanks,

Raúl

alexdobin commented 3 years ago

Hi Raúl

I just pushed the script soloCountMatrixFromBAM.awk into GitHub master branch.

samtools view Aligned.sortedByCoord.out.bam | awk -v fileWL=Solo.out/Gene/raw/barcodes.tsv -v fileGenes=Solo.out/Gene/raw/features.tsv -f /path/to/extras/scripts/soloCountMatrixFromBAM.awk | sort -k2,2n -k1,1n > mat.mtx

You need to add the GX, CB and UB tags to --outSAMattributes. It is pretty slow and can use a lot of memory, so I recommend trying it out on a small run first, 100k-1M reads.

Cheers Alex
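For the original question (filtering UMIs by read support), a minimal sketch that works directly on the text output of samtools view is shown here. It is not part of STAR and does not reproduce STARsolo's own counting logic (for example, multi-gene UMI handling), so its numbers may differ from the raw matrix. It assumes the BAM carries the CB, UB and GX tags, that unassigned values are written as "-", and that counting primary alignment records is an acceptable proxy for counting reads. The function names and the min_reads threshold are hypothetical.

```python
from collections import Counter


def get_tag(fields, tag):
    """Return the value of a SAM optional tag (e.g. 'CB'), or None if absent."""
    prefix = tag + ":"
    for field in fields[11:]:          # optional fields start at column 12
        if field.startswith(prefix):
            return field.split(":", 2)[2]   # fields look like TAG:TYPE:VALUE
    return None


def reads_per_umi(sam_lines):
    """Count primary alignment records per (cell barcode, UMI, gene)."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):       # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & 0x900:     # skip secondary/supplementary records
            continue
        key = tuple(get_tag(fields, t) for t in ("CB", "UB", "GX"))
        # Assumption: missing or unassigned tags appear as None or "-".
        if None in key or "-" in key:
            continue
        counts[key] += 1
    return counts


def umis_per_cell_gene(counts, min_reads):
    """Count UMIs per (cell barcode, gene) supported by >= min_reads records."""
    result = Counter()
    for (cb, ub, gx), n in counts.items():
        if n >= min_reads:
            result[(cb, gx)] += 1
    return dict(result)
```

One way to use it would be to pipe samtools view Aligned.sortedByCoord.out.bam into a small script that calls reads_per_umi(sys.stdin) and then umis_per_cell_gene(counts, min_reads=2). The counter keeps every (CB, UB, GX) combination in memory, so the same advice about testing on a small run first applies.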

marlmatos commented 9 months ago

Hi Alex, I have been reading all of your comments about this sort of issue. I think I have a general idea, but I am still confused. In my case, I cannot use the standard filtered count matrix from STARsolo because I want to pre-filter the BAM file to keep only reads that either do not have the WASP tag vW or passed WASP filtering (vW==1), and to keep only autosomes. I located the script soloCountMatrixFromBAM.awk, and I am also adding the GX, CB and UB tags to --outSAMattributes in the alignment. Having said that, my coding abilities are limited. How can I get the barcodes.tsv and features.tsv from the filtered BAM? I assume that after I create the raw barcodes.tsv, features.tsv and matrix.txt, I should pass them through soloBasicCellFilter.awk to get the filtered barcodes, like the original STARsolo output?

alexdobin commented 9 months ago

Hi @matosmr

The barcodes.tsv is the full list of barcodes, and features.tsv is the full list of genes, so you can copy them from the STARsolo run - no need to get it from the BAM.
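The pre-filtering described in the question (keep reads that have no vW tag or have vW equal to 1, on autosomes only) could be sketched as a small filter on SAM text. This is an illustration rather than a STAR script. It assumes human autosome names of the form 1-22 or chr1-chr22, which must be adjusted to match the reference used, and the function names are hypothetical.

```python
# Assumption: human autosomes, named either "1".."22" or "chr1".."chr22".
AUTOSOMES = {f"{prefix}{i}" for prefix in ("", "chr") for i in range(1, 23)}


def get_tag(fields, tag):
    """Return the value of a SAM optional tag (e.g. 'vW'), or None if absent."""
    prefix = tag + ":"
    for field in fields[11:]:          # optional fields start at column 12
        if field.startswith(prefix):
            return field.split(":", 2)[2]   # fields look like TAG:TYPE:VALUE
    return None


def keep_record(line, autosomes=AUTOSOMES):
    """Decide whether one SAM line survives the pre-filter."""
    if line.startswith("@"):           # always keep header lines
        return True
    fields = line.rstrip("\n").split("\t")
    if fields[2] not in autosomes:     # column 3 is the reference name
        return False
    vw = get_tag(fields, "vW")
    # Keep reads without a vW tag and reads that passed WASP (vW == 1).
    return vw is None or vw == "1"


def filter_sam(lines):
    """Yield only the SAM lines that pass keep_record."""
    return (line for line in lines if keep_record(line))
```

A script that ends with sys.stdout.writelines(filter_sam(sys.stdin)) could then sit between samtools view -h (to include the header) and samtools view -b -o filtered.bam, and the filtered BAM can be fed to soloCountMatrixFromBAM.awk together with the barcodes.tsv and features.tsv copied from the STARsolo run.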

marlmatos commented 9 months ago

Hi Alex, thanks for the clarification. I am running into problems with the script soloCountMatrixFromBAM.awk. I am getting the following errors:

/gs/gsfs0/home/marlrodrig/aging_project/scRNAseq/scripts/AS_scrnaseq_preprocessing_v2/soloCountMatrixFromBAM.awk: line 6: syntax error near unexpected token `tag'
/gs/gsfs0/home/marlrodrig/aging_project/scRNAseq/scripts/AS_scrnaseq_preprocessing_v2/soloCountMatrixFromBAM.awk: line 6: `function getTag(tag)'