alexdobin / STAR

RNA-seq aligner
MIT License
1.87k stars 506 forks source link

STARsolo for paired end + CellBC/UMI #558

Closed Gvaihir closed 4 years ago

Gvaihir commented 5 years ago

Hi Alex,

Not really an issue, but rather a question - is there a chance to pass 3 fastq files to STARsolo to analyze 2 paired end reads and 1 cellBC+UMI?

Thanks ahead! Cheers, G

alexdobin commented 5 years ago

Hi Gvaihir,

no, this is impossible at the moment. If there a scRNA-seq protocol that produces this kind of data. In principle, I can implement such a configuration.

Cheers Alex

Gvaihir commented 5 years ago

Hi Alex, Thanks for reply! Sorry I forgot to specify, I was thinking about 10x ATAC-seq pipeline. Would that be possible?

Thanks! G

alexdobin commented 5 years ago

Hi Gvaihir,

scATAC-seq pipeline is more complicated than scRNA-seq. Instead of genes, we have peaks, which have to be called based on the data itself, and also each peak in each cell has very few reads. So the output of STARsolo has to be compatible with downstream peak calling etc., which makes it a much harder task.

Cheers Alex

Gvaihir commented 5 years ago

Hi Alex, I understand that. How about prepping just a set of bam files for samples separately, and for the entire data in parallel. Then do peak calling and counts with third party packages. Would you assume that’s still might be faster than cellranger? To test that it would be great to get STARsolo working with paired end data. Thanks! Cheers, G

alexdobin commented 5 years ago

Hi Gvaihir

I have not run CellRanger pipeline, so not sure which step is the most time consuming in their implementation. I could add an option of processing the cell barcodes only (with 1 mismatch error correction) and writing the CB cell barcode tag into the BAM file. Then the downstream pipeline needs to be aware of this tag, so that the reads can be counted towards different cells.

Cheers Alex

Gvaihir commented 5 years ago

Hi Alex,

Hmmm I think it's a great idea! Let me know if there is any additional info I could provide.

Looking forward to this update!

Cheers, G

MarleyCodes commented 5 years ago

Hi Alex,

I would also benefit from being able to run STARsolo with paired-end data. Are you still planning to implement this functionality? If so, any idea when you can implement this?

Sincerely,

Marley

alexdobin commented 5 years ago

Hi Marley,

I am planning to do it, however, the processing depends on the actual protocol, such as scATAC-seq. What application do you have in mind?

Cheers Alex

tingchiafelix commented 5 years ago

Hi Alex, I'd like to follow up this question. In our end, we're trying to replace the cellranger count by STARSolo becasue STARsolo might run faster than cellranger. However, we're not able to pass 2 paired end reads and 1 cellBC+UMI through STARSolo as Gvaihir mentioned above. I'm wondering are you guys still doing this?

Best, TC

alexdobin commented 5 years ago

Hi TC,

I am planning to do this, but I need to understand how the sequenced reads look like in your specific application.

Cheers Alex

tingchiafelix commented 5 years ago

Hi Alex, Thanks for your response. Attached is the zip file that compresses three sample fastq files.

Best, TC fastq.zip

alexdobin commented 5 years ago

Hi All,

the ability to process paired-end reads and barcode read has been implemented in soloDevelop branch. Necessary options: --outSAMtype BAM Unsorted (and/or SortedByCoordinate) --outSAMattributes NH HI CR CB (and any other tags) --soloType CB_samTagOut
--soloCBmatchWLtype 1 --soloCBwhitelist /path/to/whitelist (not different list for ATAC-seq:737K-cratac-v1.txt) --readFilesIn cDNAread1 cDNAread2 barcodeRead

As far as I understand, for 10X ATAC-seq, you would use --readFilesIn R1 R3 R2 To prohibit splicing also need --alignIntronMax 1 --alignMatesGapMax 1000

Cheers Alex

Gvaihir commented 4 years ago

Hey Alex, I'm onto testing the paired end feature for scATAC - thanks so much for updating! Could you clarify a couple of things? 1) --soloType CB_samTagOut - from the source seems like it just looks on the CB tag, ignoring UMI (which is absent in scATAC data) 2) If the UMI is absent - is deduplication performed by any means? E.g. Cell Ranger collapses all reads with identical R1 and R2 pairs.

Gvaihir commented 4 years ago

Hi Alex, I have run STARSolo from soloDevelop repo on my scATAC data. The run finished ok, generated BAM, logs and some stats, but Solo.out contains no count matrices, only Barcodes.stats. Logs report successfully finished pipeline, no errors. Arguments:

STAR --genomeDir $genomeOut \
    --readFilesIn $r1 $r2 $rbc \
    --readFilesCommand zcat \
    --soloType CB_samTagOut \
    --soloCBwhitelist $whitelist \
    --soloCBmatchWLtype 1MM \
    --outSAMattributes NH HI CR CB \
    --outSAMtype BAM SortedByCoordinate \
    --outFileNamePrefix $out

Any suggestions deeply appreciated!

Also if you could comment on my earlier 2 questions.

Cheers, Gvaihir

alexdobin commented 4 years ago

Hi Gvaihir,

STARsolo does not yet generate the counts for scATAC. At the moment, it only maps the reads and adds the CB tag with cell barcode sequence. You would need to use downstream tools to get the quantification for scATAC. For instance, with snapATAC you need to transform the SAM/BAM file in the following way: https://github.com/r3fang/SnapATAC/wiki/FAQs#cellranger_output

Cheers Alex

Gvaihir commented 4 years ago

I see! thanks Alex! Closing the issue at this point.

ghuls commented 4 years ago

@alexdobin

I tried the last git version of STAR and tried to map a sample with 3 FASTQ files.

I got a segfault. After shortening the FASTQ files, he mapping didn't crash when taking the first 502623 reads, but did crash with the first 502624 reads.

I then took the last 3 reads of this 502624 reads FASTQ file and made the testcase below with the 3 reads which reproduce the segfault.

/staging/leuven/stg_00002/lcb/ghuls/software/STAR/source/STAR \
    --genomeDir $genomedir \
    --readFilesCommand zcat \
    --runMode alignReads \
    --runThreadN 1 \
    --readFilesIn /tmp/star_atac_R{1,3,2}.short.fastq.gz \
    --soloType CB_samTagOut \
    --soloCBwhitelist $whitelist \
    --soloCBmatchWLtype 1MM \
    --soloBarcodeReadLength 0 \
    --outSAMattributes NH HI AS nM CB CR CY \
    --outSAMtype BAM SortedByCoordinate \
    --outFileNamePrefix star_atac

Jul 07 14:52:18 ..... started STAR run
Jul 07 14:52:18 ..... loading genome
Jul 07 14:52:34 ..... started mapping
Segmentation fault
# FASTQ files.
❯ zcat /tmp/star_atac_R1.short.fastq.gz
@A01044:19:HLYKFDRXX:1:2117:21386:30264 1:N:0:ACTCAGAC
GTGCTTCCTGGTGCCCAGGTTTCTCACAGCAGCAGCTTCTCTTGCTGTG
+
F,FFFFFFFFFFFFFFFF:FFFFFFFFF:FFFFFF:F:F:FFFFFFFFF
@A01044:19:HLYKFDRXX:1:2117:24189:30264 1:N:0:ACTCAGAC
GTATCAGGAACCTTAGGGGATGGGCCTCCAGACTTGCTTGTAGGAGGTA
+
FFFFFFFFFFFF:,F,F:FFFFFFFFFF,FFFFFFFFFFFFFFFFFFFF
@A01044:19:HLYKFDRXX:1:2117:27932:30264 1:N:0:ACTCAGAC
GGTCTGGAGAGGGGCGGAGCCTTTCTCCCAACAAGCGAGTCAGAAAGAG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

❯ zcat /tmp/star_atac_R2.short.fastq.gz
@A01044:19:HLYKFDRXX:1:2117:21386:30264 2:N:0:ACTCAGAC
TCAAGACTCCCGGGTA
+
FFF:F,FFFFFFFFF:
@A01044:19:HLYKFDRXX:1:2117:24189:30264 2:N:0:ACTCAGAC
CGCACAGAGGAGACTC
+
FFFFFFFFFFF:FF,F
@A01044:19:HLYKFDRXX:1:2117:27932:30264 2:N:0:ACTCAGAC
TTACTTGTCGATGCAT
+
FFFFFFFFFFFFFFFF

❯ zcat /tmp/star_atac_R3.short.fastq.gz
@A01044:19:HLYKFDRXX:1:2117:21386:30264 3:N:0:ACTCAGAC
CATGGGTTCTATCCCTGGTCCAGGAAGATCTCACATGTTGCAAGGCAA
+
FF,:FFFFFFFFFF:FF:FFF,FFFFFFFFFFFF::F,FFFFFFFFFF
@A01044:19:HLYKFDRXX:1:2117:24189:30264 3:N:0:ACTCAGAC
GACATTCATCATTTACCTTTCTCTACTCATGGGTTTGAGACGTTAAAG
+
F:F:FFFFFFFF,FF:FFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF
@A01044:19:HLYKFDRXX:1:2117:27932:30264 3:N:0:ACTCAGAC
CGTCACCTCTTACATAATAAAAAGATACCAGAAAGGCTCCGCCCCTCT
+
FFFFFF:FFFFFF,FF:FFFF,,F,F,:,FFFF,:,FFFFF:FFFFFF
alexdobin commented 4 years ago

Hi Gert,

I mapped these reads to a human genome and did not get the segfault? Could you send me FASTA, GTF and parameters you used to create the genome?

Thanks! Alex

fderop commented 4 years ago

Hi @alexdobin,

Answering in place of Gert, who is helping me on this issue. I had not realised that the unorthodox manner in which the genome was generated could be the source of our issues. The STAR 2.7.5 genome was generated using the genome.fa and regulatory.gff from the 10x Cellranger-atac 1.2.0 GRCh38 reference:

curl -O https://cf.10xgenomics.com/supp/cell-atac/refdata-cellranger-atac-GRCh38-1.2.0.tar.gz

Since our dataset is paired-end ATAC, I wanted to see if STARsolo could count reads in regulatory features. I therefore modified the regulatory.gff file supplied in the 10x reference by replacing all feature names by the generic name regulatory:

awk -F '\t' -v 'OFS=\t' '{$3 = "regulatory"; print $0;}' /ddn1/vol1/staging/leuven/stg_00002/lcb/fderop/data/00000000_res/GrCh38_STAR_2.7.5_atac/regulatory.gff > /ddn1/vol1/staging/leuven/stg_00002/lcb/fderop/data/00000000_res/GrCh38_STAR_2.7.5_atac/regulatory_modified_names.gff

The genome was then generated as follows, using the 10x genome.fa and modified regulatory.gff:

/staging/leuven/stg_00002/lcb/ghuls/software/STAR/source/STAR \
    --runThreadN 24 \
    --runMode genomeGenerate \
    --genomeDir /ddn1/vol1/staging/leuven/stg_00002/lcb/fderop/data/00000000_res/GrCh38_STAR_2.7.5_atac \
    --genomeFastaFiles /ddn1/vol1/staging/leuven/stg_00002/lcb/fderop/data/00000000_res/GrCh38_STAR_2.7.5_atac/genome.fa \
    --sjdbGTFfile /ddn1/vol1/staging/leuven/stg_00002/lcb/fderop/data/00000000_res/GrCh38_STAR_2.7.5_atac/regulatory_modified_names.gff \
    --sjdbGTFfeatureExon regulatory \
    --sjdbOverhang 100

Then, STAR was ran as follows:

r1="fastq/BAP__1d7912__10x_scATAC_PBMC_3030663_3030456_replicate_1_S2_L001_R1_001.fastq.gz";
r2="fastq/BAP__1d7912__10x_scATAC_PBMC_3030663_3030456_replicate_1_S2_L001_R2_001.fastq.gz";
r3="fastq/BAP__1d7912__10x_scATAC_PBMC_3030663_3030456_replicate_1_S2_L001_R3_001.fastq.gz";
out="demultiplexed/vib1";
whitelist="/ddn1/vol1/staging/leuven/stg_00002/lcb/fderop/data/00000000_res/10x_atac_wl/737K-cratac-v1.txt";
genomedir="/ddn1/vol1/staging/leuven/stg_00002/lcb/fderop/data/00000000_res/GrCh38_STAR_2.7.5_atac";
/staging/leuven/stg_00002/lcb/ghuls/software/STAR/source/STAR \
    --genomeDir $genomedir \
    --readFilesCommand 'gzip -c -d' \
    --runMode alignReads \
    --runThreadN 24 \
    --readFilesIn $r1 $r3 $r2 \
    --soloType CB_samTagOut \
    --sjdbGTFfile /ddn1/vol1/staging/leuven/stg_00002/lcb/fderop/data/00000000_res/GrCh38_STAR_2.7.5_atac/regulatory_modified_names.gff \
    --sjdbGTFfeatureExon regulatory \
    --soloCBwhitelist $whitelist \
    --soloCBmatchWLtype 1MM \
    --soloBarcodeReadLength 0 \
    --outSAMattributes NH HI CR CB \
    --outSAMtype BAM SortedByCoordinate \
    --outFileNamePrefix $out

I am currently rebuilding the genome using the genes.gtf instead of regulatory.gff and will update if that makes a difference to the mapping.

fderop commented 4 years ago

Update: same error when using genes.gtf to build the STAR genome:

mkdir /ddn1/vol1/staging/leuven/stg_00002/lcb/fderop/data/00000000_res/GrCh38_STAR_2.7.5_rna
/staging/leuven/stg_00002/lcb/ghuls/software/STAR/source/STAR \
    --runThreadN 24 \
    --runMode genomeGenerate \
    --genomeDir /ddn1/vol1/staging/leuven/stg_00002/lcb/fderop/data/00000000_res/GrCh38_STAR_2.7.5_rna \
    --genomeFastaFiles /ddn1/vol1/staging/leuven/stg_00002/lcb/fderop/data/00000000_res/GrCh38_STAR_2.7.5_rna/genome.fa \
    --sjdbGTFfile /ddn1/vol1/staging/leuven/stg_00002/lcb/fderop/data/00000000_res/GrCh38_STAR_2.7.5_rna/genes.gtf \
    --sjdbGTFfeatureExon exon \
    --sjdbOverhang 100

Then run STARsolo alignReads:

r1="fastq/BAP__1d7912__10x_scATAC_PBMC_3030663_3030456_replicate_1_S2_L001_R1_001.fastq.gz";
r2="fastq/BAP__1d7912__10x_scATAC_PBMC_3030663_3030456_replicate_1_S2_L001_R2_001.fastq.gz";
r3="fastq/BAP__1d7912__10x_scATAC_PBMC_3030663_3030456_replicate_1_S2_L001_R3_001.fastq.gz";
out="demultiplexed/vib1";
whitelist="/ddn1/vol1/staging/leuven/stg_00002/lcb/fderop/data/00000000_res/10x_atac_wl/737K-cratac-v1.txt";
genomedir="/ddn1/vol1/staging/leuven/stg_00002/lcb/fderop/data/00000000_res/GrCh38_STAR_2.7.5_rna";
/staging/leuven/stg_00002/lcb/ghuls/software/STAR/source/STAR \
    --genomeDir $genomedir \
    --readFilesCommand 'gzip -c -d' \
    --runMode alignReads \
    --runThreadN 24 \
    --readFilesIn $r1 $r3 $r2 \
    --soloType CB_samTagOut \
    --sjdbGTFfile /ddn1/vol1/staging/leuven/stg_00002/lcb/fderop/data/00000000_res/GrCh38_STAR_2.7.5_rna/genes.gtf \
    --sjdbGTFfeatureExon exon \
    --soloCBwhitelist $whitelist \
    --soloCBmatchWLtype 1MM \
    --soloBarcodeReadLength 0 \
    --outSAMattributes NH HI CR CB \
    --outSAMtype BAM SortedByCoordinate \
    --outFileNamePrefix $out
Jul 20 16:03:53 ..... started STAR run
Jul 20 16:03:53 ..... loading genome
Jul 20 16:04:06 ..... processing annotations GTF
Jul 20 16:04:14 ..... inserting junctions into the genome indices
Jul 20 16:05:31 ..... started mapping

gzip: stdout: Broken pipe

gzip: stdout: Broken pipe

gzip: stdout: Broken pipe
Segmentation fault

: 139

When running on the unzipped .fastq:

r1="fastq/vib1_R1_part.fastq";
r2="fastq/vib1_R2_part.fastq";
r3="fastq/vib1_R3_part.fastq";
out="demultiplexed/vib1";
whitelist="/ddn1/vol1/staging/leuven/stg_00002/lcb/fderop/data/00000000_res/10x_atac_wl/737K-cratac-v1.txt";
genomedir="/ddn1/vol1/staging/leuven/stg_00002/lcb/fderop/data/00000000_res/GrCh38_STAR_2.7.5_rna";
/staging/leuven/stg_00002/lcb/ghuls/software/STAR/source/STAR \
    --genomeDir $genomedir \
    --runMode alignReads \
    --runThreadN 24 \
    --readFilesIn $r1 $r3 $r2 \
    --soloType CB_samTagOut \
    --sjdbGTFfile /ddn1/vol1/staging/leuven/stg_00002/lcb/fderop/data/00000000_res/GrCh38_STAR_2.7.5_rna/genes.gtf \
    --sjdbGTFfeatureExon exon \
    --soloCBwhitelist $whitelist \
    --soloCBmatchWLtype 1MM \
    --soloBarcodeReadLength 0 \
    --outSAMattributes NH HI CR CB \
    --outSAMtype BAM SortedByCoordinate \
    --outFileNamePrefix $out
Jul 20 16:54:22 ..... started STAR run
Jul 20 16:54:22 ..... loading genome
Jul 20 16:54:35 ..... processing annotations GTF
Jul 20 16:54:42 ..... inserting junctions into the genome indices
Jul 20 16:55:57 ..... started mapping
Segmentation fault

: 139
alexdobin commented 4 years ago

Hi Florian,

did you also get a seg-fault on the small 3-read fastq that Gert send me before? If not, could you create a small subset of reads (<100k) that still causes the seg-fault, and send it to me? In the latest run, you used the standard 10X fasta and gtf?

Cheers Alex

ghuls commented 4 years ago

@alexdobin With the new index I don't get a crash for the 3 read FASTQ file anymore.

The new crash is harder to minimize.

nbr_fastq_lines=1300000;
nbr_fastq_lines_to_keep=26000;
nbr_fastq_lines_to_increase=10000;

status=0;

while [ $status -eq 0 ] ; do
    echo $nbr_fastq_lines $nbr_fastq_lines_to_keep $nbr_fastq_lines_to_increase;

    # Create shorter FASTQ files.
    zcat $r1 | head -n ${nbr_fastq_lines} | tail -n ${nbr_fastq_lines_to_keep} | gzip -2 > /tmp/$(basename ${r1});
    zcat $r2 | head -n ${nbr_fastq_lines} | tail -n ${nbr_fastq_lines_to_keep} | gzip -2 > /tmp/$(basename ${r2});
    zcat $r3 | head -n ${nbr_fastq_lines} | tail -n ${nbr_fastq_lines_to_keep}
 | gzip -2 > /tmp/$(basename ${r3});

     ./STAR \
        --genomeLoad LoadAndKeep \
        --limitBAMsortRAM 10000000000 \
        --genomeDir $genomedir \
        --readFilesCommand 'gzip -c -d' \
        --runMode alignReads \
        --runThreadN 1 \
        --readFilesIn /tmp/$(basename ${r1}) /tmp/$(basename ${r3}) /tmp/$(basename ${r2}) \
        --soloType CB_samTagOut \
        --sjdbGTFfeatureExon exon \
        --soloCBwhitelist $whitelist \
        --soloCBmatchWLtype 1MM \
        --soloBarcodeReadLength 0 \
        --outSAMattributes NH HI CR CB \
        --outSAMtype BAM SortedByCoordinate \
        --outFileNamePrefix $out;

    status=$?;

    nbr_fastq_lines=$((${nbr_fastq_lines} + ${nbr_fastq_lines_to_increase}));

done

1300000 26000 10000
Jul 28 11:41:36 ..... started STAR run
Jul 28 11:41:37 ..... loading genome
Jul 28 11:41:37 ..... started mapping
Segmentation fault

# When only keeping 25000 lines per FASTQ, there is no segmentation fault:
1300000 25000 10000
Jul 28 11:41:08 ..... started STAR run
Jul 28 11:41:08 ..... loading genome
Jul 28 11:41:09 ..... started mapping
Jul 28 11:41:10 ..... finished mapping
Jul 28 11:41:10 ..... started sorting BAM
Jul 28 11:41:10 ..... finished successfully
1310000 25000 10000
Jul 28 11:41:11 ..... started STAR run
Jul 28 11:41:11 ..... loading genome
Jul 28 11:41:11 ..... started mapping
Jul 28 11:41:12 ..... finished mapping
Jul 28 11:41:12 ..... started sorting BAM
Jul 28 11:41:12 ..... finished successfully
1320000 25000 10000
Jul 28 11:41:13 ..... started STAR run
Jul 28 11:41:14 ..... loading genome
Jul 28 11:41:14 ..... started mapping
Jul 28 11:41:15 ..... finished mapping
Jul 28 11:41:15 ..... started sorting BAM

# Keeping 25036 lines from 1300000 is the maximum number of lines for which there is no segmentation fault.
1300000 25036 10000
Jul 28 11:57:07 ..... started STAR run
Jul 28 11:57:07 ..... loading genome
Jul 28 11:57:07 ..... started mapping
Jul 28 11:57:08 ..... finished mapping
Jul 28 11:57:08 ..... started sorting BAM
Jul 28 11:57:08 ..... finished successfully
1310000 25036 10000
Jul 28 11:57:09 ..... started STAR run
Jul 28 11:57:09 ..... loading genome
Jul 28 11:57:10 ..... started mapping
Jul 28 11:57:10 ..... finished mapping
Jul 28 11:57:10 ..... started sorting BAM
Jul 28 11:57:10 ..... finished successfully
1320000 25036 10000
Jul 28 11:57:11 ..... started STAR run
Jul 28 11:57:12 ..... loading genome
Jul 28 11:57:12 ..... started mapping
^C

# From 25040 lines, we get a segmentation fault.
1300000 25040 10000
Jul 28 11:57:20 ..... started STAR run
Jul 28 11:57:21 ..... loading genome
Jul 28 11:57:21 ..... started mapping
Segmentation fault

FASTQ files from line 1300000 till 1300000+25040: BAP1d791210x_scATAC_PBMC_3030663_3030456_replicate_1_S2_L001_R1_001.fastq.gz BAP1d791210x_scATAC_PBMC_3030663_3030456_replicate_1_S2_L001_R2_001.fastq.gz BAP1d791210x_scATAC_PBMC_3030663_3030456_replicate_1_S2_L001_R3_001.fastq.gz

ghuls commented 4 years ago

@alexdobin STAR index, barcode whitelist and FASTQ files can be found at: https://temp.aertslab.org/star_issue_558/

alexdobin commented 4 years ago

Hi Florian, Gert,

thanks a lot for tracking the bug. I fixed, please try the 2.7.5b release.

Cheers Alex

ghuls commented 4 years ago

Hi Alex, just tried the latest STAR version with the full FASTQ files and STAR finished successfully. Thanks a lot for fixing this bug.

alexdobin commented 4 years ago

Hi Gert,

thanks for testing it, and for the initial bug report. I think we can close the issue now. I would suggest starting a new issue in the future rather than reusing the old one.

Cheers Alex

14zac2 commented 2 years ago

Hi @ghuls - I'm trying to figure out how to run STAR on scATAC-seq data generated by the 10X multiome pipeline (just extracting the scATAC-seq data). I'm not sure how the barcodes are stored. Are your barcodes in star_atac_R2.short.fastq.gz? The first 16bp in my R2 files do not match the cell barcodes found by CellRanger. Thanks!

ghuls commented 2 years ago

Very likely you need to take the reverse complement of the 16bp barcode of star_atac_R2.short.fastq.gz if you want to match is manually with the 10x CellRangerATAC barcode list.