bcgsc / RNA-Bloom

:hibiscus: reference-free transcriptome assembly for short and long reads
Other
96 stars 7 forks source link

3'UTR transgene identification from 10x scRNAseq reads #34

Closed auesro closed 2 years ago

auesro commented 2 years ago

Hi all,

I have a set of 10x scRNAseq (nuclei instead of cells, to be precise) datasets derived from a mouse line expressing several transgenes (sfGFP and Bgal, in this case). I only know (and only hypothetically) the sequence for the CDS of these transgenes. Given that 10x mRNA sequencing mostly targets the most 3' end of the 3'UTR of mRNAs, I need to be able to reconstruct as much as possible of the transgene sequence in order to correctly map reads (most of them targeting the 3'UTR) and identify positive nuclei expressing the transgenes.

Any suggestions?

Thanks!

A

kmnip commented 2 years ago

As mentioned previously, you need to trim the UMIs and barcodes entirely from the reads with umitools extract. After that, only right (_2.fastq) reads contain the sequences that can be assembled.

You have several options here. You can align the right reads (with bwa/minimap2) to either your CDS, reference transcripts, or the reference genome. For example, the gene for "Bgal" should be GLB1 (chr3:32,996,617-33,097,146 in GRCh38).

After alignment, you can extract the aligned reads with samtools fastq.

samtools fastq -F 0x4 alignment.bam > extracted_reads.fastq

You can assemble the extracted reads with RNA-Bloom:

java -jar RNA-Bloom.jar -sef extracted_reads.fastq -ntcard -t THREADS -outdir OUTDIR
auesro commented 2 years ago

A couple of doubts: -Read2 should not contain any UMI or Barcode right? I mean, my average insert size is 400 bp, read2 is 88 bases long...unlikely any of them included an insert small enough to get to the other side ,right? -If I align my reads to the CDS (thats the only thing I have by now) I will keep only those reads that map to the CDS and will discard reads mapping outside...precisely what I am looking for: those reads that map to the 3'UTR....or am I missing something?

kmnip commented 2 years ago

Based on the sample data from 10x, I don't think the right read (98-bp) contains any UMI/barcodes at all. But if you believe that yours do, please do remove them before alignment/assembly.

The insert sizes probably peak at 400 bp, but they should vary a bit.

Yes, you are correct about the CDS missing the UTR. That's why I suggested reference transcript and genome as alternatives.

auesro commented 2 years ago

The problem is that being a transgene (a sequence inserted artificially in the mouse genome in order to be expressed by select cells so we can identify them), the sequence is not present in any reference....I only know the CDS. I need a way to select the reads that map to the CDS and use those to "build" a transcript by adding more reads (specially in the direction 3' of the CDS)...thats why I initially thought Kollector could do it.

kmnip commented 2 years ago

You can just assemble the trimmed reads and then extract the transcripts that align to your CDS?

auesro commented 2 years ago

Mmm I guess so. You mean: -Discard read1 -Trim any UMI, cell barcode, etc from read2 -Perform de novo transcriptome assembly with ALL read2 using RNA-bloom -BLAST the assembled transcriptome against my CDS -Pick candidate transcripts and identify the 3'UTR from those long enough ?

kmnip commented 2 years ago

Yes, that seems like your only option because your left read most likely contain only UMIs + cell barcodes. Kollector would not work in this case because its recruitment step expects paired end reads where one paired read would anchor on a given sequence.

auesro commented 2 years ago

Alright. Hope RNA-bloom will not crash! Then my next question is: I have 2 samples (=libraries, from different animals) should I supply RNA-bloom with the 2 read2 files, or better do it one by one?

kmnip commented 2 years ago

Let me know if you run into any issues. If they are different animals, then you should assemble them separately.

auesro commented 2 years ago

Thanks a lot, Ka! Will keep you posted

auesro commented 2 years ago

Unfortunately, I ran into something...


(RNA_Bloom) [lab_eh_2@NeuroServer RNA_Bloom]$ rnabloom -fpr 0.01 -ser /home/LAB/lab_eh_2/AER_Data/snRNAseq/C1/trimmed/Clean_rRNA/A_S1_L001_R2_001_unpaired_norRNA_2.fq.gz /home/LAB/lab_eh_2/AER_Data/snRNAseq/C1/trimmed/Clean_rRNA/A_S1_L001_R2_001_val_norRNA_2.fq.gz -ntcard -t 30 -outdir output
RNA-Bloom v1.4.3
args: [-fpr, 0.01, -ser, /home/LAB/lab_eh_2/AER_Data/snRNAseq/C1/trimmed/Clean_rRNA/A_S1_L001_R2_001_unpaired_norRNA_2.fq.gz, /home/LAB/lab_eh_2/AER_Data/snRNAseq/C1/trimmed/Clean_rRNA/A_S1_L001_R2_001_val_norRNA_2.fq.gz, -ntcard, -t, 30, -outdir, output]

name:   rnabloom
outdir: output

K-mer counting with ntCard...
Running command: `ntcard -t 30 -k 25 -c 65535 -p output/rnabloom @output/rnabloom.ntcard.readslist.txt`...
Parsing histogram file `output/rnabloom_k25.hist`...
Unique k-mers (k=25):     1.990.801.340
Unique k-mers (k=25,c>1): 802.505.180
K-mer counting completed in 13m 57s

Bloom filters          Memory (GB)
====================================
de Bruijn graph:       4.399367
k-mer counting:        14.187311
paired k-mers (reads): 4.399367
paired k-mers (frags): 4.399367
screening:             4.399367
====================================
Total:                 31.784777

> Stage 1: Construct graph from reads (k=25)
Sampling read lengths (l>=25, n=1000) from each file...
        min     Q1      M       Q3      max     file
        61      87      88      88      88      /home/LAB/lab_eh_2/AER_Data/snRNAseq/C1/trimmed/Clean_rRNA/A_S1_L001_R2_001_unpaired_norRNA_2.fq.gz
        78      87      88      88      88      /home/LAB/lab_eh_2/AER_Data/snRNAseq/C1/trimmed/Clean_rRNA/A_S1_L001_R2_001_val_norRNA_2.fq.gz
Weighted [Q1,M,Q3]: [87, 88, 88]
Absolute [min,max]: [61, 88]
Paired k-mers distance: 52
Max. tip length: 63
Parsing `/home/LAB/lab_eh_2/AER_Data/snRNAseq/C1/trimmed/Clean_rRNA/A_S1_L001_R2_001_unpaired_norRNA_2.fq.gz`...
Parsed 100.549 sequences in 1.42s
Parsing `/home/LAB/lab_eh_2/AER_Data/snRNAseq/C1/trimmed/Clean_rRNA/A_S1_L001_R2_001_val_norRNA_2.fq.gz`...
Parsed 345.166.566 sequences in 1h 6m 3s
Parsed 345.267.115 sequences from 2 files.
DBG Bloom filter FPR:                 0.999 %
Counting Bloom filter FPR:            1.01 %
Reads paired k-mers Bloom filter FPR: 0.25 %
> Stage 1 completed in 1h 6m 50s

> Skipping Stage 2 for single-end reads.

> Stage 3: Assemble transcripts for "rnabloom"
ERROR: null
java.lang.NullPointerException
        at rnabloom.RNABloom$SingleEndReadsIterator.<init>(RNABloom.java:4670)
        at rnabloom.RNABloom.assembleSingleEndReads(RNABloom.java:4805)
        at rnabloom.RNABloom.assembleTranscriptsSE(RNABloom.java:5379)
        at rnabloom.RNABloom.main(RNABloom.java:7104)
(RNA_Bloom) [lab_eh_2@NeuroServer RNA_Bloom]$
´´´
Any ideas whats happening here? 
kmnip commented 2 years ago

It looks like you have found a bug. You can bypass it by using both -ser and -sef, i.e.

rnabloom -fpr 0.01 \
-ser /home/LAB/lab_eh_2/AER_Data/snRNAseq/C1/trimmed/Clean_rRNA/A_S1_L001_R2_001_unpaired_norRNA_2.fq.gz \
-sef /home/LAB/lab_eh_2/AER_Data/snRNAseq/C1/trimmed/Clean_rRNA/A_S1_L001_R2_001_val_norRNA_2.fq.gz \
-ntcard -t 30 -outdir output

Since your reads are not strand-specific, it doesn't matter whether you specify your reads with -ser or -sef.

I will fix this in the next release. Sorry about that!

auesro commented 2 years ago

Alright, good to know it was not something worse. I was already touching the JAVA options.

So you are saying the problem was I specified 2 input read files with only one argument, right? It should have been -ser X -sef Y, not -ser X Y. But then that means the bug is that you cannot, so far, provide more than 1 read file per "modality" -left, -right, -sef, -ser, correct?

Also, I was wondering if I could use the -mergepool option here if I only have single end reads: how should I specify the files from different samples in the readslist.txt file? Given I need to put them in the 4th column, do I just leave the columns 2 and 3 empty?

Sample1     sef_file ser_file
Sample2     sef_file ser_file
kmnip commented 2 years ago

No, the bug was that -ser and -sef had to be used together.

In in version 1.4.3, the -pool option doesn't support single-end reads (yet): https://github.com/bcgsc/RNA-Bloom/releases/tag/v1.4.3

support for single-end reads in -pool will be implemented in a future release

It will be available in the next release though.

auesro commented 2 years ago

Ok, thanks for the info! Is it possible to use the -mergepool parameter if I provide several manually assembled transcriptomes?

I am running it now with the first sample, fingers crossed!

auesro commented 2 years ago

Just wondering, is it normal that it will take more than 9 hours to assemble a transcriptome from 345.267.115 sequences using 30 threads? (Currently at "Reducing redundancy in assembled transcripts...").

kmnip commented 2 years ago

At this stage, it is reducing the redundancy in the assembled sequences to generate *.transcripts.nr.fa, but you can already use the sequences in *.transcripts.fa.

345 million reads is not a small amount data and assembly of single-end reads is less efficient than paired-end assembly. So, that is kind of expected.

auesro commented 2 years ago

Alright, thanks Ka!

Regarding output till now, I guess the transcripts.short.fa file contains the assembled transcripts smaller than certain threshold, right? Is it taking those into account too for the redundancy reduction?

kmnip commented 2 years ago

Yes, the default length threshold is 200. Those short sequences are not included in the redundancy reduction process.

auesro commented 2 years ago

I come back to report the assembly was quite succesful. However, now I am trying to assemble reads from a different experiment (traditional RNAseq, single end reads, downloaded from SRA) and face a similar issue to here:

(RNA_Bloom) [lab_eh_2@NeuroServer denovo_transcriptome]$ rnabloom -fpr 0.01 -sef /home/LAB/lab_eh_2/AER_Data/Chamessian/3/trimmed/SRR7039666_3_trimmed.fq.gz -ntcard -t 30 -outdir output
RNA-Bloom v1.4.3
args: [-fpr, 0.01, -sef, /home/LAB/lab_eh_2/AER_Data/Chamessian/3/trimmed/SRR7039666_3_trimmed.fq.gz, -ntcard, -t, 30, -outdir, output]

name:   rnabloom
outdir: output
WARNING: Output directory does not exist!
Created output directory at `output`

K-mer counting with ntCard...
Running command: `ntcard -t 30 -k 25 -c 65535 -p output/rnabloom @output/rnabloom.ntcard.readslist.txt`...
Parsing histogram file `output/rnabloom_k25.hist`...
Unique k-mers (k=25):     262.333.277
Unique k-mers (k=25,c>1): 129.744.533
K-mer counting completed in 31.585s

Bloom filters          Memory (GB)
====================================
de Bruijn graph:       0.57971644
k-mer counting:        2.2937248
paired k-mers (reads): 0.57971644
paired k-mers (frags): 0.57971644
screening:             0.57971644
====================================
Total:                 4.612591

> Stage 1: Construct graph from reads (k=25)
Sampling read lengths (l>=25, n=1000) from each file...
        min     Q1      M       Q3      max     file
        28      50      51      51      51      /home/LAB/lab_eh_2/AER_Data/Chamessian/3/trimmed/SRR7039666_3_trimmed.fq.gz
Weighted [Q1,M,Q3]: [50, 51, 51]
Absolute [min,max]: [28, 51]
Paired k-mers distance: 15
Max. tip length: 26
Parsing `/home/LAB/lab_eh_2/AER_Data/Chamessian/3/trimmed/SRR7039666_3_trimmed.fq.gz`...
Parsed 25.694.006 sequences in 2m 14s
DBG Bloom filter FPR:                 0.998 %
Counting Bloom filter FPR:            1.01 %
Reads paired k-mers Bloom filter FPR: 0.335 %
> Stage 1 completed in 2m 23s

> Skipping Stage 2 for single-end reads.

> Stage 3: Assemble transcripts for "rnabloom"
ERROR: null
java.lang.NullPointerException
        at rnabloom.RNABloom$SingleEndReadsIterator.<init>(RNABloom.java:4679)
        at rnabloom.RNABloom.assembleSingleEndReads(RNABloom.java:4805)
        at rnabloom.RNABloom.assembleTranscriptsSE(RNABloom.java:5379)
        at rnabloom.RNABloom.main(RNABloom.java:7104)

How do I specify the reads when I only have 1 file?

kmnip commented 2 years ago

Yes, that was the same bug as before. Sorry for the inconvenience. You need to split your FASTQ file in half and use -ser and -sef options together. For example, you can split the file like so:

zcat SRR7039666_3_trimmed.fq.gz | split -l 51400000 - partition_
mv partition_aa SRR7039666_3_trimmed.01.fq
mv partition_ab SRR7039666_3_trimmed.02.fq
gzip SRR7039666_3_trimmed.01.fq SRR7039666_3_trimmed.02.fq
rnabloom -fpr 0.01 -sef SRR7039666_3_trimmed.01.fq.gz -ser SRR7039666_3_trimmed.02.fq.gz -ntcard -t 30 -outdir output
auesro commented 2 years ago

Ok, that was the smart move. I just ran rnabloom with the same reads file twice like: rnabloom -fpr 0.01 -sef /home/LAB/lab_eh_2/AER_Data/Chamessian/3/trimmed/SRR7039666_3_trimmed.fq.gz -ser /home/LAB/lab_eh_2/AER_Data/Chamessian/3/trimmed/SRR7039666_3_trimmed.fq.gz -ntcard -t 30 -outdir output Is that crazy? The file is small so the assembly already finished, can I use those results or better not?

kmnip commented 2 years ago

It is okay if noise is not an issue. Since the read file is relatively small, I suggest you re-run the assembly with the split files.

auesro commented 2 years ago

Alright! Thanks!

kmnip commented 2 years ago

I just realized that I messed up the number for the example split command previously. 25,694,006 reads * 4 lines/read = 102,776,024 lines Split that in half would be 51,388,012 lines per partition. So, something like -l 51400000 would yield exactly two files.

auesro commented 2 years ago

Don´t worry, fixed it! Thanks!

auesro commented 2 years ago

Hi Ka,

Not sure if I should open a new issue or keep adding here, because this is a new error not discussed previously. If you prefer, I will close this and post as new issue. In the meantime:

I have been de novo assembling several public datasets. In this case from a RNAseq experiment (not single cell). The only difference I can see is that the size of the reads files is way bigger than previous (I have merged together 33 SRR single end files and then split in 2 as recommended above: each reads file is about 39GB). And I get this:

(RNA_Bloom) [lab_eh_2@NeuroServer trimmed]$ rnabloom -fpr 0.01 -sef *_00_trimmed.fq.gz -ser *_01_trimmed.fq.gz -ntcard -t 30 -outdir output
RNA-Bloom v1.4.3
args: [-fpr, 0.01, -sef, Chongtam_00_trimmed.fq.gz, -ser, Chongtam_01_trimmed.fq.gz, -ntcard, -t, 30, -outdir, output]

name:   rnabloom
outdir: output

K-mer counting with ntCard...
Running command: `ntcard -t 30 -k 25 -c 65535 -p output/rnabloom @output/rnabloom.ntcard.readslist.txt`...
ERROR: Error running ntCard!

Not a very informative error...any ideas??

kmnip commented 2 years ago

Hi, Can you please run this command and see whether you got an error from ntCard? ntcard -t 30 -k 25 -c 65535 -p output/rnabloom @output/rnabloom.ntcard.readslist.txt Thanks!

auesro commented 2 years ago

Mmmm this is what I get...

(RNA_Bloom) [lab_eh_2@NeuroServer trimmed]$  ntcard -t 30 -k 25 -c 65535 -p output/rnabloom@output/rnabloom.ntcard.readslist.txt

gzip: Chongtam_00_trimmed.fq.gz: unexpected end of file                                                                                  
PID 3694 exited with status 1     
kmnip commented 2 years ago

That means your fastq file Chongtam_00_trimmed.fq.gz is truncated.

auesro commented 2 years ago

Ok.

I made it by concatenating several single end fastqs, counting lines and dividing in half as previously advised. How could the file be truncated?

kmnip commented 2 years ago

That error actually came from gzip. ntCard was simply using it to decompress your FASTQ file. So, your gzip file is corrupted according to gzip. Try this and see if it is really truncated:

zcat Chongtam_00_trimmed.fq.gz | tail -n 4

You can try decompressing and then compressing the file again.

auesro commented 2 years ago

Result:

(RNA_Bloom) [lab_eh_2@NeuroServer trimmed]$ zcat Chongtam_00_trimmed.fq.gz | tail -n 4        
gzip: Chongtam_00_trimmed.fq.gz: unexpected end of file
@SRR13563559.53035757 NB501946:387:HNHTGBGXC:1:23107:17045:17168 length=70                                                               CACATATTCATAATCAGGGTCTTGCAGCAATTGAGAAGTGTGGAGAAGCTGCTCTAAATGACCCCA
+
AA
(RNA_Bloom) [lab_eh_2@NeuroServer trimmed]$       

How did this happen??

kmnip commented 2 years ago

That definitely looks truncated. Check your original FASTQ file that contains the read SRR13563559.53035757. Otherwise, you may have had multiple commands running at the same time and they write to the same file.

auesro commented 2 years ago

You were right, the original reads file:

(in202109) [lab_eh_2@NeuroServer Chongtam]$ zcat SRR13563559.fastq.gz | tail -n 4
@SRR13563559.68742091 NB501946:387:HNHTGBGXC:1:11311:18817:13247 length=70
TAGGATTGTTTGTACGGAATAGAGAAAAGTGTGGCAGAAAAGCGGAACGCGCAAAGAGAACGTATATGGT
+SRR13563559.68742091 NB501946:387:HNHTGBGXC:1:11311:18817:13247 length=70
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE

So I am reconstructing again, will report back!

Thanks!

auesro commented 2 years ago

Finally finished succesfully, thanks a lot for your help @kmnip !