Closed: auesro closed this issue 2 years ago
As mentioned previously, you need to trim the UMIs and cell barcodes entirely from the reads with `umi_tools extract`. After that, only the right reads (`_2.fastq`) contain sequences that can be assembled.
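For example, assuming 10x Chromium v3 chemistry (16 bp cell barcode followed by a 12 bp UMI at the start of read 1), the extraction might look like this; the pattern and file names below are placeholders, so adjust them to your chemistry:

```shell
# Assumed 10x v3 barcode layout: 16 cell-barcode bases (C) + 12 UMI bases (N)
BC_PATTERN=CCCCCCCCCCCCCCCCNNNNNNNNNNNN

if command -v umi_tools >/dev/null; then
    # Move the barcode+UMI from read 1 into the read names of both mates;
    # read 1 carries no cDNA after extraction, so its output is discarded.
    umi_tools extract \
        --bc-pattern="$BC_PATTERN" \
        -I sample_1.fastq.gz --read2-in=sample_2.fastq.gz \
        -S /dev/null --read2-out=sample_2.extracted.fastq.gz
else
    echo "umi_tools not installed; skipping extraction"
fi
```

Only `sample_2.extracted.fastq.gz` is needed downstream, since read 1 contains nothing but the barcode and UMI.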
You have several options here. You can align the right reads (with bwa/minimap2) to either your CDS, reference transcripts, or the reference genome. For example, the gene for "Bgal" should be GLB1 (chr3:32,996,617-33,097,146 in GRCh38).
After alignment, you can extract the aligned reads with `samtools fastq`:
samtools fastq -F 0x4 alignment.bam > extracted_reads.fastq
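The alignment step itself is a minimal sketch like the following; the `sr` short-read preset for minimap2 and the file names are assumptions, not prescriptions from this thread:

```shell
REF=reference.fa          # placeholder: CDS, reference transcripts, or genome
READS=sample_2.fastq      # placeholder: right reads after UMI/barcode removal

if command -v minimap2 >/dev/null && command -v samtools >/dev/null; then
    # Short-read preset; sort so the BAM can be indexed and queried.
    minimap2 -ax sr "$REF" "$READS" | samtools sort -o alignment.bam -
    samtools index alignment.bam
else
    echo "minimap2/samtools not installed; skipping alignment"
fi
```

The `-F 0x4` filter in the `samtools fastq` command above then keeps only mapped reads, since 0x4 is the SAM "read unmapped" flag.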
You can assemble the extracted reads with RNA-Bloom:
java -jar RNA-Bloom.jar -sef extracted_reads.fastq -ntcard -t THREADS -outdir OUTDIR
A couple of doubts:
- Read2 should not contain any UMI or barcode, right? I mean, my average insert size is 400 bp and read2 is 88 bases long... it is unlikely that any insert was small enough to read through to the other side, right?
- If I align my reads to the CDS (that's the only thing I have for now), I will keep only the reads that map to the CDS and discard reads mapping outside it... which would discard precisely what I am looking for: the reads that map to the 3'UTR. Or am I missing something?
Based on the sample data from 10x, I don't think the right read (98-bp) contains any UMI/barcodes at all. But if you believe that yours do, please do remove them before alignment/assembly.
The insert sizes probably peak at 400 bp, but they should vary a bit.
Yes, you are correct about the CDS missing the UTR. That's why I suggested reference transcript and genome as alternatives.
The problem is that, being a transgene (a sequence inserted artificially into the mouse genome so that it is expressed by select cells, allowing us to identify them), the sequence is not present in any reference; I only know the CDS. I need a way to select the reads that map to the CDS and use those to "build" a transcript by adding more reads (especially in the 3' direction of the CDS). That's why I initially thought Kollector could do it.
You can just assemble the trimmed reads and then extract the transcripts that align to your CDS?
Mmm, I guess so. You mean:
- Discard read1
- Trim any UMI, cell barcode, etc. from read2
- Perform de novo transcriptome assembly with ALL read2 using RNA-Bloom
- BLAST the assembled transcriptome against my CDS
- Pick candidate transcripts and identify the 3'UTR from those long enough?
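The BLAST step of this plan could be sketched as follows; the file names, E-value cutoff, and the choice of `blastn` with tabular output are assumptions for illustration:

```shell
DB=CDS.fa                        # placeholder: transgene CDS sequences
QUERY=rnabloom.transcripts.fa    # placeholder: assembled transcripts

if command -v makeblastdb >/dev/null && command -v blastn >/dev/null; then
    # Build a nucleotide database from the CDS, then report transcripts
    # hitting it in tabular format (query ID is column 1).
    makeblastdb -in "$DB" -dbtype nucl
    blastn -query "$QUERY" -db "$DB" -outfmt 6 -evalue 1e-20 > cds_hits.tsv
    cut -f1 cds_hits.tsv | sort -u > candidates.txt
else
    echo "BLAST+ not installed; skipping"
fi
```

The IDs in `candidates.txt` could then be used to pull candidate transcripts out of the assembly for inspection of their 3' ends.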
Yes, that seems like your only option, because your left reads most likely contain only UMIs + cell barcodes. Kollector would not work in this case because its recruitment step expects paired-end reads, where one read of the pair would anchor on a given sequence.
Alright. I hope RNA-Bloom will not crash! Then my next question is: I have 2 samples (= libraries, from different animals); should I supply RNA-Bloom with the 2 read2 files together, or is it better to do them one by one?
Let me know if you run into any issues. If they are different animals, then you should assemble them separately.
Thanks a lot, Ka! Will keep you posted
Unfortunately, I ran into something...
(RNA_Bloom) [lab_eh_2@NeuroServer RNA_Bloom]$ rnabloom -fpr 0.01 -ser /home/LAB/lab_eh_2/AER_Data/snRNAseq/C1/trimmed/Clean_rRNA/A_S1_L001_R2_001_unpaired_norRNA_2.fq.gz /home/LAB/lab_eh_2/AER_Data/snRNAseq/C1/trimmed/Clean_rRNA/A_S1_L001_R2_001_val_norRNA_2.fq.gz -ntcard -t 30 -outdir output
RNA-Bloom v1.4.3
args: [-fpr, 0.01, -ser, /home/LAB/lab_eh_2/AER_Data/snRNAseq/C1/trimmed/Clean_rRNA/A_S1_L001_R2_001_unpaired_norRNA_2.fq.gz, /home/LAB/lab_eh_2/AER_Data/snRNAseq/C1/trimmed/Clean_rRNA/A_S1_L001_R2_001_val_norRNA_2.fq.gz, -ntcard, -t, 30, -outdir, output]
name: rnabloom
outdir: output
K-mer counting with ntCard...
Running command: `ntcard -t 30 -k 25 -c 65535 -p output/rnabloom @output/rnabloom.ntcard.readslist.txt`...
Parsing histogram file `output/rnabloom_k25.hist`...
Unique k-mers (k=25): 1.990.801.340
Unique k-mers (k=25,c>1): 802.505.180
K-mer counting completed in 13m 57s
Bloom filters Memory (GB)
====================================
de Bruijn graph: 4.399367
k-mer counting: 14.187311
paired k-mers (reads): 4.399367
paired k-mers (frags): 4.399367
screening: 4.399367
====================================
Total: 31.784777
> Stage 1: Construct graph from reads (k=25)
Sampling read lengths (l>=25, n=1000) from each file...
min Q1 M Q3 max file
61 87 88 88 88 /home/LAB/lab_eh_2/AER_Data/snRNAseq/C1/trimmed/Clean_rRNA/A_S1_L001_R2_001_unpaired_norRNA_2.fq.gz
78 87 88 88 88 /home/LAB/lab_eh_2/AER_Data/snRNAseq/C1/trimmed/Clean_rRNA/A_S1_L001_R2_001_val_norRNA_2.fq.gz
Weighted [Q1,M,Q3]: [87, 88, 88]
Absolute [min,max]: [61, 88]
Paired k-mers distance: 52
Max. tip length: 63
Parsing `/home/LAB/lab_eh_2/AER_Data/snRNAseq/C1/trimmed/Clean_rRNA/A_S1_L001_R2_001_unpaired_norRNA_2.fq.gz`...
Parsed 100.549 sequences in 1.42s
Parsing `/home/LAB/lab_eh_2/AER_Data/snRNAseq/C1/trimmed/Clean_rRNA/A_S1_L001_R2_001_val_norRNA_2.fq.gz`...
Parsed 345.166.566 sequences in 1h 6m 3s
Parsed 345.267.115 sequences from 2 files.
DBG Bloom filter FPR: 0.999 %
Counting Bloom filter FPR: 1.01 %
Reads paired k-mers Bloom filter FPR: 0.25 %
> Stage 1 completed in 1h 6m 50s
> Skipping Stage 2 for single-end reads.
> Stage 3: Assemble transcripts for "rnabloom"
ERROR: null
java.lang.NullPointerException
at rnabloom.RNABloom$SingleEndReadsIterator.<init>(RNABloom.java:4670)
at rnabloom.RNABloom.assembleSingleEndReads(RNABloom.java:4805)
at rnabloom.RNABloom.assembleTranscriptsSE(RNABloom.java:5379)
at rnabloom.RNABloom.main(RNABloom.java:7104)
(RNA_Bloom) [lab_eh_2@NeuroServer RNA_Bloom]$
Any ideas what's happening here?
It looks like you have found a bug.
You can bypass it by using both `-ser` and `-sef`, i.e.
rnabloom -fpr 0.01 \
-ser /home/LAB/lab_eh_2/AER_Data/snRNAseq/C1/trimmed/Clean_rRNA/A_S1_L001_R2_001_unpaired_norRNA_2.fq.gz \
-sef /home/LAB/lab_eh_2/AER_Data/snRNAseq/C1/trimmed/Clean_rRNA/A_S1_L001_R2_001_val_norRNA_2.fq.gz \
-ntcard -t 30 -outdir output
Since your reads are not strand-specific, it doesn't matter whether you specify your reads with `-ser` or `-sef`.
I will fix this in the next release. Sorry about that!
Alright, good to know it was not something worse. I was already fiddling with the Java options.
So you are saying the problem was that I specified 2 input read files with only one argument, right? It should have been `-ser X -sef Y`, not `-ser X Y`. But then that means the bug is that you cannot, so far, provide more than 1 read file per "modality" (-left, -right, -sef, -ser), correct?
Also, I was wondering if I could use the -mergepool option here, given that I only have single-end reads: how should I specify the files from different samples in the `readslist.txt` file? Given that I need to put them in the 4th column, do I just leave columns 2 and 3 empty?
Sample1 sef_file ser_file
Sample2 sef_file ser_file
No, the bug was that `-ser` and `-sef` had to be used together.
In version 1.4.3, the `-pool` option doesn't support single-end reads (yet):
https://github.com/bcgsc/RNA-Bloom/releases/tag/v1.4.3
support for single-end reads in -pool will be implemented in a future release
It will be available in the next release though.
Ok, thanks for the info! Is it possible to use the -mergepool parameter if I provide several manually assembled transcriptomes?
I am running it now with the first sample, fingers crossed!
Just wondering, is it normal that it will take more than 9 hours to assemble a transcriptome from 345.267.115 sequences using 30 threads? (Currently at "Reducing redundancy in assembled transcripts...").
At this stage, it is reducing the redundancy in the assembled sequences to generate `*.transcripts.nr.fa`, but you can already use the sequences in `*.transcripts.fa`.
345 million reads is not a small amount of data, and assembly of single-end reads is less efficient than paired-end assembly. So, that is kind of expected.
Alright, thanks Ka!
Regarding the output so far, I guess the `transcripts.short.fa` file contains the assembled transcripts shorter than a certain threshold, right? Is it taking those into account for the redundancy reduction too?
Yes, the default length threshold is 200. Those short sequences are not included in the redundancy reduction process.
I am back to report that the assembly was quite successful. However, I am now trying to assemble reads from a different experiment (traditional RNA-seq, single-end reads, downloaded from SRA) and I am facing an issue similar to the one here:
(RNA_Bloom) [lab_eh_2@NeuroServer denovo_transcriptome]$ rnabloom -fpr 0.01 -sef /home/LAB/lab_eh_2/AER_Data/Chamessian/3/trimmed/SRR7039666_3_trimmed.fq.gz -ntcard -t 30 -outdir output
RNA-Bloom v1.4.3
args: [-fpr, 0.01, -sef, /home/LAB/lab_eh_2/AER_Data/Chamessian/3/trimmed/SRR7039666_3_trimmed.fq.gz, -ntcard, -t, 30, -outdir, output]
name: rnabloom
outdir: output
WARNING: Output directory does not exist!
Created output directory at `output`
K-mer counting with ntCard...
Running command: `ntcard -t 30 -k 25 -c 65535 -p output/rnabloom @output/rnabloom.ntcard.readslist.txt`...
Parsing histogram file `output/rnabloom_k25.hist`...
Unique k-mers (k=25): 262.333.277
Unique k-mers (k=25,c>1): 129.744.533
K-mer counting completed in 31.585s
Bloom filters Memory (GB)
====================================
de Bruijn graph: 0.57971644
k-mer counting: 2.2937248
paired k-mers (reads): 0.57971644
paired k-mers (frags): 0.57971644
screening: 0.57971644
====================================
Total: 4.612591
> Stage 1: Construct graph from reads (k=25)
Sampling read lengths (l>=25, n=1000) from each file...
min Q1 M Q3 max file
28 50 51 51 51 /home/LAB/lab_eh_2/AER_Data/Chamessian/3/trimmed/SRR7039666_3_trimmed.fq.gz
Weighted [Q1,M,Q3]: [50, 51, 51]
Absolute [min,max]: [28, 51]
Paired k-mers distance: 15
Max. tip length: 26
Parsing `/home/LAB/lab_eh_2/AER_Data/Chamessian/3/trimmed/SRR7039666_3_trimmed.fq.gz`...
Parsed 25.694.006 sequences in 2m 14s
DBG Bloom filter FPR: 0.998 %
Counting Bloom filter FPR: 1.01 %
Reads paired k-mers Bloom filter FPR: 0.335 %
> Stage 1 completed in 2m 23s
> Skipping Stage 2 for single-end reads.
> Stage 3: Assemble transcripts for "rnabloom"
ERROR: null
java.lang.NullPointerException
at rnabloom.RNABloom$SingleEndReadsIterator.<init>(RNABloom.java:4679)
at rnabloom.RNABloom.assembleSingleEndReads(RNABloom.java:4805)
at rnabloom.RNABloom.assembleTranscriptsSE(RNABloom.java:5379)
at rnabloom.RNABloom.main(RNABloom.java:7104)
How do I specify the reads when I only have 1 file?
Yes, that was the same bug as before. Sorry for the inconvenience.
You need to split your FASTQ file in half and use the `-ser` and `-sef` options together.
For example, you can split the file like so:
zcat SRR7039666_3_trimmed.fq.gz | split -l 51400000 - partition_
mv partition_aa SRR7039666_3_trimmed.01.fq
mv partition_ab SRR7039666_3_trimmed.02.fq
gzip SRR7039666_3_trimmed.01.fq SRR7039666_3_trimmed.02.fq
rnabloom -fpr 0.01 -sef SRR7039666_3_trimmed.01.fq.gz -ser SRR7039666_3_trimmed.02.fq.gz -ntcard -t 30 -outdir output
Ok, that was the smart move.
I just ran rnabloom with the same reads file twice like:
rnabloom -fpr 0.01 -sef /home/LAB/lab_eh_2/AER_Data/Chamessian/3/trimmed/SRR7039666_3_trimmed.fq.gz -ser /home/LAB/lab_eh_2/AER_Data/Chamessian/3/trimmed/SRR7039666_3_trimmed.fq.gz -ntcard -t 30 -outdir output
Is that crazy? The file is small, so the assembly already finished; can I use those results, or is it better not to?
It is okay if noise is not an issue. Since the read file is relatively small, I suggest you re-run the assembly with the split files.
Alright! Thanks!
I just realized that I messed up the number for the example `split` command previously.
25,694,006 reads * 4 lines/read = 102,776,024 lines
Split in half, that would be 51,388,012 lines per partition.
So, something like `-l 51400000` would yield exactly two files.
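A way to avoid this arithmetic altogether is to compute the partition size from the file itself; here is a self-contained sketch using a toy FASTQ in place of the real reads:

```shell
# Toy FASTQ with 4 reads (16 lines) standing in for the real file.
for i in 1 2 3 4; do
    printf '@read%s\nACGT\n+\nIIII\n' "$i"
done > toy.fq

total=$(wc -l < toy.fq)              # total line count (a multiple of 4)
half=$(( (total / 2 + 3) / 4 * 4 ))  # at least half, rounded UP to a multiple of 4

# Rounding up guarantees exactly two partitions, each ending on a read boundary.
split -l "$half" toy.fq partition_

wc -l partition_aa partition_ab
```

Rounding down instead could produce a third leftover file, which is the failure mode being avoided here.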
Don't worry, fixed it! Thanks!
Hi Ka,
Not sure if I should open a new issue or keep adding here, because this is a new error not discussed previously. If you prefer, I will close this and post as new issue. In the meantime:
I have been de novo assembling several public datasets, in this case from an RNA-seq experiment (not single cell). The only difference I can see is that the read files are way bigger than before (I merged 33 single-end SRR files together and then split in 2 as recommended above: each reads file is about 39 GB). And I get this:
(RNA_Bloom) [lab_eh_2@NeuroServer trimmed]$ rnabloom -fpr 0.01 -sef *_00_trimmed.fq.gz -ser *_01_trimmed.fq.gz -ntcard -t 30 -outdir output
RNA-Bloom v1.4.3
args: [-fpr, 0.01, -sef, Chongtam_00_trimmed.fq.gz, -ser, Chongtam_01_trimmed.fq.gz, -ntcard, -t, 30, -outdir, output]
name: rnabloom
outdir: output
K-mer counting with ntCard...
Running command: `ntcard -t 30 -k 25 -c 65535 -p output/rnabloom @output/rnabloom.ntcard.readslist.txt`...
ERROR: Error running ntCard!
Not a very informative error... any ideas?
Hi,
Can you please run this command and see whether you got an error from ntCard?
ntcard -t 30 -k 25 -c 65535 -p output/rnabloom @output/rnabloom.ntcard.readslist.txt
Thanks!
Mmmm this is what I get...
(RNA_Bloom) [lab_eh_2@NeuroServer trimmed]$ ntcard -t 30 -k 25 -c 65535 -p output/rnabloom@output/rnabloom.ntcard.readslist.txt
gzip: Chongtam_00_trimmed.fq.gz: unexpected end of file
PID 3694 exited with status 1
That means your FASTQ file `Chongtam_00_trimmed.fq.gz` is truncated.
Ok.
I made it by concatenating several single-end FASTQs, counting lines, and dividing in half as previously advised. How could the file be truncated?
That error actually came from `gzip`; ntCard was simply using it to decompress your FASTQ file. So, your gzip file is corrupted according to `gzip`.
Try this and see if it is really truncated:
zcat Chongtam_00_trimmed.fq.gz | tail -n 4
You can try decompressing and then compressing the file again.
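A quicker whole-file check is `gzip -t`, which tests archive integrity without writing anything to disk; a self-contained demonstration with a deliberately truncated copy:

```shell
# Make a valid gzip file, then a truncated copy (12 bytes is barely past
# the 10-byte gzip header, so the deflate stream is cut off).
printf '@r1\nACGT\n+\nIIII\n' | gzip > intact.fq.gz
head -c 12 intact.fq.gz > truncated.fq.gz

gzip -t intact.fq.gz && echo "intact.fq.gz: OK"
gzip -t truncated.fq.gz 2>/dev/null || echo "truncated.fq.gz: corrupted/truncated"
```

Running this check on each partition right after splitting would catch a truncated file before a multi-hour assembly starts.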
Result:
(RNA_Bloom) [lab_eh_2@NeuroServer trimmed]$ zcat Chongtam_00_trimmed.fq.gz | tail -n 4
gzip: Chongtam_00_trimmed.fq.gz: unexpected end of file
@SRR13563559.53035757 NB501946:387:HNHTGBGXC:1:23107:17045:17168 length=70
CACATATTCATAATCAGGGTCTTGCAGCAATTGAGAAGTGTGGAGAAGCTGCTCTAAATGACCCCA
+
AA
(RNA_Bloom) [lab_eh_2@NeuroServer trimmed]$
How did this happen??
That definitely looks truncated.
Check your original FASTQ file, the one that contains the read SRR13563559.53035757.
Otherwise, you may have had multiple commands running at the same time, all writing to the same file.
You were right, the original reads file:
(in202109) [lab_eh_2@NeuroServer Chongtam]$ zcat SRR13563559.fastq.gz | tail -n 4
@SRR13563559.68742091 NB501946:387:HNHTGBGXC:1:11311:18817:13247 length=70
TAGGATTGTTTGTACGGAATAGAGAAAAGTGTGGCAGAAAAGCGGAACGCGCAAAGAGAACGTATATGGT
+SRR13563559.68742091 NB501946:387:HNHTGBGXC:1:11311:18817:13247 length=70
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
So I am reconstructing again, will report back!
Thanks!
Finally finished successfully, thanks a lot for your help @kmnip !
Hi all,
I have a set of 10x scRNA-seq datasets (nuclei instead of cells, to be precise) derived from a mouse line expressing several transgenes (sfGFP and Bgal, in this case). I only know (and only hypothetically) the sequence of the CDS of these transgenes. Given that 10x mRNA sequencing mostly targets the 3'-most end of the 3'UTR of mRNAs, I need to reconstruct as much of the transgene sequence as possible in order to correctly map the reads (most of them targeting the 3'UTR) and identify positive nuclei expressing the transgenes.
Any suggestions?
Thanks!
A