Closed GKok closed 3 years ago
Update: it is sample specific.
When using the second bam file, the error disappears and I get the codon frequency plot, as expected.
The third file throws the exact same error for a different row:
Error in .Call2("C_solve_user_SEW", refwidths, start, end, width, translate.negative.coord, :
solving row 5611044: 'allow.nonnarrowing' is FALSE and the supplied end (973) is > refwidth
However, my question remains relevant, as I still would like to analyse all samples.
Hi Gautam, thank you for using riboWaltz.
First of all, is there any reason for using reads_list in the psite_offset() function and a filtered_list in the psite_info() function? Is it just an error? It should be:
psite_offset <- psite(reads_list, plot = TRUE)
reads_psite_list <- psite_info(reads_list, psite_offset)
Since the error you reported seems to be row-specific, could you check line 3172563 of _reads_psite_list [["SAMPLEtranscriptome"]] and see if there is something different wrt other rows? And at the same time could you check the length of the transcript corresponding to that row and if the psite position is falls within the transcript? I think it a matter of non-matching lengths, and this hypothesis is more or less confirmed by other issues in the web.
Best Fabio
Dear Fabio,
Thank you for your prompt response.
I did add a filter-step in between, but to keep my question short and to the point, I removed most of my code. I simply forgot to change filtered_list to reads_list. Well-noticed :-)
Here's the reads_list output of the two lines that cause the error, with two lines before and after.
> reads_list[["FSF001_transcriptome"]][3172561:3172565,]
transcript end5 end3 length cds_start cds_stop
1: ENST00000513016.5_1 1958 1984 27 0 0
2: ENST00000513960.5_1 1860 1886 27 254 3973
3: ENST00000265077.8_1 4854 4880 27 287 10477
4: ENST00000343200.9_1 1963 1989 27 357 7586
5: ENST00000369085.8_1 1198 1225 28 297 2024
> reads_list[["FSF003_transcriptome"]][5611042:5611046,]
transcript end5 end3 length cds_start cds_stop
1: ENST00000531493.5_1 285 314 30 81 788
2: ENST00000312562.7_1 254 283 30 50 865
3: ENST00000485591.1_1 587 616 30 0 0
4: ENST00000409416.6_1 931 960 30 1 2487
5: ENST00000397763.5_1 958 987 30 28 2784
Anything special about these values?
Gautam
Nothing special about those values, not really. But it is also true that in the first case cds_stop is veeeery high while in the second case is 0, so maybe... For "ENST00000265077.8_1" and "ENST00000485591.1_1" (the transcript showing errors) could you please 1) show me their corresponding lines in annotation_dt and 2) check if the same transcripts are present in those samples not throwing errors and 3) report the width of their sequence stored in the fasta? For the latter task here is the code to extract the transcript sequences from the genomic sequences (if it doesn't work, I might have forgotten some pieces. Just load all dependencies and check what happens):
fastapath <- "/path/to/Genome/GRCh37.primary_assembly.genome.fa"
refseq_sep = " "
gtfpath = gtf
path_to_gtf <- gtfpath
txdbanno <- GenomicFeatures::makeTxDbFromGFF(file = path_to_gtf,
format = "gtf",
dataSource = NA,
organism = NA)
temp_sequences <- Biostrings::readDNAStringSet(fastapath, format = "fasta", use.names = TRUE)
if(length(refseq_sep) != 0){
names(temp_sequences) <- tstrsplit(names(temp_sequences), refseq_sep, fixed = TRUE, keep = 1)[[1]]
}
exon <- suppressWarnings(GenomicFeatures::exonsBy(txdbanno, by = "tx", use.names = TRUE))
exon <- as.data.table(exon[unique(names(exon))])
sub_exon_plus <- exon[as.character(seqnames) %in% names(temp_sequences) & strand == "+"]
sub_exon_minus <- exon[as.character(seqnames) %in% names(temp_sequences) & strand == "-"
][, new_end := Biostrings::width(temp_sequences[as.character(seqnames)]) - start + 1
][, new_start := Biostrings::width(temp_sequences[as.character(seqnames)]) - end + 1]
seq_dt_plus <- sub_exon_plus[, nt_seq := "emp"
][, nt_seq := as.character(Biostrings::subseq(temp_sequences[as.character(seqnames)],
start = start,
end = end))
][, list(seq = paste(nt_seq, collapse = "")), by = group_name]
revcompl_temp_sequences <- Biostrings::reverseComplement(temp_sequences)
seq_dt_minus <- sub_exon_minus[, nt_seq := "emp"
][, nt_seq := as.character(Biostrings::subseq(revcompl_temp_sequences[as.character(seqnames)],
start = new_start,
end = new_end))
][, list(seq = paste(nt_seq, collapse = "")), by = group_name]
sequences <- Biostrings::DNAStringSet(c(seq_dt_plus$seq, seq_dt_minus$seq))
names(sequences) <- c(unique(sub_exon_plus$group_name), unique(sub_exon_minus$group_name))
Here are the specific transcripts in annotation_dt
> annotation_dt['ENST00000265077.8_1']
transcript l_tr l_utr5 l_cds l_utr3
1: ENST00000265077.8_1 12345 286 10191 1868
> annotation_dt['ENST00000485591.1_1']
transcript l_tr l_utr5 l_cds l_utr3
1: ENST00000485591.1_1 777 0 0 0
I checked if these two transcripts are present in another sample that doesn't throw an error, and they are indeed.
> reads_list[["FSF002_transcriptome"]] %>% filter (transcript == "ENST00000265077.8_1")
transcript end5 psite end3 length cds_start cds_stop psite_from_start psite_from_stop psite_region
1: ENST00000265077.8_1 43 NA 74 32 287 10477 NA NA 5utr
2: ENST00000265077.8_1 43 NA 74 32 287 10477 NA NA 5utr
3: ENST00000265077.8_1 70 NA 99 30 287 10477 NA NA 5utr
4: ENST00000265077.8_1 74 NA 100 27 287 10477 NA NA 5utr
5: ENST00000265077.8_1 76 NA 105 30 287 10477 NA NA 5utr
---
2873: ENST00000265077.8_1 10451 NA 10481 31 287 10477 NA NA 5utr
2874: ENST00000265077.8_1 10451 NA 10482 32 287 10477 NA NA 5utr
2875: ENST00000265077.8_1 10451 NA 10484 34 287 10477 NA NA 5utr
2876: ENST00000265077.8_1 10456 NA 10485 30 287 10477 NA NA 5utr
2877: ENST00000265077.8_1 11618 NA 11647 30 287 10477 NA NA 5utr
> reads_list[["FSF002_transcriptome"]] %>% filter (transcript == "ENST00000485591.1_1")
transcript end5 psite end3 length cds_start cds_stop psite_from_start psite_from_stop psite_region
1: ENST00000485591.1_1 390 NA 410 21 0 0 0 0 <NA>
2: ENST00000485591.1_1 390 NA 410 21 0 0 0 0 <NA>
3: ENST00000485591.1_1 390 NA 418 29 0 0 0 0 <NA>
4: ENST00000485591.1_1 390 NA 418 29 0 0 0 0 <NA>
5: ENST00000485591.1_1 390 NA 419 30 0 0 0 0 <NA>
---
1191: ENST00000485591.1_1 685 NA 710 26 0 0 0 0 <NA>
1192: ENST00000485591.1_1 686 NA 710 25 0 0 0 0 <NA>
1193: ENST00000485591.1_1 686 NA 710 25 0 0 0 0 <NA>
1194: ENST00000485591.1_1 688 NA 710 23 0 0 0 0 <NA>
1195: ENST00000485591.1_1 689 NA 710 22 0 0 0 0 <NA>
As for your code, it crashes with the following error (which is strange, because it is generic R and the manual is present). I'm trying to figure out what's wrong here.
> exon <- as.data.table(exon[unique(names(exon))])
Error in as.data.table(exon[unique(names(exon))]) :
could not find function "as.data.table"
The error seems related to the data.table package not loaded in your script.
As for the chunks of reads_list, there is clearly something wrong, and it might be (not 100% sure) related to the issue you are facing. You have psites set to NA. This is not possible, if you correctly followed the pipeline. I closed an issue a couple of days ago about this. Here the beginning of the discussion you should read. Let me know if your case is similar to that.
Hi Fabio,
My apologies, to find the transcript data from reads_table
, I just re-loaded it from bamtolist()
and didn't follow the subsequent steps, as I thought none would interfere with the data in reads_table
. When I do follow the rest of the steps, the psites are filled.
In the second transcript, however, the psite_region
still has NAs, which makes sense to me.
> reads_list[["FSF002_transcriptome"]] %>% filter (transcript == "ENST00000265077.8_1")
transcript end5 psite end3 length cds_start cds_stop psite_from_start psite_from_stop psite_region
1: ENST00000265077.8_1 43 56 74 32 287 10477 -231 -10421 5utr
2: ENST00000265077.8_1 43 56 74 32 287 10477 -231 -10421 5utr
3: ENST00000265077.8_1 70 82 99 30 287 10477 -205 -10395 5utr
4: ENST00000265077.8_1 74 86 100 27 287 10477 -201 -10391 5utr
5: ENST00000265077.8_1 76 88 105 30 287 10477 -199 -10389 5utr
---
2873: ENST00000265077.8_1 10451 10463 10481 31 287 10477 10176 -14 cds
2874: ENST00000265077.8_1 10451 10464 10482 32 287 10477 10177 -13 cds
2875: ENST00000265077.8_1 10451 10464 10484 34 287 10477 10177 -13 cds
2876: ENST00000265077.8_1 10456 10468 10485 30 287 10477 10181 -9 cds
2877: ENST00000265077.8_1 11618 11630 11647 30 287 10477 11343 1153 3utr
> reads_list[["FSF002_transcriptome"]] %>% filter (transcript == "ENST00000485591.1_1")
transcript end5 psite end3 length cds_start cds_stop psite_from_start psite_from_stop psite_region
1: ENST00000485591.1_1 390 402 410 21 0 0 0 0 <NA>
2: ENST00000485591.1_1 390 402 410 21 0 0 0 0 <NA>
3: ENST00000485591.1_1 390 402 418 29 0 0 0 0 <NA>
4: ENST00000485591.1_1 390 402 418 29 0 0 0 0 <NA>
5: ENST00000485591.1_1 390 402 419 30 0 0 0 0 <NA>
---
1191: ENST00000485591.1_1 685 697 710 26 0 0 0 0 <NA>
1192: ENST00000485591.1_1 686 698 710 25 0 0 0 0 <NA>
1193: ENST00000485591.1_1 686 698 710 25 0 0 0 0 <NA>
1194: ENST00000485591.1_1 688 700 710 23 0 0 0 0 <NA>
1195: ENST00000485591.1_1 689 701 710 22 0 0 0 0 <NA>
The previous issue you refer to may be partially related, indeed.
Initially, I had not added any filters yet, but calculated the psite for all reads, then wanted to use codon_usage_psite()
on reads of all lenghts. That caused the errors.
psite_offset <- psite(reads_list, plot = FALSE)
reads_psite_list <- psite_info(reads_list, psite_offset)
However, when I do add length filters, e.g., 21-11 and 27-29, (and thereby exclude the reads that cause errors?), all samples run successfully. The rest of the pipeline and drawing the codon frequency plots then works fine.
Using your code, it seems the transcripts are respectively 12345 (!) and 777 nucleotides in length.
> sequences$ENST00000265077.8_1
12345-letter DNAString object
seq: GCATTTGAACTTGCAGGCGAGCTGCCCCGAGCCTTTCTGGGGAAGAACTCCAGGCGTGCGGACGC...TCTTTGTAACTTTTTATATCTGCTTTTGTTTCACCAAAGAAACCTAAAATCCTTCTTTTACTACA
> sequences$ENST00000485591.1_1
777-letter DNAString object
seq: ATCTCACTATGTTGCCCAGCCTGGTCTCTAACTCCTGGCCTCAAGCAGTCCTCCCTGCCTCGGCC...AGGGCCGTTTGCACCCCTCCTTCAGCCTCGGCCCAGGGGGTGTGGGTGCCTGACCGGGTGGGAGG
In the second transcript, however, the psite_region still has NAs, which makes sense to me.
This make sense, since transcript ENST00000485591.1_1 has no annotated CDS and UTRs, so it is impossibile to assign a specific region.
However, when I do add length filters, e.g., 21-11 and 27-29, (and thereby exclude the reads that cause errors?), all samples run successfully. The rest of the pipeline and drawing the codon frequency plots then works fine.
This is somehow odd and I think it's a coincidence that removing specific lengths you solve the issue. The only way I can check this is for you to send me the data table throwing errors, the annotation data table, the FASTA file and the GTF you passed to codon_usage_psite(). It might be heavy, but at least I can run the whole function one line at a time and see where the problem pops out. Unless you yourself want to do this by taking advantage of the code available in GitHub.
Using your code, it seems the transcripts are respectively 12345 (!) and 777 nucleotides in length.
Okay so they are in line with the annotation data table and the position of the P-sites is not out of bounds.
I will share the files with you, debugging the source code is beyond my skill. I'm grateful you're willing to help to this extent.
I have prepared the script that throws the errors below. Please find all necessary files here.
library(devtools)
#install_github("LabTranslationalArchitectomics/riboWaltz", dependencies = TRUE)
library("riboWaltz")
library("Rcpp")
library("GenomicFeatures")
dir_bam <- "Data 2018/Aligned/hg19-STAR/transcriptome"
dir_wd <- "Analysis/R"
gtf <- "Data 2018/Genome/gencode.v38lift37.annotation.gtf"
setwd(dir_wd)
samples <- c("FSF001_transcriptome",
"FSF002_transcriptome",
"FSF003_transcriptome")
# Load GTF annotation
annotation_dt <- create_annotation(gtfpath = gtf)
# Load BAM files, include GTF annotation
reads_list <- bamtolist(bamfolder = dir_bam, annotation = annotation_dt, transcript_align = FALSE)
# Visualize read length distr
rlength_distr(reads_list, sample = samples)
# Filter for duplicates - IS THIS EVEN NECCESSARY??
filtered_list <- duplicates_filter(data = reads_list,
extremity = "both") # Reads are duplicates if they map on the same transcript and share both the 5' estremity and the 3' extremity.
# P-site offset
psite_offset <- psite(filtered_list, plot = FALSE)
reads_psite_list <- psite_info(filtered_list, psite_offset)
# Plots
## P-sites per region
psite_region <- region_psite(reads_psite_list,
annotation_dt,
sample = samples)
psite_region[["plot"]]
## Trinucleotide periodicity
frames <- frame_psite(reads_psite_list,
sample = samples,
region = "all")
frames[["plot"]]
## Metaplots
metaprofile_split <- metaprofile_psite(filtered_list,
annotation_dt,
sample = samples,
multisamples = "average",
plot_style = "split",
utr5l = 20, cdsl = 40, utr3l = 20,
frequency = TRUE,
plot_title = "transcript",
colour = c("aquamarine4", "gray70"))
metaprofile_split[["plot"]]
## Codon usage
cu_barplot_FSF001 <- codon_usage_psite(reads_psite_list,
annotation_dt,
site = "asite",
sample = c("FSF001_transcriptome"),
fastapath = "Data 2018/Genome/GRCh37.primary_assembly.genome.fa",
fasta_genome = TRUE,
refseq_sep = " ",
gtfpath = gtf,
frequency_normalization = FALSE,
label_aminoacid = TRUE
)
# Error in .Call2("C_solve_user_SEW", refwidths, start, end, width, translate.negative.coord, :
#solving row 639386: 'allow.nonnarrowing' is FALSE and the supplied end (318) is > refwidth
cu_barplot_FSF002 <- codon_usage_psite(reads_psite_list,
annotation_dt,
site = "asite",
sample = c("FSF002_transcriptome"),
fastapath = "Data 2018/Genome/GRCh37.primary_assembly.genome.fa",
fasta_genome = TRUE,
refseq_sep = " ",
gtfpath = gtf,
frequency_normalization = FALSE,
label_aminoacid = TRUE
)
# No error
cu_barplot_FSF002[["plot"]]
cu_barplot_FSF003 <- codon_usage_psite(reads_psite_list,
annotation_dt,
site = "asite",
sample = c("FSF003_transcriptome"),
fastapath = "Data 2018/Genome/GRCh37.primary_assembly.genome.fa",
fasta_genome = TRUE,
refseq_sep = " ",
gtfpath = gtf,
frequency_normalization = FALSE,
label_aminoacid = TRUE
)
#Error in .Call2("C_solve_user_SEW", refwidths, start, end, width, translate.negative.coord, :
#solving row 5611044: 'allow.nonnarrowing' is FALSE and the supplied start (974) is > refwidth + 1
Thank you for the extensive report of your script. I'll take a look as soon as possible.
Fabio
Okay, I got it. I tested the function with "FSF001_transcriptome", but I guess the problem is the same for "FSF003_transcriptome". Actually, you were almost right by suggesting that the length of the reads might be the problem. Specifically, it is a combination of three factors:
length total_percentage start_percentage around_start offset_from_5 offset_from_3 corrected_offset_from_5 corrected_offset_from_3
14 0.183 0.023 T NA NA 12 1
and the line of reads_psite_list[["FSF001_transcriptome"]] throwing the error:
transcript end5 psite end3 length cds_start cds_stop psite_from_start psite_from_stop psite_region
ENST00000493816.1_1 301 313 314 14 0 0 0 0 <NA>
where the first nucleotide of the P-site is at position 313 and the 3' extremity of the read (end3) is at position 314. Thus, the position of the three nucleotides of the P-site would be 313, 314 and 315 and for the A-site 316, 317 and 318. We can agree that biologically speaking this makes no sense at all because they are outside the footprint of the ribosome. But computationally speaking this would be fine, if not for the third point:
3) some of the 14nts-long reads map at the very end of the transcript sequence. For example, the length of the transcript associated to the error above is 315 nts. Thus, when riboWaltz tries to extract the tree A-site nucleotides for that specific read obviously it returns an error, because their position is outside the 3' limit of the transcript.
To be honest, I never faced this issue, possibly because I usually keep reads longer than 19 nts since biologically speaking it doesn't make sense to include shorter reads. But this also means that the probability to find the P-site too close to the extremity of the read is very low, if not null. But given this experience, I will think about additional controls to avoid the same problem in the future.
Meanwhile, I think you should be able to run everything just removing the 14nts-long reads, at least for "FSF001_transcriptome". For "FSF003_transcriptome" look at its rows in _psiteoffset and check if the problem might be the same and if there are other lengths showing the same behaviour.
Let me know. If everything is fine I will close the issue.
Best Fabio
Thank you, Fabio, for investigating the issue.
In none of the prior steps did I filter for lengths of the RPFs, as I didn't deem this necessary. This may have caused some noise reads mapped to the transcriptome (e.g., of length 14 nt). In the final analyses I was planning to filter for lenghts 21-22 and 27-29 anyways, so I didn't deem this necessary. As an initial check for quality of all data I wanted to run the function on the full dataset, when I encountered the error. By then I hadn't tried filtering for lengths yet.
I'm glad you've figured out the issue and happy to know it's very easily bypassed by filtering for the real lengths of RPFs.
This issue and/or a short notice in the manual would probably prevent others from encountering this in the future :-)
Best,
Gautam
Hi there,
I'm using riboWaltz for the first time as it offers all the funcitonality I need to analyse my ribosome sequencing data. Thank you for efforts in sharing your tool.
I have multiple datasets (bam files), aligned with STAR using
quantMode --TranscriptomeSAM
to hg19 (downloaded from Gencode) with the latest annotation (v38). I load them usingbamtolist()
.Drawing the plots and first analyses of the data work flawless, until I get to the
codon_usage_psite()
function.After running
codon_usage_psite()
, I get the following error:My code that throws the error:
I made absolutely sure that the fasta and GTF files are the exact same files I used for alignment with STAR. I added
refseq_sep = "[space]"
to overcome another error (data in fasta as:chr1[space]1
).Do you have an idea of what's going on, and what I have to do to fix it?
Thanks a lot,
Gautam
sessionInfo():