LabTranslationalArchitectomics / riboWaltz

optimization of ribosome P-site positioning in ribosome profiling data
MIT License
46 stars 12 forks source link

Error using codon_usage_psite() #45

Closed GKok closed 3 years ago

GKok commented 3 years ago

Hi there,

I'm using riboWaltz for the first time as it offers all the funcitonality I need to analyse my ribosome sequencing data. Thank you for efforts in sharing your tool.

I have multiple datasets (bam files), aligned with STAR using quantMode --TranscriptomeSAM to hg19 (downloaded from Gencode) with the latest annotation (v38). I load them using bamtolist().

Drawing the plots and first analyses of the data work flawless, until I get to the codon_usage_psite() function.

After running codon_usage_psite(), I get the following error:

Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Error in .Call2("C_solve_user_SEW", refwidths, start, end, width, translate.negative.coord,  : 
  solving row 3172563: 'allow.nonnarrowing' is FALSE and the supplied end (973) is > refwidth

My code that throws the error:

library("riboWaltz")
library("Rcpp")
library("GenomicFeatures")

dir_bam <- "/path/to/transcriptome"
gtf <- "/path/to/Genome/gencode.v38lift37.annotation.gtf"

annotation_dt <- create_annotation(gtfpath = gtf)
reads_list <- bamtolist(bamfolder = dir_bam, annotation = annotation_dt, transcript_align = FALSE)

psite_offset <- psite(reads_list, plot = TRUE)
reads_psite_list <- psite_info(filtered_list, psite_offset)

## Codon usage
cu_barplot <- codon_usage_psite(reads_psite_list, 
                                annotation_dt, 
                                site = "psite", 
                                sample = c("SAMPLE_transcriptome"),
                                fastapath = "/path/to/Genome/GRCh37.primary_assembly.genome.fa",
                                fasta_genome = TRUE,
                                refseq_sep = " ",
                                gtfpath = gtf,
                                frequency_normalization = FALSE,
                                label_aminoacid = TRUE
)  # Throws an error

I made absolutely sure that the fasta and GTF files are the exact same files I used for alignment with STAR. I added refseq_sep = "[space]" to overcome another error (data in fasta as: chr1[space]1).

Do you have an idea of what's going on, and what I have to do to fix it?

Thanks a lot,

Gautam

sessionInfo():

Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS 12.0.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] nl_NL.UTF-8/nl_NL.UTF-8/nl_NL.UTF-8/C/nl_NL.UTF-8/nl_NL.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] GenomicFeatures_1.42.3 AnnotationDbi_1.52.0   Biobase_2.50.0         GenomicRanges_1.42.0   GenomeInfoDb_1.26.7   
 [6] IRanges_2.24.1         S4Vectors_0.28.1       BiocGenerics_0.36.1    devtools_2.4.2         usethis_2.1.3   
GKok commented 3 years ago

Update: it is sample specific.

When using the second bam file, the error disappears and I get the codon frequency plot, as expected.

The third file throws the exact same error for a different row:

Error in .Call2("C_solve_user_SEW", refwidths, start, end, width, translate.negative.coord,  : 
  solving row 5611044: 'allow.nonnarrowing' is FALSE and the supplied end (973) is > refwidth

However, my question remains relevant, as I still would like to analyse all samples.

fabiolauria commented 3 years ago

Hi Gautam, thank you for using riboWaltz.

First of all, is there any reason for using reads_list in the psite_offset() function and a filtered_list in the psite_info() function? Is it just an error? It should be:

psite_offset <- psite(reads_list, plot = TRUE)
reads_psite_list <- psite_info(reads_list, psite_offset)

Since the error you reported seems to be row-specific, could you check line 3172563 of _reads_psite_list [["SAMPLEtranscriptome"]] and see if there is something different wrt other rows? And at the same time could you check the length of the transcript corresponding to that row and if the psite position is falls within the transcript? I think it a matter of non-matching lengths, and this hypothesis is more or less confirmed by other issues in the web.

Best Fabio

GKok commented 3 years ago

Dear Fabio,

Thank you for your prompt response.

I did add a filter-step in between, but to keep my question short and to the point, I removed most of my code. I simply forgot to change filtered_list to reads_list. Well-noticed :-)

Here's the reads_list output of the two lines that cause the error, with two lines before and after.

> reads_list[["FSF001_transcriptome"]][3172561:3172565,]
            transcript end5 end3 length cds_start cds_stop
1: ENST00000513016.5_1 1958 1984     27         0        0
2: ENST00000513960.5_1 1860 1886     27       254     3973
3: ENST00000265077.8_1 4854 4880     27       287    10477
4: ENST00000343200.9_1 1963 1989     27       357     7586
5: ENST00000369085.8_1 1198 1225     28       297     2024

> reads_list[["FSF003_transcriptome"]][5611042:5611046,]
            transcript end5 end3 length cds_start cds_stop
1: ENST00000531493.5_1  285  314     30        81      788
2: ENST00000312562.7_1  254  283     30        50      865
3: ENST00000485591.1_1  587  616     30         0        0
4: ENST00000409416.6_1  931  960     30         1     2487
5: ENST00000397763.5_1  958  987     30        28     2784

Anything special about these values?

Gautam

fabiolauria commented 3 years ago

Nothing special about those values, not really. But it is also true that in the first case cds_stop is veeeery high while in the second case is 0, so maybe... For "ENST00000265077.8_1" and "ENST00000485591.1_1" (the transcript showing errors) could you please 1) show me their corresponding lines in annotation_dt and 2) check if the same transcripts are present in those samples not throwing errors and 3) report the width of their sequence stored in the fasta? For the latter task here is the code to extract the transcript sequences from the genomic sequences (if it doesn't work, I might have forgotten some pieces. Just load all dependencies and check what happens):

fastapath <- "/path/to/Genome/GRCh37.primary_assembly.genome.fa"
refseq_sep = " "
gtfpath = gtf

path_to_gtf <- gtfpath
txdbanno <- GenomicFeatures::makeTxDbFromGFF(file = path_to_gtf,
                                             format = "gtf",
                                             dataSource = NA,
                                             organism = NA)

temp_sequences <- Biostrings::readDNAStringSet(fastapath, format = "fasta", use.names = TRUE)
if(length(refseq_sep) != 0){
  names(temp_sequences) <- tstrsplit(names(temp_sequences), refseq_sep, fixed = TRUE, keep = 1)[[1]]
}
exon <- suppressWarnings(GenomicFeatures::exonsBy(txdbanno, by = "tx", use.names = TRUE))
exon <- as.data.table(exon[unique(names(exon))])
sub_exon_plus <- exon[as.character(seqnames) %in% names(temp_sequences) & strand == "+"]
sub_exon_minus <- exon[as.character(seqnames) %in% names(temp_sequences) & strand == "-"
                       ][, new_end := Biostrings::width(temp_sequences[as.character(seqnames)]) - start + 1
                         ][, new_start := Biostrings::width(temp_sequences[as.character(seqnames)]) - end + 1]

seq_dt_plus <- sub_exon_plus[, nt_seq := "emp"
                             ][, nt_seq := as.character(Biostrings::subseq(temp_sequences[as.character(seqnames)],
                                                                           start = start,
                                                                           end = end))
                               ][, list(seq = paste(nt_seq, collapse = "")), by = group_name]

revcompl_temp_sequences <- Biostrings::reverseComplement(temp_sequences)
seq_dt_minus <- sub_exon_minus[, nt_seq := "emp"
                               ][, nt_seq := as.character(Biostrings::subseq(revcompl_temp_sequences[as.character(seqnames)],
                                                                             start = new_start,
                                                                             end = new_end))
                                 ][, list(seq = paste(nt_seq, collapse = "")), by = group_name]

sequences <- Biostrings::DNAStringSet(c(seq_dt_plus$seq, seq_dt_minus$seq))
names(sequences) <- c(unique(sub_exon_plus$group_name), unique(sub_exon_minus$group_name))
GKok commented 3 years ago

Here are the specific transcripts in annotation_dt

> annotation_dt['ENST00000265077.8_1']
            transcript  l_tr l_utr5 l_cds l_utr3
1: ENST00000265077.8_1 12345    286 10191   1868
> annotation_dt['ENST00000485591.1_1']
            transcript l_tr l_utr5 l_cds l_utr3
1: ENST00000485591.1_1  777      0     0      0

I checked if these two transcripts are present in another sample that doesn't throw an error, and they are indeed.

> reads_list[["FSF002_transcriptome"]] %>% filter (transcript == "ENST00000265077.8_1")
               transcript  end5 psite  end3 length cds_start cds_stop psite_from_start psite_from_stop psite_region
   1: ENST00000265077.8_1    43    NA    74     32       287    10477               NA              NA         5utr
   2: ENST00000265077.8_1    43    NA    74     32       287    10477               NA              NA         5utr
   3: ENST00000265077.8_1    70    NA    99     30       287    10477               NA              NA         5utr
   4: ENST00000265077.8_1    74    NA   100     27       287    10477               NA              NA         5utr
   5: ENST00000265077.8_1    76    NA   105     30       287    10477               NA              NA         5utr
  ---                                                                                                              
2873: ENST00000265077.8_1 10451    NA 10481     31       287    10477               NA              NA         5utr
2874: ENST00000265077.8_1 10451    NA 10482     32       287    10477               NA              NA         5utr
2875: ENST00000265077.8_1 10451    NA 10484     34       287    10477               NA              NA         5utr
2876: ENST00000265077.8_1 10456    NA 10485     30       287    10477               NA              NA         5utr
2877: ENST00000265077.8_1 11618    NA 11647     30       287    10477               NA              NA         5utr
> reads_list[["FSF002_transcriptome"]] %>% filter (transcript == "ENST00000485591.1_1")
               transcript end5 psite end3 length cds_start cds_stop psite_from_start psite_from_stop psite_region
   1: ENST00000485591.1_1  390    NA  410     21         0        0                0               0         <NA>
   2: ENST00000485591.1_1  390    NA  410     21         0        0                0               0         <NA>
   3: ENST00000485591.1_1  390    NA  418     29         0        0                0               0         <NA>
   4: ENST00000485591.1_1  390    NA  418     29         0        0                0               0         <NA>
   5: ENST00000485591.1_1  390    NA  419     30         0        0                0               0         <NA>
  ---                                                                                                            
1191: ENST00000485591.1_1  685    NA  710     26         0        0                0               0         <NA>
1192: ENST00000485591.1_1  686    NA  710     25         0        0                0               0         <NA>
1193: ENST00000485591.1_1  686    NA  710     25         0        0                0               0         <NA>
1194: ENST00000485591.1_1  688    NA  710     23         0        0                0               0         <NA>
1195: ENST00000485591.1_1  689    NA  710     22         0        0                0               0         <NA>

As for your code, it crashes with the following error (which is strange, because it is generic R and the manual is present). I'm trying to figure out what's wrong here.

> exon <- as.data.table(exon[unique(names(exon))])
Error in as.data.table(exon[unique(names(exon))]) : 
  could not find function "as.data.table"
fabiolauria commented 3 years ago

The error seems related to the data.table package not loaded in your script.

As for the chunks of reads_list, there is clearly something wrong, and it might be (not 100% sure) related to the issue you are facing. You have psites set to NA. This is not possible, if you correctly followed the pipeline. I closed an issue a couple of days ago about this. Here the beginning of the discussion you should read. Let me know if your case is similar to that.

GKok commented 3 years ago

Hi Fabio,

My apologies, to find the transcript data from reads_table, I just re-loaded it from bamtolist() and didn't follow the subsequent steps, as I thought none would interfere with the data in reads_table. When I do follow the rest of the steps, the psites are filled.

In the second transcript, however, the psite_region still has NAs, which makes sense to me.

> reads_list[["FSF002_transcriptome"]] %>% filter (transcript == "ENST00000265077.8_1")
               transcript  end5 psite  end3 length cds_start cds_stop psite_from_start psite_from_stop psite_region
   1: ENST00000265077.8_1    43    56    74     32       287    10477             -231          -10421         5utr
   2: ENST00000265077.8_1    43    56    74     32       287    10477             -231          -10421         5utr
   3: ENST00000265077.8_1    70    82    99     30       287    10477             -205          -10395         5utr
   4: ENST00000265077.8_1    74    86   100     27       287    10477             -201          -10391         5utr
   5: ENST00000265077.8_1    76    88   105     30       287    10477             -199          -10389         5utr
  ---                                                                                                              
2873: ENST00000265077.8_1 10451 10463 10481     31       287    10477            10176             -14          cds
2874: ENST00000265077.8_1 10451 10464 10482     32       287    10477            10177             -13          cds
2875: ENST00000265077.8_1 10451 10464 10484     34       287    10477            10177             -13          cds
2876: ENST00000265077.8_1 10456 10468 10485     30       287    10477            10181              -9          cds
2877: ENST00000265077.8_1 11618 11630 11647     30       287    10477            11343            1153         3utr
> reads_list[["FSF002_transcriptome"]] %>% filter (transcript == "ENST00000485591.1_1")
               transcript end5 psite end3 length cds_start cds_stop psite_from_start psite_from_stop psite_region
   1: ENST00000485591.1_1  390   402  410     21         0        0                0               0         <NA>
   2: ENST00000485591.1_1  390   402  410     21         0        0                0               0         <NA>
   3: ENST00000485591.1_1  390   402  418     29         0        0                0               0         <NA>
   4: ENST00000485591.1_1  390   402  418     29         0        0                0               0         <NA>
   5: ENST00000485591.1_1  390   402  419     30         0        0                0               0         <NA>
  ---                                                                                                            
1191: ENST00000485591.1_1  685   697  710     26         0        0                0               0         <NA>
1192: ENST00000485591.1_1  686   698  710     25         0        0                0               0         <NA>
1193: ENST00000485591.1_1  686   698  710     25         0        0                0               0         <NA>
1194: ENST00000485591.1_1  688   700  710     23         0        0                0               0         <NA>
1195: ENST00000485591.1_1  689   701  710     22         0        0                0               0         <NA>

The previous issue you refer to may be partially related, indeed. Initially, I had not added any filters yet, but calculated the psite for all reads, then wanted to use codon_usage_psite() on reads of all lenghts. That caused the errors.

psite_offset <- psite(reads_list, plot = FALSE)
reads_psite_list <- psite_info(reads_list, psite_offset)

However, when I do add length filters, e.g., 21-11 and 27-29, (and thereby exclude the reads that cause errors?), all samples run successfully. The rest of the pipeline and drawing the codon frequency plots then works fine.

Using your code, it seems the transcripts are respectively 12345 (!) and 777 nucleotides in length.

> sequences$ENST00000265077.8_1
12345-letter DNAString object
seq: GCATTTGAACTTGCAGGCGAGCTGCCCCGAGCCTTTCTGGGGAAGAACTCCAGGCGTGCGGACGC...TCTTTGTAACTTTTTATATCTGCTTTTGTTTCACCAAAGAAACCTAAAATCCTTCTTTTACTACA
> sequences$ENST00000485591.1_1
777-letter DNAString object
seq: ATCTCACTATGTTGCCCAGCCTGGTCTCTAACTCCTGGCCTCAAGCAGTCCTCCCTGCCTCGGCC...AGGGCCGTTTGCACCCCTCCTTCAGCCTCGGCCCAGGGGGTGTGGGTGCCTGACCGGGTGGGAGG
fabiolauria commented 3 years ago

In the second transcript, however, the psite_region still has NAs, which makes sense to me.

This make sense, since transcript ENST00000485591.1_1 has no annotated CDS and UTRs, so it is impossibile to assign a specific region.

However, when I do add length filters, e.g., 21-11 and 27-29, (and thereby exclude the reads that cause errors?), all samples run successfully. The rest of the pipeline and drawing the codon frequency plots then works fine.

This is somehow odd and I think it's a coincidence that removing specific lengths you solve the issue. The only way I can check this is for you to send me the data table throwing errors, the annotation data table, the FASTA file and the GTF you passed to codon_usage_psite(). It might be heavy, but at least I can run the whole function one line at a time and see where the problem pops out. Unless you yourself want to do this by taking advantage of the code available in GitHub.

Using your code, it seems the transcripts are respectively 12345 (!) and 777 nucleotides in length.

Okay so they are in line with the annotation data table and the position of the P-sites is not out of bounds.

GKok commented 3 years ago

I will share the files with you, debugging the source code is beyond my skill. I'm grateful you're willing to help to this extent.

I have prepared the script that throws the errors below. Please find all necessary files here.

library(devtools)
#install_github("LabTranslationalArchitectomics/riboWaltz", dependencies = TRUE)

library("riboWaltz")
library("Rcpp")
library("GenomicFeatures")

dir_bam <- "Data 2018/Aligned/hg19-STAR/transcriptome"
dir_wd <- "Analysis/R"
gtf <- "Data 2018/Genome/gencode.v38lift37.annotation.gtf"

setwd(dir_wd)

samples <- c("FSF001_transcriptome",
             "FSF002_transcriptome",
             "FSF003_transcriptome")

# Load GTF annotation
annotation_dt <- create_annotation(gtfpath = gtf)

# Load BAM files, include GTF annotation
reads_list <- bamtolist(bamfolder = dir_bam, annotation = annotation_dt, transcript_align = FALSE)

# Visualize read length distr
rlength_distr(reads_list, sample = samples)

# Filter for duplicates - IS THIS EVEN NECCESSARY??
filtered_list <- duplicates_filter(data = reads_list,
                                   extremity = "both") # Reads are duplicates if they map on the same transcript and share both the 5' estremity and the 3' extremity.

# P-site offset
psite_offset <- psite(filtered_list, plot = FALSE)
reads_psite_list <- psite_info(filtered_list, psite_offset)

# Plots
## P-sites per region
psite_region <- region_psite(reads_psite_list, 
                             annotation_dt, 
                             sample = samples)
psite_region[["plot"]]

## Trinucleotide periodicity
frames <- frame_psite(reads_psite_list, 
                      sample = samples, 
                      region = "all")
frames[["plot"]]

## Metaplots

metaprofile_split <- metaprofile_psite(filtered_list, 
                                       annotation_dt, 
                                       sample = samples,
                                       multisamples = "average", 
                                       plot_style = "split",
                                       utr5l = 20, cdsl = 40, utr3l = 20,
                                       frequency = TRUE, 
                                       plot_title = "transcript",
                                       colour = c("aquamarine4", "gray70"))
metaprofile_split[["plot"]]

## Codon usage
cu_barplot_FSF001 <- codon_usage_psite(reads_psite_list, 
                                       annotation_dt, 
                                       site = "asite", 
                                       sample = c("FSF001_transcriptome"),
                                       fastapath = "Data 2018/Genome/GRCh37.primary_assembly.genome.fa",
                                       fasta_genome = TRUE,
                                       refseq_sep = " ",
                                       gtfpath = gtf,
                                       frequency_normalization = FALSE,
                                       label_aminoacid = TRUE
) 
# Error in .Call2("C_solve_user_SEW", refwidths, start, end, width, translate.negative.coord,  : 
#solving row 639386: 'allow.nonnarrowing' is FALSE and the supplied end (318) is > refwidth

cu_barplot_FSF002 <- codon_usage_psite(reads_psite_list, 
                                       annotation_dt, 
                                       site = "asite", 
                                       sample = c("FSF002_transcriptome"),
                                       fastapath = "Data 2018/Genome/GRCh37.primary_assembly.genome.fa",
                                       fasta_genome = TRUE,
                                       refseq_sep = " ",
                                       gtfpath = gtf,
                                       frequency_normalization = FALSE,
                                       label_aminoacid = TRUE
) 
# No error
cu_barplot_FSF002[["plot"]]

cu_barplot_FSF003 <- codon_usage_psite(reads_psite_list, 
                                       annotation_dt, 
                                       site = "asite", 
                                       sample = c("FSF003_transcriptome"),
                                       fastapath = "Data 2018/Genome/GRCh37.primary_assembly.genome.fa",
                                       fasta_genome = TRUE,
                                       refseq_sep = " ",
                                       gtfpath = gtf,
                                       frequency_normalization = FALSE,
                                       label_aminoacid = TRUE
) 
#Error in .Call2("C_solve_user_SEW", refwidths, start, end, width, translate.negative.coord,  : 
#solving row 5611044: 'allow.nonnarrowing' is FALSE and the supplied start (974) is > refwidth + 1
fabiolauria commented 3 years ago

Thank you for the extensive report of your script. I'll take a look as soon as possible.

Fabio

fabiolauria commented 3 years ago

Okay, I got it. I tested the function with "FSF001_transcriptome", but I guess the problem is the same for "FSF003_transcriptome". Actually, you were almost right by suggesting that the length of the reads might be the problem. Specifically, it is a combination of three factors:

  1. your data includes reads of 14 nts. (Comment not related to the issue: I didn't notice this before, but I must say 14 nts are too few to be representative of eukaryotic ribosomes);
  2. the optimal offset is 12 nts from the 5' end of the reads. Now, given i) the low amount of reads of 14 nts (0.023%) and ii) the subsequent inability of riboWaltz to compute an offset whatsoever for these reads, the correction step assigns to them the optimal offset by default. This means that the firts nucleotide of the P-site correspond to the second to last nucleotide of the read. You can verify this from the first line and last two columns of _psiteoffset
length total_percentage start_percentage around_start offset_from_5 offset_from_3 corrected_offset_from_5 corrected_offset_from_3
14            0.183            0.023            T            NA            NA                      12                       1

and the line of reads_psite_list[["FSF001_transcriptome"]] throwing the error:

transcript end5 psite end3 length cds_start cds_stop psite_from_start psite_from_stop psite_region
ENST00000493816.1_1  301   313  314     14         0        0                0               0         <NA>

where the first nucleotide of the P-site is at position 313 and the 3' extremity of the read (end3) is at position 314. Thus, the position of the three nucleotides of the P-site would be 313, 314 and 315 and for the A-site 316, 317 and 318. We can agree that biologically speaking this makes no sense at all because they are outside the footprint of the ribosome. But computationally speaking this would be fine, if not for the third point:

3) some of the 14nts-long reads map at the very end of the transcript sequence. For example, the length of the transcript associated to the error above is 315 nts. Thus, when riboWaltz tries to extract the tree A-site nucleotides for that specific read obviously it returns an error, because their position is outside the 3' limit of the transcript.

To be honest, I never faced this issue, possibly because I usually keep reads longer than 19 nts since biologically speaking it doesn't make sense to include shorter reads. But this also means that the probability to find the P-site too close to the extremity of the read is very low, if not null. But given this experience, I will think about additional controls to avoid the same problem in the future.

Meanwhile, I think you should be able to run everything just removing the 14nts-long reads, at least for "FSF001_transcriptome". For "FSF003_transcriptome" look at its rows in _psiteoffset and check if the problem might be the same and if there are other lengths showing the same behaviour.

Let me know. If everything is fine I will close the issue.

Best Fabio

GKok commented 3 years ago

Thank you, Fabio, for investigating the issue.

In none of the prior steps did I filter for lengths of the RPFs, as I didn't deem this necessary. This may have caused some noise reads mapped to the transcriptome (e.g., of length 14 nt). In the final analyses I was planning to filter for lenghts 21-22 and 27-29 anyways, so I didn't deem this necessary. As an initial check for quality of all data I wanted to run the function on the full dataset, when I encountered the error. By then I hadn't tried filtering for lengths yet.

I'm glad you've figured out the issue and happy to know it's very easily bypassed by filtering for the real lengths of RPFs.

This issue and/or a short notice in the manual would probably prevent others from encountering this in the future :-)

Best,

Gautam