LabTranslationalArchitectomics / riboWaltz

optimization of ribosome P-site positioning in ribosome profiling data
MIT License
43 stars 10 forks source link

psite_info error #39

Closed yeroslaviz closed 3 years ago

yeroslaviz commented 3 years ago

Hi,

Using my own RiboSeq data I'm trying to work out the ribowaltz package. I am trying to add some information to the psites, but keep getting an error. As far as I can figure the error, it is not really directly related to ribowaltz, but google didn't help me hear, so I'm trying it with you.

this is the command I run:

reads_psite_list <- psite_info(data = reads_list, offset = psite_offset, site = c("psite", "asite", "esite"), fastapath = "DATA/genome/c_elegans.PRJNA13758.WS279.mRNA_transcripts.fa",fasta_genome = FALSE, refseq_sep = " " )

But I get this error each time:

processing IPA.Aligned.toTranscriptome.out.sorted
1. adding p-site position
2. adding transcript region
3. adding nucleotide sequence(s)
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'subseq': subscript contains invalid names

This has something to do with the search for the extra esite and asite, as running the command without them, results in no errors.

thanks for the help.

Assa

fabiolauria commented 3 years ago

Hi Assa, thank for reporting the issue. I'll go through the script and get back to you as soon I figure out what is going on.

Fabio

yeroslaviz commented 3 years ago

Hi Fabio,

thanks for the fast response.

Also I get another problem, when trying to create my own annotations table with a gff2/gff3 files. [If you prefer to open a new issue, let me know]

The annotations_table I used before was given to me from a college, but i want to better understand how the table is done, so i have downloaded the gff files

But when running the create_annotation() command I get this error:

> annot_dt <- create_annotation(gtfpath = "DATA/genome/c_elegans.PRJNA13758.WS279.annotations.gff2.gz", )
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... Error in .extract_genes_from_gff3_GRanges(gene_IDX, tx_IDX, mcols0$ID,  : 
  some genes have no "ID" attribute
In addition: Warning messages:
1: In .get_cds_IDX(mcols0$type, mcols0$phase) :
  some CDS phases are missing or not between 0 and 2
2: In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type coding_exon. This information was ignored.
3: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
  some transcripts have no "Name" attribute ==> their name ("tx_name" column in the TxDb object) was set to NA
4: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
  the transcript names ("tx_name" column in the TxDb object) imported from the "Name" attribute are not unique

I am not sure if this is something C. elegans specific, or not.

thanks again

Assa

fabiolauria commented 3 years ago

Hi Assa, I checked the code and run it with my own data. It worked, but I was able to reproduce the error in your first issue. Basically, it seems there is a mismatch between the name of the sequences in your FASTA file and the name of the transcripts in _readslist. For example, if you have "Actin" in one file and "Actin|transciript_info1|transciript_info2" or "Actin Actin001" in the other, you receive an error because the function subseq from Biostring doesn't know which sequence to cut. Usually, the problem is in the FASTA file. I noticed you specified the _refseqsep parameter, but are you sure everything is okay with that? You can run the following lines to look for potential mismatches:

library(data.table)
library(Biostrings)
fasta_sequences <- readDNAStringSet(PATH_TO_FASTA, format = "fasta", use.names = TRUE)
names(fasta_sequences) <- tstrsplit(names(fasta_sequences), REFSEQ_SEP, fixed = TRUE, keep = 1)[[1]]

transcript_names <- unique(reads_list[[1]]$transcript)

length(intersect(transcript_names, names(fasta_sequences)))
length(setdiff(transcript_names, names(fasta_sequences)))
length(setdiff(names(fasta_sequences), transcript_names))

The intersection should return the biggest set of gene names. If not, something is wrong.

As for the second issue, I'm not sure about the possible explanation. Most likely _createannotation doesn't like GFF. Should be similar to the GTF but the core function of _createannotation (that is GenomicFeatures::makeTxDbFromGFF(file = PATH_TO_GTF, format = "gtf", dataSource = dataSource, organism = organism)) expects a GTF format file (because of the format parameter set to "gtf"). Or maybe your GFF lacks of specific fields. To understand what is going on I would suggest to try with the GTF and, if it doesn't work, read the makeTxDbFromGFF documentation or inspect the GTF and GFF files.

I hope this can help you. Keep me updated.

Best Fabio

yeroslaviz commented 3 years ago

Hi Fabio,

thanks again.

Is there a reason to hard-code the "gtf" format? I have created a second function for the annotations, but only just changed the "gtf" to "auto" and than tested it with my gff file.

It worked, but I got a few warnings.

> test <- create_annotation.auto(gtfpath = "DATA/genome/c_elegans.PRJNA13758.WS279.annotations.gff3")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: In .get_cds_IDX(mcols0$type, mcols0$phase) :
  some CDS phases are missing or not between 0 and 2
2: In .extract_exons_from_GRanges(exon_IDX, gr, mcols0, tx_IDX, feature = "exon",  :
  158036 exons couldn't be linked to a transcript so were dropped (showing only the first 6):
  seqid start   end strand   ID Name                    Parent Parent_type
1     I  3782  3904      - <NA> <NA> Transcript:Y74C9A.6:wp166        <NA>
2     I 28280 28405      - <NA> <NA>        CDS:Y74C9A.5:wp159        <NA>
3     I 28280 28405      - <NA> <NA>        CDS:Y74C9A.5:wp154        <NA>
4     I 28916 29026      - <NA> <NA>        CDS:Y74C9A.5:wp154        <NA>
5     I 28916 29026      - <NA> <NA>        CDS:Y74C9A.5:wp159        <NA>
6     I 29100 29367      - <NA> <NA>        CDS:Y74C9A.5:wp154        <NA>
3: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
  the transcript names ("tx_name" column in the TxDb object) imported from the "Name" attribute are not unique

These errors are explaniable, as my gff file doesn't contain any of these annotations:

##gff-version 3
##sequence-region I 1 15072434
##sequence-region II 1 15279421
##sequence-region III 1 13783801
##sequence-region IV 1 17493829
##sequence-region V 1 20924180
##sequence-region X 1 17718942
##sequence-region MtDNA 1 13794
I       BLAT_Nanopore_OTHER     expressed_sequence_match        1       4       92.1    -       .       ID=L2_Nanopore_Roach_1889.24;Target=L2_Nanopore_Roach_1889 358 361 +;own_sequence=TRUE
I       BLAT_Nanopore_OTHER     expressed_sequence_match        5       60      92.1    -       .       ID=L2_Nanopore_Roach_1889.24;Target=L2_Nanopore_Roach_1889 301 356 +;own_sequence=TRUE
I       BLAT_Nanopore_OTHER     expressed_sequence_match        63      69      92.1    -       .       ID=L2_Nanopore_Roach_1889.24;Target=L2_Nanopore_Roach_1889 294 300 +;own_sequence=TRUE
I       BLAT_Nanopore_OTHER     expressed_sequence_match        70      140     92.1    -       .       ID=L2_Nanopore_Roach_1889.24;Target=L2_Nanopore_Roach_1889 206 276 +;own_sequence=TRUE
I       BLAT_Nanopore_OTHER     expressed_sequence_match        164     244     92.1    -       .       ID=L2_Nanopore_Roach_1889.24;Target=L2_Nanopore_Roach_1889 125 205 +;own_sequence=TRUE
I       BLAT_Nanopore_OTHER     expressed_sequence_match        250     254     92.1    -       .       ID=L2_Nanopore_Roach_1889.24;Target=L2_Nanopore_Roach_1889 119 123 +;own_sequence=TRUE
I       BLAT_Nanopore_OTHER     expressed_sequence_match        255     269     92.1    -       .       ID=L2_Nanopore_Roach_1889.24;Target=L2_Nanopore_Roach_1889 102 116 +;own_sequence=TRUE
I       BLAT_Nanopore_OTHER     expressed_sequence_match        271     350     92.1    -       .       ID=L2_Nanopore_Roach_1889.24;Target=L2_Nanopore_Roach_1889 22 101 +;own_sequence=TRUE
I       BLAT_Nanopore_OTHER     expressed_sequence_match        645     655     92.1    -       .       ID=L2_Nanopore_Roach_1889.24;Target=L2_Nanopore_Roach_1889 11 21 +;own_sequence=TRUE
I       BLAT_Nanopore_OTHER     expressed_sequence_match        657     666     92.1    -       .       ID=L2_Nanopore_Roach_1889.24;Target=L2_Nanopore_Roach_1889 1 10 +;own_sequence=TRUE
I       BLAT_Nanopore_OTHER     expressed_sequence_match        1       6       10.7    -       .       ID=L3_Nanopore_Roach_70374.46;Target=L3_Nanopore_Roach_70374 344 349 +;own_sequence=TRUE
I       BLAT_Nanopore_OTHER     expressed_sequence_match        7       99      10.7    -       .       ID=L3_Nanopore_Roach_70374.46;Target=L3_Nanopore_Roach_70374 245 337 +;own_sequence=TRUE
I       BLAT_Nanopore_OTHER     expressed_sequence_match        1       6       11.3    -       .       ID=L3_Nanopore_Roach_33306.57;Target=L3_Nanopore_Roach_33306 350 355 +;own_sequence=TRUE
I       BLAT_Nanopore_OTHER     expressed_sequence_match        7       105     11.3    -       .       ID=L3_Nanopore_Roach_33306.57;Target=L3_Nanopore_Roach_33306 245 343 +;own_sequence=TRUE
I       BLAT_Nanopore_OTHER     expressed_sequence_match        1       6       8.1     +       .       ID=young_adult_Nanopore_Roach_81465.60;Target=young_adult_Nanopore_Roach_81465 165 170 +;own_sequence=TRUE
I       BLAT_Nanopore_OTHER     expressed_sequence_match        7       99      8.1     +       .       ID=young_adult_Nanopore_Roach_81465.60;Target=young_adult_Nanopore_Roach_81465 177 269 +;own_sequence=TRUE
I       BLAT_Nanopore_OTHER     expressed_sequence_match        1       10      6.2     -       .       ID=L1_Nanopore_Li_127601.19;Target=L1_Nanopore_Li_127601 828 837 +;own_sequence=TRUE
I       BLAT_Nanopore_OTHER     expressed_sequence_match        11      25      6.2     -       .       ID=L1_Nanopore_Li_127601.19;Target=L1_Nanopore_Li_127601 783 797 +;own_sequence=TRUE
I       BLAT_Nanopore_OTHER     expressed_sequence_match        26      42      6.2     -       .       ID=L1_Nanopore_Li_127601.19;Target=L1_Nanopore_Li_127601 760 776 +;own_sequence=TRUE
I       BLAT_Nanopore_OTHER     expressed_sequence_match        216     285     6.2     -       .       ID=L1_Nanopore_Li_127601.19;Target=L1_Nanopore_Li_127601 661 730 +;own_sequence=TRUE
I       BLAT_Nanopore_OTHER     expressed_sequence_match        1       16      7.7     +       .       ID=L1_Nanopore_Roach_71248.76;Target=L1_Nanopore_Roach_71248 692 707 +;own_sequence=TRUE
I       BLAT_Nanopore_OTHER     expressed_sequence_match        17      72      7.7     +       .       ID=L1_Nanopore_Roach_71248.76;Target=L1_Nanopore_Roach_71248 714 769 +;own_sequence=TRUE

But at least it seems to be to work with the "auto" parameter. Is it possible to update this? or did you test it beforehand?

About the First problem. This is what I get when running the three last commands:

> length(intersect(transcript_names, names(fasta_sequences)))
[1] 7237
> length(setdiff(transcript_names, names(fasta_sequences)))
[1] 1177
> length(setdiff(names(fasta_sequences), transcript_names))
[1] 24531

So If I understand it correctly, I have 7237 common names, 1177 names unique to my transcript names and 24541(!) names unique to my fasta file.

Looking at the names, they are not different,but they are just not there.

x <- list( "transcript_names"            = as.character(transcript_names),
       "fasta_sequences"       = names(fasta_sequences) )

>jnk <- VennDiagram::get.venn.partitions(x)
> str(jnk$..values..)
List of 3
$ 1: chr [1:7237] "F15H9.2.1" "F14B6.4.1" "C52E2.7.1" "F44F4.4.1" ...
$ 2: chr [1:24531] "2L52.1a.1" "2L52.1b.1" "2RSSE.1a.1" "2RSSE.1b.1" ...
$ 3: chr [1:1177] "F38E1.12" "W06G6.13" "C53D5.5a.2" "F32B5.7b.2" ...

I can't figure out the error though. Why doesn't it work for the ~7k common transcripts?

thanks Assa

fabiolauria commented 3 years ago

Hi Assa.

Is there a reason to hard-code the "gtf" format?

Is it possible to update this? or did you test it beforehand?

Actually there is no reason I can remember as why I decided to force the use of the GTF format. I might have considered different options and I decided to stick with the GTF. But if _createannotation works with the GFF as well, I can surely update it. I'll test it with multiple files, just to be sure, and than upload the new version on github.

About the first issue, riboWaltz doesn't work because it tries to extract nucleotide triplets for ALL the transcripts listed in reads_list. However, since 1177 are not present in your FASTA file, it fails and stops. Since it looks like the transcript names reported in the FASTA and in reads_list are of the same format, I can only hypothesize reads_list comes from BAM files based on different FASTA sequences. I might be wrong, but I think in a different scenario it would been impossible to have even a single transcript associated to mapping reads (i.e. listed in reads_list) but not present in the FASTA (including the reference sequences for the alignment). This is the reason why I always recommend to use in riboWaltz the very same files used for the alignments. Otherwise it's so easy to get stuck because of mismatches.
The opposite situation (about the 24541 transcripts in the FASTA but not in _readslist) happens more often, e.g. according to the depth of the sequencing.

Fabio

yeroslaviz commented 3 years ago

thanks, that make sense. I think.

mimonti commented 3 months ago

Thanks for this help Fabio! Just in case it helps someone, I was having this issue for exactly the reasons Fabio listed above- some transcript ID's were missing in my transcript cDNA file (i.e. 'gencode.vM34.transcripts.fa'). Strangely, these ID were present in the gtf file ('gencode.vM34.primary_assembly.annotation.gtf') and I used the same files for ribowaltz.

I quickly checked some of them manually and they are all 'novel transcripts', so perhaps they were not included in the transcriptome? Anyway, a simple filter allowed me to pass to the function. It was no more than 4-5 transcripts per library.