Closed zzo-joe closed 1 year ago
Hi Daeho, thank you for using riboWaltz.
According to what you wrote, I guess the reference FASTA used in your alignment is made of chromosome sequences, not transcript sequences. Otherwise, the STAR option -quantMode TranscriptomeSAM wouldn't make much sense because the resulting BAM would have already been based on transcript coordinates.
If it's like this, you shouldn't have passed a transcript-based FASTA to _codon_usagepsite, which might lead to mismatches in the name of the sequences and to the error you reported. Instead, you should pass the very same FASTA (chromosome-based) you used in the alignment and set _fastagenome = TRUE .
On the other hand, if you actually aligned on transcript-based FASTA, I would use the "original BAM" (= not the one returned by -quantMode TranscriptomeSAM) from the beginning. This way you are sure no mismatch will arise (except for potential issues related to characters which should be removed through the _refseqsep parameter).
Let me know how it goes.
Best Fabio
Thank you for your kind answer.
Following your suggestion, I have changed "fasta_genome = TRUE" and also added gtfpath.
Still, same error message is showing up..
Here is modified commands.
cu_barplot <- codon_usage_psite(reads_psite_list,
annotation_table,
sample = "RPF2",
fastapath = "~/Documents/reference/GRCh38.p14.genome.fa",
fasta_genome = TRUE,
gtfpath = "~/Documents/reference/gencode.v43.chr_patch_hapl_scaff.annotation.gtf",
frequency_normalization = FALSE)
For STAR commands, I used those fasta and gtf files for generating index (--runMode genomeGenerate) and mapped reads with following commands.
STAR --genomeDir /mnt/NAS/reference/STAR_gencode_hg38_v43 \
--runThreadN 16 \
--readFilesCommand zcat \
--readFilesIn $PWD/${i}.fastq.gz \
--outFileNamePrefix /mnt/NAS/primary_output/${i} \
--twopassMode Basic \
--outSAMtype BAM SortedByCoordinate \
--quantMode TranscriptomeSAM
Thanks, Daeho
Hi Daeho. from what I see there is at least one potential issue in your code. The best way to minimize errors popping up because of discrepancies across GTF and FASTA is to get them from the same version of the annotation. Here, you are using _gencode.v43.chr_patch_haplscaff.annotation.gtf from GENCODE human release 43 and GRCh38.p14.genome.fa from GENCODE human release 44. This might cause your error.
If that is not the case, to understand what is going on I would need to look to a couple of sequence names from the FASTA file as well as the first rows of the GTF, of one element of reads_psite_list and of annotation_dt (as in Issue #69). BTW, keep in mind the _refseqsep parameter I mentioned above, which is quite often the culprit for many similar issues.
Best Fabio
Sorry for the late reply. I am currently going thorough reference files saved in the workstation of our lab. So far, I am pretty sure that I have used gencode v43 for alignment, but not sure why gencode v44 file was included in the folder.. I will re-try with gencode v43 reference directly from the source and post the result soon.
I have changed reference both gtf file and fasta file for v43 and v44 respectively, but got the same error.
> ## Codon usage (E)
> cu_barplot <- codon_usage_psite(reads_psite_list,
+ annotation_table,
+ sample = "RPF2",
+ fastapath = "~/Documents/reference/GRCh38.p13.genome.fa", ## p14
+ fasta_genome = TRUE,
+ gtfpath = "~/Documents/reference/gencode.v43.chr_patch_hapl_scaff.annotation.gtf", ## v44
+ frequency_normalization = FALSE)
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'subseq': subscript contains invalid names
For some row in fasta file
$ cat GRCh38.p13.genome.fa| grep ">" | less
>chr1 1
>chr2 2
>chr3 3
>chr4 4
>chr5 5
>chr6 6
>chr7 7
>chr8 8
~~~
>chrX X
>chrY Y
>chrM MT
>GL000008.2 GL000008.2
>GL000009.2 GL000009.2
>GL000194.1 GL000194.1
>GL000195.1 GL000195.1
>GL000205.2 GL000205.2
>GL000208.1 GL000208.1
>GL000213.1 GL000213.1
>GL000214.1 GL000214.1
>GL000216.2 GL000216.2
For gtf file
$cat gencode.v43.chr_patch_hapl_scaff.annotation.gtf| less
##description: evidence-based annotation of the human genome (GRCh38), version 43 (Ensembl 109)
##provider: GENCODE
##contact: gencode-help@ebi.ac.uk
##format: gtf
##date: 2022-11-29
chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000290825.1"; gene_type "lncRNA"; gene_name "DDX11L2"; level 2; tag "overlaps_pseudogene";
chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000290825.1"; transcript_id "ENST00000456328.2"; gene_type "lncRNA"; gene_name "DDX11L2"; transcript_type "lncRNA"; transcript_name "DDX11L2-202"; level 2; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA exon 11869 12227 . + . gene_id "ENSG00000290825.1"; transcript_id "ENST00000456328.2"; gene_type "lncRNA"; gene_name "DDX11L2"; transcript_type "lncRNA"; transcript_name "DDX11L2-202"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA exon 12613 12721 . + . gene_id "ENSG00000290825.1"; transcript_id "ENST00000456328.2"; gene_type "lncRNA"; gene_name "DDX11L2"; transcript_type "lncRNA"; transcript_name "DDX11L2-202"; exon_number 2; exon_id "ENSE00003582793.1"; level 2; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical"; havana_transcript "OTTHUMT00000362751.1";
Some lines for read_psite_list
> reads_psite_list$RPF2
transcript end5 psite end3 length cds_start cds_stop
1: ENST00000000233.10 85 97 115 31 89 631
2: ENST00000000233.10 106 118 135 30 89 631
3: ENST00000000233.10 106 118 135 30 89 631
4: ENST00000000233.10 106 118 135 30 89 631
5: ENST00000000233.10 106 118 135 30 89 631
---
77987585: ENST00000707138.1 1133 1145 1163 31 1 1536
77987586: ENST00000707138.1 1133 1145 1163 31 1 1536
77987587: ENST00000707138.1 1163 1175 1193 31 1 1536
77987588: ENST00000707138.1 1166 1178 1196 31 1 1536
77987589: ENST00000707138.1 1498 1510 1528 31 1 1536
psite_from_start psite_from_stop psite_region
1: 8 -534 cds
2: 29 -513 cds
3: 29 -513 cds
4: 29 -513 cds
5: 29 -513 cds
---
Some first lines of annotation table
> head(annotation_table)
transcript l_tr l_utr5 l_cds l_utr3
1: ENST00000000233.10 1032 88 543 401
2: ENST00000000412.8 2450 159 834 1457
3: ENST00000000442.11 2274 225 1272 777
4: ENST00000001008.6 3715 170 1380 2165
5: ENST00000001146.7 4556 28 1539 2989
6: ENST00000002125.9 2184 48 1326 810
Thank you so much for your attention!! Daeho
Hi Daeho,
the problem might be that for some reason you didn't take into account the parameter _refseqsep. As I wrote in my previous message, sometimes there are potential issues related to characters which should be removed through the _refseq_sep_ parameter
and the _refseq_sep_ parameter is quite often the culprit for many similar issues.
According to your files, chromosomes are named "chr1 1", "chr2 2" etc. in your FASTA file and "chr1", "chr2" etc. in your GTF. So to discard everything which is redundant in the FASTA file (after the blank space), you should split the string by setting _refseqsep = " " and then keep the first element (this is what refseq_sep is meant for).
In any case, it's not clear to me if you have re-aligned everything with the matching GTF and FASTA file based on the very same annotation or if you just changed the files in _codon_usagepsite. If the latter is the case, I still expect issues because of different files used for the alignment step and in riboWaltz (and, in general, it's not a good practice to use non-matching annotation when mapping the reads).
Best Fabio
Hello Fabio, Sorry for the late reply. I was kind of lost with managing references, but finally solved the problem. "split the string by setting refseq_sep = " " " was the key to resolve my problem.
Thank you again for your kind solutions!
Best Daeho
Hello I am currently trying to generate a graph using command "codon_usage_psite" with following commands.
I have aligned my sequences using STAR (v 2.7.10b) with its option -quantMode TranscriptomeSAM through gencode GRCh38v43 reference.
I generated annotation_table using the same annotation file used while alignments, with following commands.
I've got some warning messages.
Following those steps above, when I run the commands below, I get errors..
I have already checked issue 39 and issue 67, but cannot find a way to solve my problem.
I am using riboWalts v1.2.0 and R version 4.2.3 in platform aarch64-apple (M1 mac).
Could you please help me??
Thank you.
Daeho