Questions about codon_usage_psite

zzo-joe commented 1 year ago

Hello I am currently trying to generate a graph using command "codon_usage_psite" with following commands.

I have aligned my sequences using STAR (v 2.7.10b) with its option -quantMode TranscriptomeSAM through gencode GRCh38v43 reference.

I generated annotation_table using the same annotation file used while alignments, with following commands.

annotation_table <- create_annotation(gtfpath = "~/Documents/reference/gencode.v43.chr_patch_hapl_scaff.annotation.gtf")

I've got some warning messages.

Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'subseq': subscript contains invalid names

Following those steps above, when I run the commands below, I get errors..

cu_barplot <- codon_usage_psite(reads_psite_list,
                                annotation_table,
                                sample = "RPF2",
                                fastapath = "~/Documents/reference/gencode.v43.transcripts.fa",
                                fasta_genome = FALSE,
                                frequency_normalization = FALSE)

Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'subseq': subscript contains invalid names

I have already checked issue 39 and issue 67, but cannot find a way to solve my problem.

I am using riboWalts v1.2.0 and R version 4.2.3 in platform aarch64-apple (M1 mac).

Could you please help me??

Thank you.

Daeho

fabiolauria commented 1 year ago

Hi Daeho, thank you for using riboWaltz.

According to what you wrote, I guess the reference FASTA used in your alignment is made of chromosome sequences, not transcript sequences. Otherwise, the STAR option -quantMode TranscriptomeSAM wouldn't make much sense because the resulting BAM would have already been based on transcript coordinates.

If it's like this, you shouldn't have passed a transcript-based FASTA to _codon_usagepsite, which might lead to mismatches in the name of the sequences and to the error you reported. Instead, you should pass the very same FASTA (chromosome-based) you used in the alignment and set _fastagenome = TRUE .

On the other hand, if you actually aligned on transcript-based FASTA, I would use the "original BAM" (= not the one returned by -quantMode TranscriptomeSAM) from the beginning. This way you are sure no mismatch will arise (except for potential issues related to characters which should be removed through the _refseqsep parameter).

Let me know how it goes.

Best Fabio

zzo-joe commented 1 year ago

Thank you for your kind answer.

Following your suggestion, I have changed "fasta_genome = TRUE" and also added gtfpath.

Still, same error message is showing up..

Here is modified commands.

cu_barplot <- codon_usage_psite(reads_psite_list,
                                annotation_table,
                                sample = "RPF2",
                                fastapath = "~/Documents/reference/GRCh38.p14.genome.fa",
                                fasta_genome = TRUE,
                                gtfpath = "~/Documents/reference/gencode.v43.chr_patch_hapl_scaff.annotation.gtf",
                                frequency_normalization = FALSE)

For STAR commands, I used those fasta and gtf files for generating index (--runMode genomeGenerate) and mapped reads with following commands.

    STAR --genomeDir /mnt/NAS/reference/STAR_gencode_hg38_v43 \
         --runThreadN 16 \
         --readFilesCommand zcat \
         --readFilesIn $PWD/${i}.fastq.gz  \
         --outFileNamePrefix /mnt/NAS/primary_output/${i} \
         --twopassMode Basic \
         --outSAMtype BAM SortedByCoordinate \
         --quantMode TranscriptomeSAM

Thanks, Daeho

fabiolauria commented 1 year ago

Hi Daeho. from what I see there is at least one potential issue in your code. The best way to minimize errors popping up because of discrepancies across GTF and FASTA is to get them from the same version of the annotation. Here, you are using _gencode.v43.chr_patch_haplscaff.annotation.gtf from GENCODE human release 43 and GRCh38.p14.genome.fa from GENCODE human release 44. This might cause your error.

If that is not the case, to understand what is going on I would need to look to a couple of sequence names from the FASTA file as well as the first rows of the GTF, of one element of reads_psite_list and of annotation_dt (as in Issue #69). BTW, keep in mind the _refseqsep parameter I mentioned above, which is quite often the culprit for many similar issues.

Best Fabio

zzo-joe commented 1 year ago

Sorry for the late reply. I am currently going thorough reference files saved in the workstation of our lab. So far, I am pretty sure that I have used gencode v43 for alignment, but not sure why gencode v44 file was included in the folder.. I will re-try with gencode v43 reference directly from the source and post the result soon.

I have changed reference both gtf file and fasta file for v43 and v44 respectively, but got the same error.

> ## Codon usage (E)
> cu_barplot <- codon_usage_psite(reads_psite_list,
+                                 annotation_table,
+                                 sample = "RPF2",
+                                 fastapath = "~/Documents/reference/GRCh38.p13.genome.fa", ## p14
+                                 fasta_genome = TRUE,
+                                 gtfpath = "~/Documents/reference/gencode.v43.chr_patch_hapl_scaff.annotation.gtf", ## v44
+                                 frequency_normalization = FALSE)
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'subseq': subscript contains invalid names

For some row in fasta file

$ cat GRCh38.p13.genome.fa| grep ">" | less

>chr1 1
>chr2 2
>chr3 3
>chr4 4
>chr5 5
>chr6 6
>chr7 7
>chr8 8

~~~
>chrX X
>chrY Y
>chrM MT
>GL000008.2 GL000008.2
>GL000009.2 GL000009.2
>GL000194.1 GL000194.1
>GL000195.1 GL000195.1
>GL000205.2 GL000205.2
>GL000208.1 GL000208.1
>GL000213.1 GL000213.1
>GL000214.1 GL000214.1
>GL000216.2 GL000216.2

For gtf file

$cat gencode.v43.chr_patch_hapl_scaff.annotation.gtf| less

##description: evidence-based annotation of the human genome (GRCh38), version 43 (Ensembl 109)
##provider: GENCODE
##contact: gencode-help@ebi.ac.uk
##format: gtf
##date: 2022-11-29
chr1    HAVANA  gene    11869   14409   .       +       .       gene_id "ENSG00000290825.1"; gene_type "lncRNA"; gene_name "DDX11L2"; level 2; tag "overlaps_pseudogene";
chr1    HAVANA  transcript      11869   14409   .       +       .       gene_id "ENSG00000290825.1"; transcript_id "ENST00000456328.2"; gene_type "lncRNA"; gene_name "DDX11L2"; transcript_type "lncRNA"; transcript_name "DDX11L2-202"; level 2; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    11869   12227   .       +       .       gene_id "ENSG00000290825.1"; transcript_id "ENST00000456328.2"; gene_type "lncRNA"; gene_name "DDX11L2"; transcript_type "lncRNA"; transcript_name "DDX11L2-202"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    12613   12721   .       +       .       gene_id "ENSG00000290825.1"; transcript_id "ENST00000456328.2"; gene_type "lncRNA"; gene_name "DDX11L2"; transcript_type "lncRNA"; transcript_name "DDX11L2-202"; exon_number 2; exon_id "ENSE00003582793.1"; level 2; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical"; havana_transcript "OTTHUMT00000362751.1";

Some lines for read_psite_list

> reads_psite_list$RPF2
                  transcript end5 psite end3 length cds_start cds_stop
       1: ENST00000000233.10   85    97  115     31        89      631
       2: ENST00000000233.10  106   118  135     30        89      631
       3: ENST00000000233.10  106   118  135     30        89      631
       4: ENST00000000233.10  106   118  135     30        89      631
       5: ENST00000000233.10  106   118  135     30        89      631
      ---                                                             
77987585:  ENST00000707138.1 1133  1145 1163     31         1     1536
77987586:  ENST00000707138.1 1133  1145 1163     31         1     1536
77987587:  ENST00000707138.1 1163  1175 1193     31         1     1536
77987588:  ENST00000707138.1 1166  1178 1196     31         1     1536
77987589:  ENST00000707138.1 1498  1510 1528     31         1     1536
          psite_from_start psite_from_stop psite_region
       1:                8            -534          cds
       2:               29            -513          cds
       3:               29            -513          cds
       4:               29            -513          cds
       5:               29            -513          cds
      ---

Some first lines of annotation table

> head(annotation_table)
           transcript l_tr l_utr5 l_cds l_utr3
1: ENST00000000233.10 1032     88   543    401
2:  ENST00000000412.8 2450    159   834   1457
3: ENST00000000442.11 2274    225  1272    777
4:  ENST00000001008.6 3715    170  1380   2165
5:  ENST00000001146.7 4556     28  1539   2989
6:  ENST00000002125.9 2184     48  1326    810

Thank you so much for your attention!! Daeho

fabiolauria commented 1 year ago

Hi Daeho, the problem might be that for some reason you didn't take into account the parameter _refseqsep. As I wrote in my previous message, sometimes there are potential issues related to characters which should be removed through the _refseq_sep_ parameter and the _refseq_sep_ parameter is quite often the culprit for many similar issues. According to your files, chromosomes are named "chr1 1", "chr2 2" etc. in your FASTA file and "chr1", "chr2" etc. in your GTF. So to discard everything which is redundant in the FASTA file (after the blank space), you should split the string by setting _refseqsep = " " and then keep the first element (this is what refseq_sep is meant for).

In any case, it's not clear to me if you have re-aligned everything with the matching GTF and FASTA file based on the very same annotation or if you just changed the files in _codon_usagepsite. If the latter is the case, I still expect issues because of different files used for the alignment step and in riboWaltz (and, in general, it's not a good practice to use non-matching annotation when mapping the reads).

Best Fabio

zzo-joe commented 1 year ago

Hello Fabio, Sorry for the late reply. I was kind of lost with managing references, but finally solved the problem. "split the string by setting refseq_sep = " " " was the key to resolve my problem.

Thank you again for your kind solutions!

Best Daeho

LabTranslationalArchitectomics / riboWaltz

Questions about codon_usage_psite #72