LabTranslationalArchitectomics / riboWaltz

optimization of ribosome P-site positioning in ribosome profiling data
MIT License
46 stars 12 forks source link

Question about codon_usage_psite, refseq_sep = " " not fixing #82

Closed JessieMinyan closed 6 months ago

JessieMinyan commented 6 months ago

Hello Fabio,

Thanks for developing the amazing riboWaltz package. I'm trying the codon_usage_psite using the following command, but get the error: "Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 'subseq': subscript contains invalid names".

I've tried the refseq_sep = " " method in #72 but still got the same error.

example_cu_barplot <- codon_usage_psite(reads_psite_list, annot,
+                                         sample = 'Cont_WT1_star_sortedReAligned.toTranscriptome.out',
+                                         multisamples = "average",
+                                         plot_style = "facet",
+                                         fastapath = "./Mus_musculus.GRCm38.dna.primary_assembly.fa",
+                                         fasta_genome = TRUE,
+                                         gtfpath ="./Mus_musculus.GRCm38.98.gtf",
+                                         frequency_normalization = FALSE)

Some first lines of annotation table

head(annot) transcript l_tr l_utr5 l_cds l_utr3 1: ENSMUST00000000001 3262 141 1065 2056 2: ENSMUST00000000003 902 140 525 237 3: ENSMUST00000000010 2574 85 753 1736 4: ENSMUST00000000028 2143 313 1701 129 5: ENSMUST00000000033 3708 115 543 3050 6: ENSMUST00000000049 1190 51 1038 101

Some first lines of reads_list

head(reads_list[["Cont_WT1_star_sortedReAligned.toTranscriptome.out"]]) transcript end5 end3 length cds_start cds_stop 1: ENSMUST00000123830 108 258 151 0 0 2: ENSMUST00000072769 790 940 151 0 0 3: ENSMUST00000198477 1372 1522 151 0 0 4: ENSMUST00000154460 6180 6330 151 315 1571 5: ENSMUST00000218351 86 236 151 0 0 6: ENSMUST00000188231 317 467 151 0 0

Some first lines of gtf file

!genome-build GRCm38.p6

!genome-version GRCm38

!genome-date 2012-01

!genome-build-accession NCBI:GCA_000001635.8

!genebuild-last-updated 2019-06

1 havana gene 3073253 3074322 . + . gene_id "ENSMUSG00000102693"; gene_version "1"; gene_name "4933401J01Rik"; gene_source "havana"; > 1 havana transcript 3073253 3074322 . + . gene_id "ENSMUSG00000102693"; gene_version "1"; transcript_id "ENSMUST00000193812"; trans> 1 havana exon 3073253 3074322 . + . gene_id "ENSMUSG00000102693"; gene_version "1"; transcript_id "ENSMUST00000193812"; transcript_ve> 1 ensembl gene 3102016 3102125 . + . gene_id "ENSMUSG00000064842"; gene_version "1"; gene_name "Gm26206"; genesource "ensembl"; gene> 1 ensembl transcript 3102016 3102125 . + . gene_id "ENSMUSG00000064842"; gene_version "1"; transcript_id "ENSMUST00000082908"; trans> 1 ensembl exon 3102016 3102125 . + . gene_id "ENSMUSG00000064842"; gene_version "1"; transcript_id "ENSMUST00000082908"; transcript_ve> 1 ensembl_havana gene 3205901 3671498 . - . gene_id "ENSMUSG00000051951"; gene_version "5"; gene_name "Xkr4"; gene_source "ensembl_ha> 1 havana transcript 3205901 3216344 . - . gene_id "ENSMUSG00000051951"; gene_version "5"; transcript_id "ENSMUST00000162897"; trans>

Some first lines of fasta file from the same release version of gtf

1 dna:chromosome chromosome:GRCm38:1:1:195471971:1 REF 10 dna:chromosome chromosome:GRCm38:10:1:130694993:1 REF 11 dna:chromosome chromosome:GRCm38:11:1:122082543:1 REF 12 dna:chromosome chromosome:GRCm38:12:1:120129022:1 REF 13 dna:chromosome chromosome:GRCm38:13:1:120421639:1 REF 14 dna:chromosome chromosome:GRCm38:14:1:124902244:1 REF 15 dna:chromosome chromosome:GRCm38:15:1:104043685:1 REF 16 dna:chromosome chromosome:GRCm38:16:1:98207768:1 REF 17 dna:chromosome chromosome:GRCm38:17:1:94987271:1 REF 18 dna:chromosome chromosome:GRCm38:18:1:90702639:1 REF 19 dna:chromosome chromosome:GRCm38:19:1:61431566:1 REF 2 dna:chromosome chromosome:GRCm38:2:1:182113224:1 REF 3 dna:chromosome chromosome:GRCm38:3:1:160039680:1 REF 4 dna:chromosome chromosome:GRCm38:4:1:156508116:1 REF 5 dna:chromosome chromosome:GRCm38:5:1:151834684:1 REF 6 dna:chromosome chromosome:GRCm38:6:1:149736546:1 REF 7 dna:chromosome chromosome:GRCm38:7:1:145441459:1 REF 8 dna:chromosome chromosome:GRCm38:8:1:129401213:1 REF 9 dna:chromosome chromosome:GRCm38:9:1:124595110:1 REF MT dna:chromosome chromosome:GRCm38:MT:1:16299:1 REF X dna:chromosome chromosome:GRCm38:X:1:171031299:1 REF Y dna:chromosome chromosome:GRCm38:Y:1:91744698:1 REF JH584299.1 dna:scaffold scaffold:GRCm38:JH584299.1:1:953012:1 REF GL456233.1 dna:scaffold scaffold:GRCm38:GL456233.1:1:336933:1 REF JH584301.1 dna:scaffold scaffold:GRCm38:JH584301.1:1:259875:1 REF GL456211.1 dna:scaffold scaffold:GRCm38:GL456211.1:1:241735:1 REF

Some first lines of reads_psite_list

head(reads_psite_list) $Cont_WT1_star_sortedReAligned.toTranscriptome.out transcript end5 psite end3 length cds_start cds_stop 1: ENSMUST00000000001 18 32 53 36 142 1206 2: ENSMUST00000000001 18 32 53 36 142 1206 3: ENSMUST00000000001 18 33 54 37 142 1206 4: ENSMUST00000000001 18 33 54 37 142 1206 5: ENSMUST00000000001 18 33 54 37 142 1206

102135241: ENSMUST00000239169 2287 2294 2309 23 24 1295 102135242: ENSMUST00000239169 2981 2992 3004 24 24 1295 102135243: ENSMUST00000239169 2982 2989 3004 23 24 1295 102135244: ENSMUST00000239169 2982 2989 3004 23 24 1295 102135245: ENSMUST00000239169 2982 2989 3004 23 24 1295 psite_from_start psite_from_stop psite_region 1: -110 -1174 5utr 2: -110 -1174 5utr 3: -109 -1173 5utr 4: -109 -1173 5utr 5: -109 -1173 5utr

102135241: 2270 999 3utr 102135242: 2968 1697 3utr 102135243: 2965 1694 3utr 102135244: 2965 1694 3utr 102135245: 2965 1694 3utr

Would you mind kindly help look at this?

Thanks a lot in advance. Best, Jessie

fabiolauria commented 6 months ago

Hi Jessie, thank you for using riboWaltz.

I'm trying to understand from your nicely presented issue where the problem is. As you pointed out, it is usually related to discrepancies in the name of the "objects" in the data structures. More in detail, transcript sequences are extracted from the FASTA file - which includes the sequences of the chromosomes - using the GTF as a reference for the correspondence chromosome-transcripts. These transcript sequences are then used to retrieve, for each transcript reported in the input P-site list, the three nucleotides associated with each P-site. This means that:

  1. the chromosome names should be the same in the FASTA and the GTF;
  2. the transcript names should be the same in the GTF, the P-site list and the annotation data.table.

It looks like point 2) is satisfied since I can see ENSMUST***-like transcript names in the three data structures. Related to point 1) chromosomes are named "1", "2" etc in the GTF and something like "1 dna:chromosome chromosome:GRCm38:1:1:195471971:1 REF" in the FASTA. However, using ref = " " everything but the characters before the first space " " are kept and the names should then be "1", "2" etc as for the GTF. You can also try using ref = " dna" but I don't think it is going to change.

The only thing I can think of is that even though the FASTA file is from the same release version of the GTF, the two files have been downloaded using distinct sources. If so, no surprise the names differ. If this is not the case, you can try and send me the annotation_dt, the reads_psite_list (with one - or a chunck if it's too big - data.table in it and not all of them) and give me the links for downloading the GTF and the FAST. This way I can try on my own and get back to you as soon as I have an answer.

Let me know what you prefer.

Best Fabio

JessieMinyan commented 6 months ago

Hi Fabio,

Thank you so much for your prompt response!

I tried the ref = " dna" and it didn't change.

The GTF and FASTA file I get are both from Ensembl GRCm38 release-98, using following codes:

wget -c https://ftp.ensembl.org/pub/release-98/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz

wget -c https://ftp.ensembl.org/pub/release-98/gtf/mus_musculus/Mus_musculus.GRCm38.98.gtf.gz

Please find a small-sized example of annotation_dt, reads_psite_list in the attach files.

(Generated from my original annot, and reads_psite_list using the following codes:

reads_psite_list[["Cont_WT1_star_sortedReAligned.toTranscriptome.out"]][sample(.N, 20)]
r=reads_psite_list[["Cont_WT1_star_sortedReAligned.toTranscriptome.out"]]$transcript
r=as.character(r)
annot2=annot[annot$transcript==r,]
annot3=annot[c(1:5),]
annotation_dt=rbind(annot2,annot3)

)

Thanks so much for your kind help! Look forward to further hearing from you. Best, Jessie

JessieMinyan commented 6 months ago

reads_psite_list.RData.zip

annotation_dt.RData.zip

JessieMinyan commented 6 months ago

Dear Fabio,

Thank you so much for all your time and kind attention to the question!!

I found refseq_sep = " " is also in the _codon_usagepsite ! (I wrongly put it in the reads_list generation step by bamtolist). The solution to this question is still refseq_sep = " " (just worked out)!

Thanks so much again!! Best, Jessie