Closed: JessieMinyan closed this issue 6 months ago
Hi Jessie, thank you for using riboWaltz.
I'm trying to understand from your nicely presented issue where the problem is. As you pointed out, it is usually related to discrepancies in the names of the "objects" in the data structures. In more detail, transcript sequences are extracted from the FASTA file (which contains the chromosome sequences) using the GTF as the reference for the chromosome-transcript correspondence. These transcript sequences are then used to retrieve, for each transcript reported in the input P-site list, the three nucleotides associated with each P-site. This means that:
It looks like point 2) is satisfied, since I can see ENSMUST***-like transcript names in the three data structures. Regarding point 1), chromosomes are named "1", "2" etc. in the GTF and something like "1 dna:chromosome chromosome:GRCm38:1:1:195471971:1 REF" in the FASTA. However, with ref = " " only the characters before the first space " " are kept, so the names should then be "1", "2" etc., as in the GTF. You can also try ref = " dna", but I don't think it is going to change anything.
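The trimming described above can be illustrated with a minimal base R sketch. It assumes the separator is matched literally and that only the text before its first occurrence is kept; `trim_header` is a hypothetical helper written for illustration, not a riboWaltz function, and the GTF chromosome names are toy values.

```r
# FASTA header as quoted above (Ensembl style) and toy GTF chromosome names
fasta_headers <- c("1 dna:chromosome chromosome:GRCm38:1:1:195471971:1 REF")
gtf_chromosomes <- c("1", "2")

# Hypothetical helper: keep only the text before the first occurrence of `sep`
trim_header <- function(headers, sep = " ") {
  vapply(strsplit(headers, sep, fixed = TRUE), `[`, character(1), 1)
}

# Both separators leave the same name for this header
trimmed_space <- trim_header(fasta_headers, sep = " ")
trimmed_dna <- trim_header(fasta_headers, sep = " dna")

# Trimmed FASTA names that have no counterpart in the GTF (empty if all match)
unmatched <- setdiff(trimmed_space, gtf_chromosomes)
print(trimmed_space)
print(unmatched)
```

For this header both separators give the same chromosome name, which is consistent with " dna" making no difference.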
The only thing I can think of is that, even though the FASTA file is from the same release version as the GTF, the two files have been downloaded from different sources. If so, it is no surprise that the names differ. If this is not the case, you can send me the annotation_dt and the reads_psite_list (with only one data.table in it, or a chunk of it if it's too big, rather than all of them) and give me the links for downloading the GTF and the FASTA. This way I can try on my own and get back to you as soon as I have an answer.
Let me know what you prefer.
Best, Fabio
Hi Fabio,
Thank you so much for your prompt response!
I tried the ref = " dna" and it didn't change.
The GTF and FASTA files I have are both from Ensembl GRCm38 release-98, downloaded using the following commands:
wget -c https://ftp.ensembl.org/pub/release-98/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
wget -c https://ftp.ensembl.org/pub/release-98/gtf/mus_musculus/Mus_musculus.GRCm38.98.gtf.gz
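Please find a small example of annotation_dt and reads_psite_list in the attached files.
(Generated from my original annot and reads_psite_list using the following code:
# 20 random rows from one sample of the P-site list
reads_psite_list[["Cont_WT1_star_sortedReAligned.toTranscriptome.out"]][sample(.N, 20)]
# transcript names present in that sample
r <- reads_psite_list[["Cont_WT1_star_sortedReAligned.toTranscriptome.out"]]$transcript
r <- as.character(r)
# annotation rows for those transcripts; %in% tests membership, whereas == would recycle r
annot2 <- annot[annot$transcript %in% r, ]
# first five rows of the annotation
annot3 <- annot[1:5, ]
annotation_dt <- rbind(annot2, annot3)
)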
Thanks so much for your kind help! I look forward to hearing from you. Best, Jessie
Dear Fabio,
Thank you so much for all your time and kind attention to the question!!
I found that refseq_sep = " " is also an argument of codon_usage_psite! (I had wrongly put it in the reads_list generation step, in bamtolist.) The solution to this question is still refseq_sep = " " (it just worked)!
Thanks so much again!! Best, Jessie
Hello Fabio,
Thanks for developing the amazing riboWaltz package. I'm trying to run codon_usage_psite using the following command, but I get the error: "Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 'subseq': subscript contains invalid names".
I've tried the refseq_sep = " " method in #72 but still got the same error.
Some first lines of annotation table
Some first lines of reads_list
Some first lines of gtf file
#!genome-build GRCm38.p6
#!genome-version GRCm38
#!genome-date 2012-01
#!genome-build-accession NCBI:GCA_000001635.8
#!genebuild-last-updated 2019-06
1 havana gene 3073253 3074322 . + . gene_id "ENSMUSG00000102693"; gene_version "1"; gene_name "4933401J01Rik"; gene_source "havana"; >
1 havana transcript 3073253 3074322 . + . gene_id "ENSMUSG00000102693"; gene_version "1"; transcript_id "ENSMUST00000193812"; trans>
1 havana exon 3073253 3074322 . + . gene_id "ENSMUSG00000102693"; gene_version "1"; transcript_id "ENSMUST00000193812"; transcript_ve>
1 ensembl gene 3102016 3102125 . + . gene_id "ENSMUSG00000064842"; gene_version "1"; gene_name "Gm26206"; gene_source "ensembl"; gene>
1 ensembl transcript 3102016 3102125 . + . gene_id "ENSMUSG00000064842"; gene_version "1"; transcript_id "ENSMUST00000082908"; trans>
1 ensembl exon 3102016 3102125 . + . gene_id "ENSMUSG00000064842"; gene_version "1"; transcript_id "ENSMUST00000082908"; transcript_ve>
1 ensembl_havana gene 3205901 3671498 . - . gene_id "ENSMUSG00000051951"; gene_version "5"; gene_name "Xkr4"; gene_source "ensembl_ha>
1 havana transcript 3205901 3216344 . - . gene_id "ENSMUSG00000051951"; gene_version "5"; transcript_id "ENSMUST00000162897"; trans>
Some first lines of the FASTA file, from the same release version as the GTF
Some first lines of reads_psite_list
Would you mind taking a look at this?
Thanks a lot in advance. Best, Jessie
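An error such as "subscript contains invalid names" generally means that a lookup by name asked for names that are absent from the object being subset. A minimal base R sketch of a pre-check is shown here. It is a generic diagnostic, not part of riboWaltz; `report_unmatched` is a hypothetical helper and the two small tables are toy data that reuse transcript IDs from the GTF excerpt above.

```r
# Toy stand-ins (assumptions) for an annotation table and one P-site table
annotation_dt <- data.frame(
  transcript = c("ENSMUST00000193812", "ENSMUST00000082908"),
  stringsAsFactors = FALSE
)
psite_dt <- data.frame(
  transcript = c("ENSMUST00000193812", "ENSMUST00000162897"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: names that are requested but not available
report_unmatched <- function(requested, available) {
  setdiff(unique(as.character(requested)), as.character(available))
}

# Transcripts in the P-site table with no entry in the annotation
missing_transcripts <- report_unmatched(psite_dt$transcript, annotation_dt$transcript)
print(missing_transcripts)
```

The same comparison can be applied one level up, between the chromosome names in the GTF and the (trimmed) sequence names in the FASTA file.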