erhard-lab / price

Improved Ribo-seq enables identification of cryptic translation events

bug: dORF not stopped on TGA #14

Closed by freeedog 2 years ago

freeedog commented 3 years ago

image

So some dORFs are in fact CDSs.

florianerhard commented 3 years ago

Dear freeedog, this is very strange. We have never seen anything like this for any gene in any data set. I checked several data sets here for CCT5, and the ORF always ended at the TGA. Could you kindly provide a data set so that we can reproduce this?

freeedog commented 3 years ago

here is the data list: SRX569214 SRX569215 SRX7517156 SRX7517157 SRX7628082 SRX7628083 SRX7628084 SRX7628085 SRX7628101 SRX7628102

here is my command:

STAR --genomeDir $starIndex --outSAMattributes MD NH --alignEndsType EndToEnd --outBAMcompression 10 --outBAMsortingThreadN $nthreads --outFileNamePrefix $prefix --outFilterMismatchNmax 2 --outFilterMultimapNmax 20 --outFilterType BySJout --outMultimapperOrder Random --outSAMtype BAM SortedByCoordinate --outSAMmultNmax 1 --readFilesIn $prefix.clean.fastq --outSAMunmapped Within --outWigStrand Stranded --outWigType bedGraph --quantMode GeneCounts --runThreadN $nthreads --sjdbGTFfile $annotation --sjdbOverhang 29 --alignSJDBoverhangMin 1

gedi -e Bam2CIT -id -p SRX569214.cit SRX569214Aligned.sortedByCoord.out.bam
gedi -e Bam2CIT -id -p SRX569215.cit SRX569215Aligned.sortedByCoord.out.bam
gedi -e Bam2CIT -id -p SRX7517156.cit SRX7517156Aligned.sortedByCoord.out.bam
gedi -e Bam2CIT -id -p SRX7517157.cit SRX7517157Aligned.sortedByCoord.out.bam
gedi -e Bam2CIT -id -p SRX7628082.cit SRX7628082Aligned.sortedByCoord.out.bam
gedi -e Bam2CIT -id -p SRX7628083.cit SRX7628083Aligned.sortedByCoord.out.bam
gedi -e Bam2CIT -id -p SRX7628084.cit SRX7628084Aligned.sortedByCoord.out.bam
gedi -e Bam2CIT -id -p SRX7628085.cit SRX7628085Aligned.sortedByCoord.out.bam
gedi -e Bam2CIT -id -p SRX7628101.cit SRX7628101Aligned.sortedByCoord.out.bam
gedi -e Bam2CIT -id -p SRX7628102.cit SRX7628102Aligned.sortedByCoord.out.bam

gedi -e MergeCIT -c HCT116.merged.cit *.cit

gedi -t /tmp -e ResolveAmbiguities -r HCT116.merged.cit -s HCT116.merged.rescue.csv -o HCT116.merged.rescue.cit -g hg38 -D

gedi -t /tmp -e Price -reads HCT116.merged.rescue.cit -genomic hg38 -prefix HCT116 -plot -percond -D -nthreads 24
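The Bam2CIT calls in this comment differ only in the accession, so they could be generated from the data list. A minimal sketch in POSIX shell, assuming every BAM file follows the `<accession>Aligned.sortedByCoord.out.bam` naming used in this comment. `bam2cit_commands` is a hypothetical helper name, and it only prints the commands rather than running them:

```shell
# Hypothetical helper: print one Bam2CIT command per accession (dry run).
# Uses only the gedi options that appear in the commands in this comment.
bam2cit_commands() {
  for acc in "$@"; do
    printf 'gedi -e Bam2CIT -id -p %s.cit %sAligned.sortedByCoord.out.bam\n' "$acc" "$acc"
  done
}

# Accessions taken from the data list in this comment.
accessions="SRX569214 SRX569215 SRX7517156 SRX7517157 SRX7628082 SRX7628083 SRX7628084 SRX7628085 SRX7628101 SRX7628102"

# Word splitting of $accessions is intentional here.
bam2cit_commands $accessions
```

Piping the output to `sh` would execute the commands. Printing first makes it easy to check the file names before anything runs.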

florianerhard commented 3 years ago

Dear freeedog,

when we process these samples, it looks like this: image

The codon coverage profile we got looks a bit different from yours. However, given how the program works, I see no reason why this would change anything: ORF candidates are first generated from the sequence, and only afterwards is the read data used.
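To illustrate why a sequence-only step should not depend on the reads, here is a minimal sketch (not PRICE's actual code) that scans in frame from a start position to the first stop codon. `first_stop` is a hypothetical helper name; it assumes a plain DNA string and a 0-based start offset.

```shell
# Hypothetical helper, not part of PRICE: print the 0-based offset of the
# first in-frame stop codon (TAA, TAG, TGA) at or after the start offset.
# Returns a non-zero exit status if no in-frame stop codon is found.
first_stop() {
  printf '%s\n' "$1" | awk -v start="${2:-0}" '{
    s = toupper($0)
    # awk strings are 1-based, so shift the 0-based start by one.
    for (i = start + 1; i + 2 <= length(s); i += 3) {
      c = substr(s, i, 3)
      if (c == "TAA" || c == "TAG" || c == "TGA") {
        print i - 1
        exit 0
      }
    }
    exit 1
  }'
}

first_stop ATGAAATGACCC 0   # prints 6
```

In `ATGAAATGACCC` the out-of-frame `TGA` at offset 1 is skipped and the in-frame one at offset 6 is reported. Nothing in this step looks at coverage, so with the same sequence and annotation the result should be the same regardless of the read data.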

Unfortunately, the only way we can find the reason for this behavior is if you upload (i) HCT116.merged.rescue.cit, (ii) the PRICE output folder, and (iii) the FASTA and GTF of your genome somewhere...

freeedog commented 3 years ago

Here is a OneDrive link with the files you asked for: https://xmueducn0-my.sharepoint.com/:f:/g/personal/djch_xmu_edu_cn1/EtMZXaSNTRJBqn7nCjBSATYBZHyzhn4nj0SEf-kFWnswIQ?e=2gr37o

By the way, how can I use the pipeline if I don't have access to a cluster? Or how can I use the pipeline on an LSF cluster? Could you please share the command line you used to process the previous data?

Thanks

florianerhard commented 3 years ago

Dear freeedog,

thank you for all your efforts. However, I still cannot reproduce this error, even with the exact same read data and the exact same genome that you used. This is the result I get when I open the genome browser for your PRICE analyses: image

Why is the codon coverage profile different from the one in your screenshot above? This is impossible; the genome browser simply renders the values in the files that you uploaded...

And this is what I get after running PRICE myself on your mapped reads and your genome: image

My only explanation here is that something was wrong with your genome files when you called PRICE.
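One cheap sanity check for genome files, sketched here as an illustration only: list the sequence names used in a GTF that have no record in the FASTA. This covers just one possible kind of mismatch and is not a diagnosis of what happened in this issue. `missing_contigs` is a hypothetical helper name, and the file paths in the usage comment are placeholders.

```shell
# Hypothetical helper: print each sequence name that appears in column 1 of
# the GTF ($2) but has no header line in the FASTA ($1). Prints each name once.
missing_contigs() {
  awk '
    FNR == NR {
      # First file (FASTA): remember the name after ">" up to the first space.
      if (/^>/) {
        sub(/^>/, "", $1)
        seen[$1] = 1
      }
      next
    }
    /^#/ { next }   # skip GTF comment lines
    !($1 in seen) && !reported[$1]++ { print $1 }
  ' "$1" "$2"
}

# Usage (placeholder paths):
#   missing_contigs genome.fasta annotation.gtf
```

Empty output means every GTF sequence name has a FASTA record. It does not prove that the sequences themselves are the expected ones.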