Roleren / ORFik

MIT License
32 stars 9 forks source link

Error in shiftFootprintsByExperiment step #185

Open lundlab opened 1 week ago

lundlab commented 1 week ago

Hi Håkon,

I work in Anders Lund's lab at BRIC in Copenhagen. I’m trying to use your ORFik pipeline, mainly to do QC on a recent RIbo-seq experiment performed on mouse organoids. But maybe also fro some other stuff afterwards. Basically I first want to check that my RPFs have 3nt periodicity before I constrict and sequence the parallel RNA-seq libraries.

Everything seems to be fine until I get to the “shiftFootprintsByExperiment” step. Then I get some errors, for example: "Error in covRleFromGR(x, weight = weight, ignore.strand = ignore.strand): Seqlengths of x contains NA values!”

Could you help me solve this?

I see someone else had a similar problem on Biostars, but the solution is not present in the posts. https://www.biostars.org/p/9552773/

Thanks, Martin Jansson

Roleren commented 1 week ago

Hey, what is R and ORFik version?

lundlab commented 1 week ago

R is 4.4.0 ORFik is 1.26.0

Roleren commented 1 week ago

Try this:

remotes::install_github("Roleren/ORFik")

Then you must restart R to reload package set Then retry your code

Roleren commented 1 week ago

If it does not work send me this: seqinfo(df) df is the ORFik experiment I suspect you have non matching gtf/gff to bam file

lundlab commented 1 week ago

Thanks for the help, I'll send the seqinfo(df) later.

I installed your GitHub ORFik version and restarted as you suggested.

You were right about the GTF, my colleague had modified the standard Gencode mouse version and used that for alignment in STAR.

But with the same fasta & GTF used in STAR, I get a different error:

shiftFootprintsByExperiment(df[df$libtype == "RFP",]) Shifting reads in experiment: SNORD102_RPF Saving ofst files to: /Users/svd569/Desktop/STAR_outputs/Bam_files_genome/pshifted/ Saving wig files to: /Users/svd569/Desktop/STAR_outputs/Bam_files_genome/pshifted/ Warning in call_fun_in_txdbmaker("makeTxDbFromGFF", ...) : makeTxDbFromGFF() has moved to the txdbmaker package. Please call txdbmaker::makeTxDbFromGFF() to get rid of this warning. Import genomic features from the file as a GRanges object ... OK Prepare the 'metadata' data frame ... OK Make the TxDb object ... Warning in .get_cds_IDX(mcols0$type, mcols0$phase) : The "phase" metadata column contains non-NA values for features of type stop_codon. This information was ignored. OK /Users/svd569/Desktop/STAR_outputs/Bam_files_genome/Empty2_mergedfile_w_umi_RPF_clean_RPF_star_alignedAligned.sortedByCoord.out.bam Warning in call_fun_in_txdbmaker("makeTxDbFromGFF", ...) : makeTxDbFromGFF() has moved to the txdbmaker package. Please call txdbmaker::makeTxDbFromGFF() to get rid of this warning. Import genomic features from the file as a GRanges object ... OK Prepare the 'metadata' data frame ... OK Make the TxDb object ... Warning in .get_cds_IDX(mcols0$type, mcols0$phase) : The "phase" metadata column contains non-NA values for features of type stop_codon. This information was ignored. OK

/Users/svd569/Desktop/STAR_outputs/Bam_files_genome/Empty1_mergedfile_w_umi_RPF_clean_RPF_star_alignedAligned.sortedByCoord.out.bam Warning in call_fun_in_txdbmaker("makeTxDbFromGFF", ...) : makeTxDbFromGFF() has moved to the txdbmaker package. Please call txdbmaker::makeTxDbFromGFF() to get rid of this warning. Import genomic features from the file as a GRanges object ... OK Prepare the 'metadata' data frame ... OK Make the TxDb object ... Warning in .get_cds_IDX(mcols0$type, mcols0$phase) : The "phase" metadata column contains non-NA values for features of type stop_codon. This information was ignored. OK

Stop worker failed with the error: wrong args for environment subassignment Error: BiocParallel errors 2 remote errors, element index: 1, 2 4 unevaluated and other errors first remote error: Error in fold(cvg, circle.length): 'circle.length' must be positive

Roleren commented 1 week ago

Hm, fold is not an ORFik function as far as I remember. Can you do in R:

?fold

Then update that package, if it is already updated then find GitHub repo and install like you did with ORFik, if the bug is still there I can tell you how to install older package version (without the bug in it).

The bug is about coverage on circular chromosomes it looks like. There is a small chance something is still wrong with gtf, but let's test this first.

lundlab commented 1 week ago

?fold gives:

S4Vectors internals {S4Vectors} R Documentation S4Vectors internals Description Objects, classes and methods defined in the S4Vectors package that are not intended to be used directly.

[Package S4Vectors version 0.44.0 Index]

. Note this installed S4Vectors is the latest version
lundlab commented 1 week ago

I tried to remove the circular chromosome from the txdb and df with: txdb <- dropSeqlevels(txdb, "chrM", pruning.mode=c("tidy")) df <- dropSeqlevels(df, "chrM", pruning.mode=c("tidy"))

then update the template <- create.experiment, and try shiftFootprintsByExperiment again, but got another unspecified error:

shiftFootprintsByExperiment(df[df$libtype == "RFP",]) Shifting reads in experiment: SNORD102_RPF Saving ofst files to: /Users/svd569/Desktop/STAR_outputs/Bam_files_genome/pshifted/ Saving wig files to: /Users/svd569/Desktop/STAR_outputs/Bam_files_genome/pshifted/ Warning in call_fun_in_txdbmaker("makeTxDbFromGFF", ...) : makeTxDbFromGFF() has moved to the txdbmaker package. Please call txdbmaker::makeTxDbFromGFF() to get rid of this warning. Import genomic features from the file as a GRanges object ... OK Prepare the 'metadata' data frame ... OK Make the TxDb object ... Warning in .get_cds_IDX(mcols0$type, mcols0$phase) : The "phase" metadata column contains non-NA values for features of type stop_codon. This information was ignored. OK /Users/svd569/Desktop/STAR_outputs/Bam_files_genome/SNORD102OE3_mergedfile_w_umi_RPF_clean_RPF_star_alignedAligned.sortedByCoord.out.bam Warning in call_fun_in_txdbmaker("makeTxDbFromGFF", ...) : makeTxDbFromGFF() has moved to the txdbmaker package. Please call txdbmaker::makeTxDbFromGFF() to get rid of this warning. Import genomic features from the file as a GRanges object ... OK Prepare the 'metadata' data frame ... OK Make the TxDb object ... Warning in .get_cds_IDX(mcols0$type, mcols0$phase) : The "phase" metadata column contains non-NA values for features of type stop_codon. This information was ignored. OK

Stop worker failed with the error: wrong args for environment subassignment Error: BiocParallel errors 0 remote errors, element index: 6 unevaluated and other errors first remote error:

Roleren commented 6 days ago

That said nothing, the error should be after last line "first remote error:"

So lets do the test outside the bplapply, do this:

# Load first ribo-seq lib
rfp_path <- filepath(df[1,], "ofst")
print(rfp_path)
rfp <- fimport(rfp_path)
print(seqinfo(rfp))
# Load txdb 
txdb <- loadTxdb(df)
print(seqinfo(txdb))
print(seqinfo(df))
detectRibosomeShifts(rfp, txdb, verbose = TRUE) # <- This will fail I guess with a better error
lundlab commented 6 days ago

Ok I did that (with the mitochondrial chromosome still included) and got this:

rfp_path <- filepath(df[1,], "ofst") print(rfp_path) [1] "/Users/svd569/Desktop/STAR_outputs/Bam_files_genome/Empty1_mergedfile_w_umi_RPF_clean_RPF_star_alignedAligned.sortedByCoord.out.bam"

rfp <- fimport(rfp_path) print(seqinfo(rfp)) Seqinfo object with 61 sequences from an unspecified genome: NA

Screenshot 2024-11-20 at 11 29 29

Load txdb

txdb <- loadTxdb(df) Import genomic features from the file as a GRanges object ... OK Prepare the 'metadata' data frame ... OK Make the TxDb object ... Warning in .get_cds_IDX(mcols0$type, mcols0$phase) : The "phase" metadata column contains non-NA values for features of type stop_codon. This information was ignored. OK print(seqinfo(txdb)) Seqinfo object with 22 sequences (1 circular) from an unspecified genome; no seqlengths: NA

Screenshot 2024-11-20 at 11 31 50

print(seqinfo(df)) Seqinfo object with 61 sequences from an unspecified genome: NA

Screenshot 2024-11-20 at 11 32 41

detectRibosomeShifts(rfp, txdb, verbose = TRUE)

Number of CDSs used for p-site detection: 18032

Distribution of read lengths with reads count > min_reads:

Error in fold(cvg, circle.length) : 'circle.length' must be positive

Roleren commented 6 days ago

Yes, so your txdb is malformed.

I think the df@txdb for you specifies a gtf path and not a txdb path ?

This is a bit sketchy if you have nonstandard STAR alignment. What is safest is always to use a predefined txdb like I do in the tutorial, so do this:

# Make a safe txdb, in ORFIk:::getGenomeAndAnnotation this is done for you:
# Given that your df@txdb is a .gtf path and not .gtf.db (txdb) path
gtf_path <- df@txdb 
genome_fasta <- df@fafile
organism <- organism(df) # Mus musculus if not defined
makeTxdbFromGenome(gtf_path, genome_fasta, organism, optimize = TRUE)

#Now recreate your experiment, with txdb input:
create.experiment(..., txdb = paste0(gtf_path, ".db"), …)
df <- read.experiment(...) # Load new experiment

#Now run the detectRibosomeShifts call again like this:
rfp_path <- filepath(df[1,], "ofst") # Ignore reload if file is large bam and not ofst
rfp <- fimport(rfp_path)
# Load txdb 
txdb <- loadTxdb(df)
stopifnot(identical(seqinfo(txdb), seqinfo(df)))
detectRibosomeShifts(rfp, txdb, verbose = TRUE) # <- This should now work
Roleren commented 6 days ago

A note for speed, I see you have bam files. ORFik has many super optimized formats, so I always run: convert_bam_to_ofst(df) after STAR alignment, to get much faster version of bam file. This speeds up shifting later on etc.

Downstream you have other formats like covRle which are insanely fast for plotting and coverage calculations etc. If this is just a single run test, you can just use the file formats you have, as formats are mostly important when you reload the same file multiple times (for different analysis etc).

lundlab commented 6 days ago

I'll try to make a safe txdb again, I did try that before, but got an error about NNNNNNN in the fasta file.

Actually detectRibosomeShifts seems to work now with the code you gave previously, as long as I take out the ChrM. Can't seem to generate the plots yet though with shiftPlots.


Number of CDSs used for p-site detection: 18032

Distribution of read lengths with reads count > min_reads:

Read length: 26 Top 10 periods from spectrogram, look for period > 2.9 & < 3.1: Read length: 27 Top 10 periods from spectrogram, look for period > 2.9 & < 3.1: Read length: 28 Top 10 periods from spectrogram, look for period > 2.9 & < 3.1: Read length: 29 Top 10 periods from spectrogram, look for period > 2.9 & < 3.1: Read length: 30 Top 10 periods from spectrogram, look for period > 2.9 & < 3.1: Read length: 31 Top 10 periods from spectrogram, look for period > 2.9 & < 3.1: Read length: 32 Top 10 periods from spectrogram, look for period > 2.9 & < 3.1: Read length: 33 Top 10 periods from spectrogram, look for period > 2.9 & < 3.1: Read length: 34 Top 10 periods from spectrogram, look for period > 2.9 & < 3.1: Periodic read lengths: [1] 26 28 29 31 32 34

Coverage of TIS region per read length before filtering

Change point analysis (start): ups (upstream window score), downs (downstream window score) Info / Read length: 26 Possible change points of the max frame with sliding windows: Info / Read length: 28 Possible change points of the max frame with sliding windows: Info / Read length: 29 Possible change points of the max frame with sliding windows: Info / Read length: 31 Possible change points of the max frame with sliding windows: Info / Read length: 32 Possible change points of the max frame with sliding windows: Info / Read length: 34 Possible change points of the max frame with sliding windows:

Roleren commented 6 days ago

So you ran the shiftFootprintsByExperiment step and have the shifted files.

And now you loaded that and tried to run shiftPlots and it fails ?

lundlab commented 6 days ago

No it doesn't seem the pshifted files were written. I just got a gtable with some grobs in it, which I was able to convert into a ggplot object with as.ggplot from ggplotify library.. Not sure if the plots are correct, I'll attach the file here so you can take a look. PeriodicityEmpty&_SNORD102OE.pdf

I'll try now to make the safe txdb and start agin from the beginning. Will let you know how it goes.

Roleren commented 6 days ago

When you have pshifted, check this:

filepath(df, "pshifted")

It should link to ofst files in the ./pshifted/ folder relative to your bam files, if it does not it means it failed.

Roleren commented 6 days ago

Also, if you need help with the genome (the NNNNNNN), send me the error. N in the genome fasta should not matter as far as I know. As most chromosome has NNNNNNN flanks.

Roleren commented 20 hours ago

Any updates ? :)