Roleren / ORFik

MIT License
32 stars 9 forks source link

ORFikQC - DeSeq dataset collapse Error #169

Closed csittz closed 4 months ago

csittz commented 4 months ago

Hi, was trying to use ORFik for my riboseq analyses, i encounter this error when running ORFikQC.

Using R 4.2.2, ORFik 1.18.2, window10

riboExp <- create.experiment(dir='transcript.bam/',exper='myProject',txdb='gencode.v44.annotation.gtf',fa='Ref/gencode.v44.transcripts.fa',organism = "Homo sapiens", saveDir=NULL, rep=c(1,2,3,1,2,3,1,2,3,1,2,3), condition=rep(c("A","B","A","B"),each=3),stage=rep(c("Cell1","Cell2"),each=6))

df1 <- read.experiment(riboExp)
ORFikQC(df1,BPPARAM = BiocParallel::SerialParam())

Output from R [edited]:

Started ORFik QC report for experiment: myProject Saving, ofst files to: transcript.bam/ofst/ Outputting libraries from: myProject RFP_A_Cell1_r1 RFP_A_Cell1_r2 ... Converting libraries to new format: ofst RFP_A_Cell1_r1 RFP_A_Cell1_r2 ...

Creating read length tables:

_Creating read count tables for region: cds Import genomic features from the file as a GRanges object ... OK Prepare the 'metadata' data frame ... OK Make the TxDb object ... OK Outputting libraries from: myProject 1: RFP_A_Cell1_r1 ... _Creating read count tables for region: mrna Import genomic features from the file as a GRanges

Making alignment statistics for lib: Import genomic features from the file as a GRanges object ... OK Prepare the 'metadata' data frame ... OK Make the TxDb object ... OK RFP_A_Cell1_r1 RFP_A_Cell1_r2 ... Could not find raw read counts of data, setting to NA No folder called:transcript.bam/../trim

Making QC plots: _Annotation to NGS libraries plot: Using previously stored readlengths in QC_STATS folder! _Correlation plots raw scaled fpkm (simple) Loading default 80S counts, update count.folder to pshifted if wanted? Error in DESeqDataSet(collapsedAll, design = ~SAMPLE) : all samples have 0 counts for all genes. check the counting script. In addition: There were 47 warnings (use warnings() to see them)

Running ORFikQC with df1[1,] only give Error in covRleFromGR(x, weight = weight, ignore.strand = ignore.strand): Seqlengths of x contains NA values!

txdb <- loadTxdb('gencode.v44.annotation.gtf')
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type stop_codon. This information was ignored.
> seqinfo(txdb)
Seqinfo object with 25 sequences (1 circular) from an unspecified genome; no seqlengths:
  seqnames seqlengths isCircular genome
  chr1           <NA>       <NA>   <NA>
  chr2           <NA>       <NA>   <NA>
... ....
  chrY           <NA>       <NA>   <NA>
  chrM           <NA>       TRUE   <NA>
> txdb
TxDb object:
Db type: TxDb
Supporting package: GenomicFeatures
Data source: gencode.v44.annotation.gtf
Organism: NA
Taxonomy ID: NA
miRBase build ID: NA
Genome: NA
Nb of transcripts: 252835
Db created by: GenomicFeatures package from Bioconductor
Creation time: 2024-04-12 16:21:01 +0800 (Fri, 12 Apr 2024)
GenomicFeatures version at creation time: 1.50.4
RSQLite version at creation time: 2.3.4
DBSCHEMAVERSION: 1.2

Can't run the shiftFootPrint as well

shiftFootprintsByExperiment(df1, accepted.lengths = c(28:32),BPPARAM = BiocParallel::SerialParam()) Shifting reads in experiment: myProject Saving ofst files to:transcript.bam/pshifted/ Saving wig files to: transcript.bam/pshifted/ Import genomic features from the file as a GRanges object ... OK Prepare the 'metadata' data frame ... OK Make the TxDb object ... OK transcript.bam/ofst/RFP_A_Cell1_r1.toTranscriptome.out.ofst Import genomic features from the file as a GRanges object ... OK Prepare the 'metadata' data frame ... OK Make the TxDb object ... OK Error: BiocParallel errors 1 remote errors, element index: 1 11 unevaluated and other errors first remote error: Error in .replace_seqlevels_style(x_seqlevels, value): found no sequence renaming map compatible with seqname style "UCSC" for this object In addition: Warning messages: 1: In .get_cds_IDX(mcols0$type, mcols0$phase) : The "phase" metadata column contains non-NA values for features of type stop_codon. This information was ignored. 2: In .get_cds_IDX(mcols0$type, mcols0$phase) : The "phase" metadata column contains non-NA values for features of type stop_codon. This information was ignored.

Roleren commented 4 months ago

Hi, this is a seqlevel style error I think. Did you use a genome that was not matching the reference ? Can you run and send me seqinfo for:

seqinfo(loadTxdb(df1)) seqinfo(FaFile(df1@fafile))

Do they Match ?

csittz commented 4 months ago

Hi Roleren

thanks. I used STAR to map to hg38.fa and output to transcriptome bam based on to annotation gencode.v44.annotation.gtf.

the fa has more chromosome.

txdb <- loadTxdb('gencode.v44.annotation.gtf')
seqinfo(txdb)
Seqinfo object with 25 sequences (1 circular) from an unspecified genome; no seqlengths:
  seqnames seqlengths isCircular genome
  chr1           <NA>       <NA>   <NA>
  chr2           <NA>       <NA>   <NA>
  chr3           <NA>       <NA>   <NA>
  chr4           <NA>       <NA>   <NA>
  chr5           <NA>       <NA>   <NA>
  ...             ...        ...    ...
  chr21          <NA>       <NA>   <NA>
  chr22          <NA>       <NA>   <NA>
  chrX           <NA>       <NA>   <NA>
  chrY           <NA>       <NA>   <NA>
  chrM           <NA>       TRUE   <NA>
seqinfo(FaFile(df1@fafile))
Seqinfo object with 3366 sequences from an unspecified genome:
  seqnames             seqlengths isCircular genome
  chr1                  248956422       <NA>   <NA>
  chr2                  242193529       <NA>   <NA>
  chr3                  198295559       <NA>   <NA>
  chr4                  190214555       <NA>   <NA>
  chr5                  181538259       <NA>   <NA>
  ...                         ...        ...    ...
  HLA-DRB1*15:01:01:04      11056       <NA>   <NA>
  HLA-DRB1*15:02:01         10313       <NA>   <NA>
  HLA-DRB1*15:03:01:01      11567       <NA>   <NA>
  HLA-DRB1*15:03:01:02      11569       <NA>   <NA>
  HLA-DRB1*16:02:01         11005       <NA>   <NA

is there any parameter to ignore chromosome not found? i tried removing the extra chromosome in the fasta file, now that the seqinfo match, but still face the same error when runing QC.

seqinfo(FaFile(df1@fafile))
Seqinfo object with 25 sequences from an unspecified genome:
  seqnames seqlengths isCircular genome
  chr1      248956422       <NA>   <NA>
  chr2      242193529       <NA>   <NA>
  chr3      198295559       <NA>   <NA>
  chr4      190214555       <NA>   <NA>
  chr5      181538259       <NA>   <NA>
  ...             ...        ...    ...
  chr21      46709983       <NA>   <NA>
  chr22      50818468       <NA>   <NA>
  chrX      156040895       <NA>   <NA>
  chrY       57227415       <NA>   <NA>
  chrM          16569       <NA>   <NA>
Roleren commented 4 months ago

Hi, first of all, you need the original genome fasta which you aligned to.

Else downstream analysis will fail, because you will have ribo-seq reads aligning to a scaffold you removed.

Secondly, lets use the ORFik txdb fixer function:

# Make and save a txdb named, gencode.v44.annotation.gtf.db
ORFik::makeTxdbFromGenome("gencode.v44.annotation.gtf", genome = 'Ref/gencode.v44.transcripts.fa', organism = "Homo sapiens", optimize = TRUE)
# Now remake the experiment with corrected txdb
create.experiment(dir='transcript.bam/',exper='myProject',txdb='gencode.v44.annotation.gtf.db', ...

Now delete the entire /QC_STATS/ folder relative to your 'dir', which was transcript.bam/, to make sure QC is run without any cached results. Now rerun ORFikQC and it should work.

csittz commented 4 months ago

Thanks. I realized i should be using bam files mapped to genome rather than those converted to transcriptome. using the bam files mapped to genome i can run the ORFikQC without error.

Thank you. i shall close this thread.