EMBL-Hentze-group / DEWSeq

R/Bioconductor package for e/iCLIP data analysis
5 stars 1 forks source link

Does count/matrix file dimension have to match annothenion file dimension? #7

Closed LoneKnightz closed 1 year ago

LoneKnightz commented 1 year ago

Hi,

I followed the htseq-clip + Dewseq pipeline to process clip data.

But it gave erroe when use DESeqDataSetFromSlidingWindows:

Error in SummarizedExperiment(assays = SimpleList(counts = countData), : the rownames and colnames of the supplied assay(s) must be NULL or identical to those of the RangedSummarizedExperiment object (or derivative) to construct

Is this because countData (1462654 3) dimension not match annotationData (65312564 12)?

I get all count/ matrix files and sliding windowed annotation files from HTseq-clip.

Is there any way to fix this ?

Thanks

Distue commented 1 year ago

Hi,

Yes, count files and annotation files must be on the same order and have the same rows.

I would recommend using or following this Rmd:

https://github.com/EMBL-Hentze-group/DEWSeq_analysis_helpers/blob/a24a7b7e9f0d2c80f8ce8ed06f3e8630a8ecebe8/Parametrized_Rmd/analyseStudy.Rmd

R Block Sample Sanity contains a check.

Hope this helps. If you have any questions, let us know.

Tom Schwarzl

LoneKnightz @.***> schrieb am So., 4. Dez. 2022, 00:07:

Hi,

I followed the htseq-clip + Dewseq pipeline to process clip data.

But it gave erroe when use DESeqDataSetFromSlidingWindows:

Error in SummarizedExperiment(assays = SimpleList(counts = countData), : the rownames and colnames of the supplied assay(s) must be NULL or identical to those of the RangedSummarizedExperiment object (or derivative) to construct

Is this because countData (1462654 3) dimension not match annotationData (65312564 12)?

I get all count/ matrix files and sliding windowed annotation files from HTseq-clip.

Is there any way to fix this ?

Thanks

— Reply to this email directly, view it on GitHub https://github.com/EMBL-Hentze-group/DEWSeq/issues/7, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAG2FZF4T25A5QVZHE3C643WLPHDDANCNFSM6AAAAAASS7RFU4 . You are receiving this because you are subscribed to this thread.Message ID: @.***>

LoneKnightz commented 1 year ago

Hi,

Thanks for confirm.

In this case, may I ask dou you know how should I change the Htseq-clip command I used to genarate count/matrix that have same denmision as annotation files? These are the command I used:

for annotation

htseq-clip annotation -g /work/prh169/work/genome/GRCm39/gencode/gencode.vM31.chr_patch_hapl_scaff.annotation.gff3 -o /work/prh169/work/genome/GRCm39/gencode/gencode.vM31.chr_patch_hapl_scaff.annotation.bed htseq-clip createSlidingWindows -i /work/prh169/work/genome/GRCm39/gencode/gencode.vM31.chr_patch_hapl_scaff.annotation.bed -o /work/prh169/work/genome/GRCm39/gencode/gencode.vM31.chr_patch_hapl_scaff.annotation.bed.window htseq-clip mapToId -a /work/prh169/work/genome/GRCm39/gencode/gencode.vM31.chr_patch_hapl_scaff.annotation.bed.window -o /work/prh169/work/genome/GRCm39/gencode/gencode.vM31.chr_patch_hapl_scaff.annotation.bed.name

for count

htseq-clip extract -c 75 -i ../bam_gencode/${filename}Aligned.sortedByCoord.out.bam -e 1 -o ../crosslink_sites/${filename}crosslink.sites.bed htseq-clip count -i ../crosslink_sites/${filename}crosslink.sites.bed -a /work/prh169/work/genome/GRCm39/gencode/gencode.vM31.chr_patch_hapl_scaff.annotation.bed.window -o ../count/${filename}count.txt

Thanks

sudeepsahadevan commented 1 year ago

Hi LoneKnightz Could you please post the first ~10-15 lines of your countData , annotObj files and the entire colData data.frame ? This looks like a SummarizedExperiment Error i.e, colnames(CountData) and rownames(colData) being inconsistent.

LoneKnightz commented 1 year ago

HI sudeepsahadevan, This is the countData

countData_MM[1:15,] unique_id MM-IP_S60_count MM-Input_S61_count 1: ENSMUSG00000064842.3:exon0001W00003 4 6 2: ENSMUSG00000064842.3:exon0001W00004 4 6 3: ENSMUSG00000051951.6:intron0003W00229 1 0 4: ENSMUSG00000051951.6:intron0003W00230 1 0 5: ENSMUSG00000051951.6:intron0003W00231 1 0 6: ENSMUSG00000051951.6:intron0001W17342 2 0 7: ENSMUSG00000051951.6:intron0001W17343 2 0 8: ENSMUSG00000051951.6:intron0001W17344 2 0 9: ENSMUSG00000102343.2:intron0003W01626 1 0 10: ENSMUSG00000102343.2:intron0003W01627 1 0 11: ENSMUSG00000102343.2:intron0003W01628 1 0 12: ENSMUSG00000025900.14:intron0026W05652 1 0 13: ENSMUSG00000025900.14:intron0026W05653 1 0 14: ENSMUSG00000025900.14:intron0014W12048 1 0 15: ENSMUSG00000025900.14:intron0014W12049 1 0

This is the annotationData

annotationData[1:15,] unique_id chromosome begin end strand gene_id gene_name gene_type 1: ENSMUSG00000102693.2:exon0001W00001 chr1 3143475 3143525 + ENSMUSG00000102693.2 4933401J01Rik TEC 2: ENSMUSG00000102693.2:exon0001W00002 chr1 3143495 3143545 + ENSMUSG00000102693.2 4933401J01Rik TEC 3: ENSMUSG00000102693.2:exon0001W00003 chr1 3143515 3143565 + ENSMUSG00000102693.2 4933401J01Rik TEC 4: ENSMUSG00000102693.2:exon0001W00004 chr1 3143535 3143585 + ENSMUSG00000102693.2 4933401J01Rik TEC 5: ENSMUSG00000102693.2:exon0001W00005 chr1 3143555 3143605 + ENSMUSG00000102693.2 4933401J01Rik TEC 6: ENSMUSG00000102693.2:exon0001W00006 chr1 3143575 3143625 + ENSMUSG00000102693.2 4933401J01Rik TEC 7: ENSMUSG00000102693.2:exon0001W00007 chr1 3143595 3143645 + ENSMUSG00000102693.2 4933401J01Rik TEC 8: ENSMUSG00000102693.2:exon0001W00008 chr1 3143615 3143665 + ENSMUSG00000102693.2 4933401J01Rik TEC 9: ENSMUSG00000102693.2:exon0001W00009 chr1 3143635 3143685 + ENSMUSG00000102693.2 4933401J01Rik TEC 10: ENSMUSG00000102693.2:exon0001W00010 chr1 3143655 3143705 + ENSMUSG00000102693.2 4933401J01Rik TEC 11: ENSMUSG00000102693.2:exon0001W00011 chr1 3143675 3143725 + ENSMUSG00000102693.2 4933401J01Rik TEC 12: ENSMUSG00000102693.2:exon0001W00012 chr1 3143695 3143745 + ENSMUSG00000102693.2 4933401J01Rik TEC 13: ENSMUSG00000102693.2:exon0001W00013 chr1 3143715 3143765 + ENSMUSG00000102693.2 4933401J01Rik TEC 14: ENSMUSG00000102693.2:exon0001W00014 chr1 3143735 3143785 + ENSMUSG00000102693.2 4933401J01Rik TEC 15: ENSMUSG00000102693.2:exon0001W00015 chr1 3143755 3143805 + ENSMUSG00000102693.2 4933401J01Rik TEC gene_region Nr_of_region Total_nr_of_region window_number 1: exon 1 1 1 2: exon 1 1 2 3: exon 1 1 3 4: exon 1 1 4 5: exon 1 1 5 6: exon 1 1 6 7: exon 1 1 7 8: exon 1 1 8 9: exon 1 1 9 10: exon 1 1 10 11: exon 1 1 11 12: exon 1 1 12 13: exon 1 1 13 14: exon 1 1 14 15: exon 1 1 15

This is coldata

colData type MM-IP_S60_count IP MM-Input_S61_count Input

Thanks

sudeepsahadevan commented 1 year ago

Hi LoneKnightz,

Thanks for the data. Please be aware that 1 IP vs 1 Input comparison is not optimal for DEWSeq, and that you may have reproducibility issues.

What output do you get for class(countData_MM) ? Is it [1] "data.frame" OR [1] "tbl_df" "tbl" "data.frame" ?

In the first case, please use the column unique_id in countData_MM as rownames(countData_MM) and remove this column from the data.frame.

In the second case, use the option tidy=TRUE for DESeqDataSetFromSlidingWindows as described in the DEWSeq manual

Distue commented 1 year ago

Hi LoneKnightz,

Unfortunately, DEWSeq (the underlying DESeq2) cannot handle 1 vs 1 comparisons, because one cannot estimate variance and therefore not really perform meaningful statistics. Explorative analysis can be done when you use DESeq2 (create a DESeqDataSet object), filter low count windows, then do VST transformation (varianceStabilizingTransformation) on it and then calculating IP - SMI (which is equivalent to plotting IP vs SMI sample and looking for deviations from the 45 degree axis). In your case, you will be only interested in values > 0 because you look for enrichments in IP.

Alternatively, and probably the better solution for this would be to use RCRUNCH (https://www.biorxiv.org/content/10.1101/2022.07.06.498949v1) or after you inspect your data and see beautiful peaks, use PUREclip or CLIPper which do 1 vs 1 comparisons.

Sorry for the inconvenience, we will make sure to highlight the minimal requirements in the vignette.

If you need any more help, please let us know. Tom

On Tue, Dec 6, 2022 at 8:23 AM Sudeep Sahadevan @.***> wrote:

Hi LoneKnightz https://github.com/LoneKnightz,

Thanks for the data. Please be aware that 1 IP vs 1 Input comparison is not optimal for DEWSeq, and that you may have reproducibility issues.

What output do you get for class(countData_MM) ? Is it [1] "data.frame" OR [1] "tbl_df" "tbl" "data.frame" ?

In the first case, please use the column unique_id in countData_MM as rownames(countData_MM) and remove this column from the data.frame.

In the second case, use the option tidy=TRUE as described in the DEWSeq manual https://bioconductor.org/packages/release/bioc/manuals/DEWSeq/man/DEWSeq.pdf

— Reply to this email directly, view it on GitHub https://github.com/EMBL-Hentze-group/DEWSeq/issues/7#issuecomment-1338897753, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAG2FZHQ5CIKCRR44IGYXI3WL3SX5ANCNFSM6AAAAAASS7RFU4 . You are receiving this because you commented.Message ID: @.***>

LoneKnightz commented 1 year ago

Hi Distue

Thanks for reply.

I know that DESeq2 can only handle case with replicates. The result I am having is the prelimary result and they want to test some idea on this. If the result is good, they may sequence more as replicates.

In this way, I have to make sure this pipeline is working so it will be easy for future use.

Thanks for remind. I will try RCRUNCH or the tools you mentioned.

LoneKnightz commented 1 year ago

Hi sudeepsahadevan The output of countData_MM is

class(countData_MM) [1] "data.table" "data.frame"

I use as.data.frame convert countData_MM to data.frame and use unique_id as rownames (and delete unique_id column). but there is new error :

Error in `.rowNamesDF<-`(x, value = value) : duplicate ‘row.names‘ are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘0’, ‘1’, ‘10’, ‘100’, ‘101’, ‘102’, ‘103’, ‘10321’, ‘104’, ‘105’, ‘106’, ‘10601’, ‘107’, ‘10738’, ‘109’, ‘11’, ‘110’, ‘111’, ‘111188’, ‘112’, ‘113’, ‘114’, ‘11505’, ‘11525’, ‘11592’, ‘116’, ‘117’, ‘118’, ‘119’, ‘12’, ‘120’, ‘12060’, ‘121’, ‘12176’, ‘122’, ‘123’, ‘123414’, ‘125’, ‘127’, ‘128’, ‘1287’, ‘13’, ‘130’, ‘131’, ‘132’, ‘133’, ‘134’, ‘135’, ‘137’, ‘1399’, ‘14’, ‘143’, ‘144’, ‘145’, ‘146’, ‘148’, ‘149’, ‘15’, ‘150’, ‘1502’, ‘151’, ‘1520’, ‘153’, ‘154’, ‘1566’, ‘157’, ‘159’, ‘16’, ‘160’, ‘162’, ‘16241’, ‘163’, ‘1649’, ‘166’, ‘167’, ‘17’, ‘170’, ‘173’, ‘17300’, ‘174’, ‘1750’, ‘1754’, ‘176’, ‘18’, ‘184’,  [... truncated] 

If I convert annotationData rownames to unique_id and delete unique_id column , the errors are the same.

I feel like this mismatch caused by Htseq-clip count. Htseq-clip count may not have row for samples have no count in this slide windowed annotation . So that caused size diff between annotation and count files.

Thanks