BitSeq / BitSeq

BitSeq code
16 stars 8 forks source link

Support for gene-level estimation and DE, along with documentation #4

Open magnusrattray opened 9 years ago

messersc commented 9 years ago

I would like that, too.

Until then, could somebody sketch what an analysis at the gene level could look like? Any help would be much appreciated.

magnusrattray commented 9 years ago

If you are just estimating gene expression (stage 1) then you can sum up the transcript expression levels in FPKM or TPM units (not counts units - since length normalisation is important). In that case you can also just use our fast VB algorithm rather than MCMC is timing is an issue. If you are doing differential expression (stage 2) you have to do this sum for each MCMC sample in the original BitSeq.

messersc commented 9 years ago

To recap and as a reference for everybody else (correct me if I'm wrong), I will put this here.

For the moment, I am interested in computing logFC to compare to edgeR/limma etc and also qPCR. Differential gene expression might be on the list later, though. Further, I already have done the analysis for transcripts in R, writing (most of?) the files to my working directory (so no temp files). Further, I have compiled BitSeq for use with the command line.

Stage1: Just estimating gene expression, I can work with the results files from getExpression(). I already have transcript estimates from an earlier run (MCMC), but if I was to start from scratch the new VB method would be fine for computing means and logFC.

I have a per-replicate .mean file from my analysis, from reading the documentation I assume this is generated by getExpression() and is equal to the $exp bit of the resulting list. From here, I un-log my rpkm means, sum up the replicates for my 2 conditions and compute the logFC. Right?

Stage2: "Summing each MCMC sample in the original Bitseq" is what getGeneExpression() does, I assume? That gives me a new table with estimates on gene level (~30000 genes x 1000 samples). How do I proceed from there?

mqbssppe commented 9 years ago

Hi Clemens,

just to make sure: have you already used a *.tr file with proper genes-transcripts grouping and then called the getGeneExpression function or not?

Regarding you question 'How do I proceed from there?' I understand that you just want to compute the logFC on the gene-level and not to perform DE analysis. In this case, you could run the 'getVariance' function over the replicates of a given condition (each file should have ~30000 genes x 1000 samples as you described). This will produce a file <*.Lmean> containing the averaged expression (in the first column) across all replicates of the specific condition. Then you will have to do this for the replicates of the second condition. Then you can compute the logFC.

messersc commented 9 years ago

Hello Panagiotis,

yes, I have a trInfo DataFrame/File with proper groupings (queried biomaRt for missing gene names) and I also have succesfully called getGeneExpression(), which gives me the mentioned files (without any file extension, though. ~30000 genes x 1000 samples down from ~80000 transcripts). I will try 'getVariance', thanks for the advice.

Further questions that came up:

Thanks for your help!

petog commented 9 years ago

Hi, the L in .Lmean should stand for logged. So it should be the mean of log expression. .mean files will most likely be just the mean expressions for each replicate.

Yes, there is a mapping. I think you should be able to run all R methods with pretend=TRUE option, in such case, the only output will be the actual shell commands that can be used to get the same results. If you run the methods with same arguments as before, you should see exactly which command produced which files (and also what were the command line arguments).