BrooksLabUCSC / flair

Full-Length Alternative Isoform analysis of RNA

Feature request: define group for diffExp contrast #260

Closed skchronicles closed 1 year ago

skchronicles commented 1 year ago

Hello there,

I hope you are having a great day, and that all is going well on your side! I just wanted to start this off by saying thank you for creating such an awesome, useful tool. Flair is pretty amazing!

I was about to run the diffExp module; however, I did not see a way to control/define the contrast. I see in the diffSplice module there are options to define conditionA and conditionB, where I assume the resulting contrast would be conditionA-conditionB. I took a peek at how you are currently doing this within the diffExp module, and it looks like it is dictated by the order the groups are parsed from the header of the quant module.

Would it be possible to allow the user to define the groups up front where you can pass this information to runDE to build a contrast? This would allow a user to control how the contrast is created, which is useful for interpreting the fold change later.
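A minimal sketch of what such an option could look like, in Python. It is not FLAIR code: `resolve_contrast`, `naive_log2_fold_change`, and the `condition_a`/`condition_b` parameters are hypothetical names chosen for illustration. The ratio below is a plain log2 of two means, not the estimate DESeq2 would report. It only shows that swapping the groups flips the sign of the fold change.

```python
import math


def resolve_contrast(groups_in_header_order, condition_a=None, condition_b=None):
    """Return (numerator, denominator) group names for the contrast.

    Hypothetical helper: falls back to header order when the user gives
    no explicit conditions, otherwise uses the order the user asked for.
    """
    # Deduplicate while keeping the order in which groups first appear.
    unique = list(dict.fromkeys(groups_in_header_order))
    if len(unique) != 2:
        raise ValueError(f"expected exactly 2 groups, found {len(unique)}: {unique}")
    if condition_a is None and condition_b is None:
        return unique[0], unique[1]
    if condition_a is None or condition_b is None:
        raise ValueError("condition_a and condition_b must be given together")
    if {condition_a, condition_b} != set(unique):
        raise ValueError(f"conditions must match the groups in the header: {unique}")
    return condition_a, condition_b


def naive_log2_fold_change(mean_numerator, mean_denominator):
    """Plain log2 ratio of two positive means (illustration only)."""
    return math.log2(mean_numerator / mean_denominator)


if __name__ == "__main__":
    header_groups = ["ctrl", "ctrl", "ctrl", "treated", "treated", "treated"]
    print(resolve_contrast(header_groups))                     # header order
    print(resolve_contrast(header_groups, "treated", "ctrl"))  # user-defined order
    print(naive_log2_fold_change(8.0, 2.0), naive_log2_fold_change(2.0, 8.0))
```

With an explicit pair of conditions, the direction of the reported fold change no longer depends on how the columns happen to be ordered in the quant output.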

I also have one last question. I see that you are enforcing that a given group has at least 3 replicates. Would it be possible to decrease this to two replicates? I know DESeq2 should work with 2 replicates within a given group. I know... I know, it's not ideal and in a perfect world you would plan experiments with more biological replicates; however, sometimes issues arise that prevent you from using one replicate (due to technical and/or quality-control reasons). Please feel free to shoot this last request down. Realistically, it may be better to enforce 3 replicates per group to enforce better experimental design.

Please let me know what you think and if this could be incorporated in a later release.

Best Regards, @skchronicles

callumparr commented 1 year ago

Removing the requirement for 3 replicates seems like something you could easily change yourself by editing the deFLAIR.py script and keeping the change on a local branch in your git repository.

skchronicles commented 1 year ago

Yeah, I was thinking about that, but I was also checking to see if you wanted to natively support 2 replicates per group moving forward. I understand if you don't think it is worthwhile. I am not sure if DRIMSeq enforces three replicates per group, but I know you can get away with 2 replicates per group with DESeq2/limma/edgeR.

callumparr commented 1 year ago

Sorry, to clarify: I have nothing to do with FLAIR. I just tried to use it previously and have been following updates, as I had issues getting it to work in several environments.

I have used DRIMSeq, and nothing in it prevents you from using only 2 biological replicates. The same goes for DEXSeq, DTUseq, stageR, IsoformSwitchAnalyzeR, etc. Of course, more replicates are better, but as you say, we have to work with what we have and can get within reason.

It seems you can simply remove the error check that prevents users from supplying only 2 replicates.

Jeltje commented 1 year ago

The math underlying most differential expression analysis software was developed to deal with very few samples, and three per group is simply the absolute minimum. Think of it this way: if the two samples disagree on something, which one is right? There's no way to tell. The programs may run with that kind of input, but their output will be near meaningless. In addition, these tools were created for short reads and expect much higher coverage than you get with nanopore or PacBio.

DiffExp is a convenience function that makes several assumptions about the input, such as the way the samples are named (sample underscore group underscore batch) and only allowing two groups. The code that dealt with multiple groups didn't work so I had to remove it. We are not currently planning on adding more groups, but we may be able to provide example R code with Flair.
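The assumptions described above (sample IDs of the form sample_group_batch, exactly two groups, a minimum number of replicates per group) can be sketched as a small validation step in Python. This is an illustrative sketch, not the actual diffExp code: `check_design` and its `min_replicates` parameter are hypothetical names, and the default of 3 simply mirrors the limit discussed in this thread.

```python
from collections import defaultdict


def check_design(sample_ids, min_replicates=3):
    """Group sample IDs by condition and validate the experimental design.

    Hypothetical helper. Assumes each ID is 'sample_group_batch', so the
    individual fields must not contain underscores themselves.
    """
    samples_by_group = defaultdict(list)
    for sample_id in sample_ids:
        parts = sample_id.split("_")
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected sample_group_batch, got {sample_id!r}")
        _sample, group, _batch = parts
        samples_by_group[group].append(sample_id)

    if len(samples_by_group) != 2:
        raise ValueError(
            f"expected exactly 2 groups, found {sorted(samples_by_group)}"
        )
    for group, members in samples_by_group.items():
        if len(members) < min_replicates:
            raise ValueError(
                f"group {group!r} has {len(members)} replicate(s), "
                f"need at least {min_replicates}"
            )
    return dict(samples_by_group)


if __name__ == "__main__":
    header = [
        "s1_ctrl_b1", "s2_ctrl_b1", "s3_ctrl_b2",
        "s4_kd_b1", "s5_kd_b1", "s6_kd_b2",
    ]
    for group, members in check_design(header).items():
        print(group, members)
```

Keeping the replicate threshold as a parameter would leave the stricter default in place while making the limit easy to find and adjust in a local copy.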

callumparr commented 1 year ago

While I fully agree that 2 is not ideal, you can still get reproducibility out of two replicates. In the context of biological replicates, you could easily have two values in disagreement: there is true variation in biology, separate from technical variation. I think this is accounted for in methods like DESeq2. The cost is reduced power, so you limit yourself to the highest fold changes. I do not think it's correct to say that a normalized log FC returned from two replicates is meaningless. https://support.bioconductor.org/p/106403/